We introduce transformers by saying that they take a set of tokens and transform them into a richer vector representation.
The input to a transformer is a set of $N$ vectors $\{x_n\}$, each of dimensionality $D$ (i.e. $x_n \in \mathbb{R}^D$).
We can then collect this set of vectors that the transformer will process into a matrix $X \in \mathbb{R}^{N \times D}$, whose $n$-th row is $x_n^T$:
$$ X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} $$
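For concreteness, here is a tiny sketch (with made-up numbers) of stacking the token vectors into such a matrix:

```python
import numpy as np

# Three made-up token vectors x_1, x_2, x_3, each of dimensionality D = 4.
x_1 = np.array([0.1, -0.3, 0.8, 0.0])
x_2 = np.array([0.5,  0.2, -0.1, 0.7])
x_3 = np.array([-0.4, 0.9, 0.3, 0.2])

X = np.stack([x_1, x_2, x_3])   # shape (N, D) = (3, 4); row n is x_n^T
```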

For now, we simply state that a transformer layer takes such a data matrix $X$ and transforms it into a new matrix $\tilde{X}$ of the same shape:
$$ \tilde{X} = \text{TransformerLayer}[X] $$
By stacking these layers, we get a deep network.
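As a rough sketch of this stacking (treating the layer itself as a black box for now; `transformer_layer` and `num_layers` are hypothetical names, not from any particular library):

```python
import numpy as np

def transformer_layer(X: np.ndarray) -> np.ndarray:
    # Placeholder for a single transformer layer: maps an (N, D) matrix
    # to another (N, D) matrix. A real layer would apply self-attention
    # (as derived below) followed by further processing.
    return X  # identity stand-in

def transformer(X: np.ndarray, num_layers: int = 6) -> np.ndarray:
    # A deep transformer is just these layers applied in sequence.
    for _ in range(num_layers):
        X = transformer_layer(X)
    return X
```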
Ok, consider now that we want a set of output vectors $\{y_n\}$ for our input set $\{x_n\}$, with one output per input, so that both sets contain $N$ vectors of dimensionality $D$.
Now, we demand that each output vector can depend on every input vector to a different degree. One way to do this is to define each $y_n$ as a linear combination of the input vectors, with weights that we call attention coefficients; these must be non-negative and sum to 1:
$$ y_n = \sum_{i = 1}^{N} a_{ni}x_i $$
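As a minimal numerical sketch (the coefficients here are picked by hand and not yet input-dependent; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))        # N = 4 input vectors of dimensionality D = 3, as rows

# Hand-picked attention coefficients for one output y_n:
# non-negative and summing to 1.
a_n = np.array([0.1, 0.2, 0.6, 0.1])
assert np.isclose(a_n.sum(), 1.0) and np.all(a_n >= 0)

y_n = a_n @ X                      # y_n = sum_i a_{ni} x_i
```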
Now the question is: how can we set the attention coefficients such that they depend on the input itself? Why would we want this? If they were fixed, our computation would create outputs in which the importance of each input depends only on its magnitude, and not on the particular composition of the input set.
One way to force a weighting that is input-dependent is self-attention. Here, we simply say that the attention coefficient $a_{ni}$ should correspond to the similarity between the vector for which we want an output ($x_n$) and the vector that is weighted ($x_i$). An easy way to measure this similarity is the dot product $x_n^T x_i$. Since our coefficients should be non-negative and sum to 1, we exponentiate the dot products and normalise, which gives:
$$
a_{ni} = \frac{\exp(x^T_nx_i)}{\sum^N_{m = 1} \exp(x^T_nx_{m})}
$$
where the denominator sums the exponentiated dot products for a particular output $y_n$, ensuring that its attention coefficients sum to 1.
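A small self-contained sketch of this computation for a single output index $n$ (the data and the choice of $n$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                   # N = 4 input vectors of dimensionality D = 3

n = 2                                         # which output y_n we compute
scores = X @ X[n]                             # dot products x_n^T x_i for i = 1..N
a_n = np.exp(scores) / np.exp(scores).sum()   # normalised attention coefficients

assert np.isclose(a_n.sum(), 1.0) and np.all(a_n >= 0)
y_n = a_n @ X                                 # y_n = sum_i a_{ni} x_i
```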
In matrix notation, the dot products are collected in $XX^T$, and the normalisation operation described above is exactly the softmax, applied to each row. Thus:
$$ Y = \text{Softmax}[XX^T]X $$
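Putting everything together, here is a sketch of this matrix form in NumPy (subtracting the row maximum inside the softmax is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(S: np.ndarray) -> np.ndarray:
    # Row-wise softmax.
    S = S - S.max(axis=-1, keepdims=True)     # for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X: np.ndarray) -> np.ndarray:
    # Y = Softmax[X X^T] X: each row of Y is a convex combination of the rows of X.
    A = softmax(X @ X.T)                      # attention coefficients, shape (N, N)
    return A @ X                              # outputs, shape (N, D)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Y = self_attention(X)

# Each row of Y matches the per-output formula y_n = sum_i a_{ni} x_i.
n = 2
a_n = np.exp(X @ X[n]) / np.exp(X @ X[n]).sum()
assert np.allclose(Y[n], a_n @ X)
```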