I have a difficult time understanding the "multi-head" notion in the original transformer paper. What makes the learning in each head unique? Why doesn't the neural network learn the same set of parameters for each attention head? Is it because we break query, key and value vectors into smaller dimensions and feed each portion to a different head?
2 Answers
Each head is different because it learns its own set of weight matrices $\{ W_i^Q, W_i^K, W_i^V \}$, where $i$ is the index of the head. To be clear, the input to each attention head is the same. For attention head $i$:
\begin{align} Q_i(x) &= x W_i^Q \\ K_i(x) &= x W_i^K \\ V_i(x) &= x W_i^V \\ \text{attention}_i(x) &= \text{softmax} \left(\frac{Q_i(x) K_i(x)^T}{\sqrt{d_k}} \right) V_i(x). \end{align}
Notice that the input to every head is the same $x$ (either the semantic + positional embedding of the decoder input for the first decoder layer, or the output of the previous decoder layer).
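As a concrete illustration of the equations above, here is a minimal NumPy sketch of multi-head attention. It is a sketch under my own assumptions: it omits masking and the final output projection $W^O$, and the toy sizes and variable names are illustrative, not from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, heads):
    """x: (seq_len, d_model); heads: list of (W_q, W_k, W_v) tuples, one per head."""
    outputs = []
    for W_q, W_k, W_v in heads:                  # every head receives the SAME x
        Q, K, V = x @ W_q, x @ W_k, x @ W_v      # each (seq_len, d_k)
        d_k = Q.shape[-1]
        attn = softmax(Q @ K.T / np.sqrt(d_k))   # (seq_len, seq_len)
        outputs.append(attn @ V)                 # (seq_len, d_k)
    # The real model concatenates the heads and applies an output projection W^O;
    # the projection is left out here to keep the sketch short.
    return np.concatenate(outputs, axis=-1)      # (seq_len, n_heads * d_k)

# Toy sizes: d_model = 8, two heads with d_k = 4.
rng = np.random.default_rng(0)
d_model, d_k, seq_len, n_heads = 8, 4, 5, 2
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
x = rng.normal(size=(seq_len, d_model))          # the same input is fed to both heads
print(multi_head_attention(x, heads).shape)      # (5, 8)
```

The point to notice is the same one made above: every head consumes the exact same $x$; only the per-head weight matrices $W_i^Q, W_i^K, W_i^V$ differ.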
The question of why gradient descent learns a different set of weight matrices $\{ W_i^Q, W_i^K, W_i^V \}$ for each attention head is very similar to "Is there anything that ensures that convolutional filters don't end up the same?", so you may find the answer there helpful:
No, nothing really prevents the weights from being different. In practice, though, they almost always end up different, because that makes the model more expressive (i.e. more powerful), so gradient descent learns to do that. If a model has $n$ features but two of them are the same, then the model effectively has $n-1$ features, which is less expressive than a model with $n$ distinct features and therefore usually ends up with a larger loss.
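To make the symmetry-breaking point concrete: if two heads started from identical weights (and the rest of the network treated them symmetrically), they would compute identical outputs, receive identical gradients, and never diverge; independent random initialization is what gives gradient descent the chance to push them toward different functions. A toy NumPy check (my own sketch, not from the quoted answer):

```python
import numpy as np

def head_output(x, W_q, W_k, W_v):
    """Scaled dot-product attention for a single head."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # row-wise softmax
    return w @ V

rng = np.random.default_rng(42)
x = rng.normal(size=(5, 8))                      # the same input goes to every head

# Two heads that share one initialization compute identical functions, so
# (if the rest of the network also treated them symmetrically) their gradients
# would match and gradient descent could never pull them apart.
W_shared = [rng.normal(size=(8, 4)) for _ in range(3)]
print(np.allclose(head_output(x, *W_shared), head_output(x, *W_shared)))  # True

# Independently initialized heads already differ at step 0, so they are free
# to specialize as training proceeds.
W1 = [rng.normal(size=(8, 4)) for _ in range(3)]
W2 = [rng.normal(size=(8, 4)) for _ in range(3)]
print(np.allclose(head_output(x, *W1), head_output(x, *W2)))              # False
```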
Multiple attention heads in a single layer of a transformer are analogous to multiple kernels in a single layer of a CNN: they have the same architecture and operate on the same feature space, but since they are separate 'copies' with different sets of weights, they are free to learn different functions.
In a CNN this may correspond to different definitions of visual features, and in a Transformer this may correspond to different definitions of relevance:
For example:
| Architecture | Input | (Layer 1) Kernel/Head 1 | (Layer 1) Kernel/Head 2 |
|---|---|---|---|
| CNN | Image | Diagonal edge-detection | Horizontal edge-detection |
| Transformer | Sentence | Attends to next word | Attends from verbs to their direct objects |
Notes:
There is no guarantee that these are human interpretable, but in many popular architectures they do map accurately onto linguistic concepts:
While no single head performs well at many relations, we find that particular heads correspond remarkably well to particular relations. For example, we find heads that find direct objects of verbs, determiners of nouns, objects of prepositions, and objects of possessive pronouns...
Multi-head attention was originally proposed as a way to mitigate the limited descriptive power that a single self-attention head has:
In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions [...] This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention...
- Attention is All You Need (2017)