
My question is why the attention head matrices $W^Q$, $W^K$, $W^V$ should not all be the same matrix $W = W^Q = W^K = W^V$. In my understanding of transformer-based language models, one attention head is responsible for one syntactic or semantic relation between any two words in the context. One might think that such a relation is represented by a single matrix $W$ that projects the full word embeddings $x_i$ from their full semantic space to a semantic subspace responsible for this relation. Here we could, in principle, calculate scores $\sigma_{ij}$ as "similarities" between two projected words $Wx_i$ and $Wx_j$ and then calculate the weighted sum of the projected tokens $Wx_k$.
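
For concreteness, here is a minimal numpy sketch of the single-matrix variant described above next to a standard three-matrix head; the dimensions and random weights are purely illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d_head = 5, 16, 4          # 5 tokens, model dim 16, head dim 4
X = rng.normal(size=(n, d_model))      # token embeddings x_1..x_n (as rows)

# --- proposed single-matrix variant: one shared projection W ---
W = rng.normal(size=(d_model, d_head))
P = X @ W                              # all tokens projected into one subspace
scores_shared = P @ P.T                # "similarities" between projected tokens
out_shared = softmax(scores_shared / np.sqrt(d_head)) @ P

# --- standard attention head: three separate projections ---
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T                       # general bilinear score between x_i and x_j
out = softmax(scores / np.sqrt(d_head)) @ V

# the shared-W scores are forced to be symmetric, the standard ones are not
print(np.allclose(scores_shared, scores_shared.T))  # True
print(np.allclose(scores, scores.T))                # False (generically)
```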

I wonder why this would not work, and why we need three different matrices.

Put the other way around: what does it mean to calculate the score as the dot product of two vectors from two different semantic subspaces? Is this still some kind of similarity (which lies at the heart of word embeddings)? And doesn't it sound like comparing apples and pears?
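
One way to phrase what such a "cross-subspace" dot product computes: $(W^Q x_i)^\top (W^K x_j) = x_i^\top (W^Q)^\top W^K\, x_j$, i.e. a learned, low-rank and generally asymmetric bilinear relation rather than a similarity within a single subspace. A small numerical check of that identity (shapes illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head = 16, 4
W_Q = rng.normal(size=(d_head, d_model))   # project column vectors: q = W_Q @ x
W_K = rng.normal(size=(d_head, d_model))
A = W_Q.T @ W_K                            # induced bilinear form, rank <= d_head

x_i, x_j = rng.normal(size=d_model), rng.normal(size=d_model)

score_ij = (W_Q @ x_i) @ (W_K @ x_j)       # the "apples vs. pears" dot product
print(np.isclose(score_ij, x_i @ A @ x_j))                # True: same bilinear form
print(np.isclose(score_ij, (W_Q @ x_j) @ (W_K @ x_i)))    # False: relation is not symmetric
```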

Or viewed differently: How similar are the three matrices of an attention head in practice, e.g. when considering some 100$\times$100 attention heads of a large transformer model like ChatGPT?
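
ChatGPT's weights are not public, so this cannot be checked on that model directly, but the same comparison can be run on an open model such as GPT-2 via the Hugging Face `transformers` library, which stores the three projections fused in `c_attn.weight`. A sketch of such a check (assuming `transformers` and `torch` are installed; layer/head selection is illustrative):

```python
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
d_model, n_heads = model.config.n_embd, model.config.n_head   # 768, 12
d_head = d_model // n_heads

for layer_idx, block in enumerate(model.h[:2]):               # first two layers as a sample
    W = block.attn.c_attn.weight.detach()                     # fused projection, (768, 2304)
    W_Q, W_K, W_V = W.split(d_model, dim=1)                   # (768, 768) each
    for h in range(2):                                        # first two heads as a sample
        q = W_Q[:, h * d_head:(h + 1) * d_head].flatten()
        k = W_K[:, h * d_head:(h + 1) * d_head].flatten()
        cos = torch.nn.functional.cosine_similarity(q, k, dim=0)
        print(f"layer {layer_idx} head {h}: cos(W_Q, W_K) = {cos.item():.3f}")
# A cosine similarity near zero would mean the trained query and key
# projections of that head are far from being the same matrix.
```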

Hans-Peter Stricker

2 Answers


Let me offer a possibly too naive (and too high-level) answer: since the weights of the word embedding layer (i.e. the word embeddings) are not trained in advance (via co-occurrence) and independently of the transformer weights, but in the same training runs, the word embeddings don't necessarily indicate semantic similarity under the dot product (as they do with Word2Vec and the like). So this shouldn't be expected either when choosing the same matrix $W = W^Q = W^K = W^V$ in each attention head: the dot product doesn't give scores to be interpreted as similarities, but as values of some (hard-to-interpret and only sometimes symmetric) relation.

Hans-Peter Stricker

Actually, a multi-head attention (MHA) layer has multiple "heads", where each head consists of three projection matrices: $H_i = (W_i^Q, W_i^K, W_i^V)$ for the $i$-th head. A typical number of heads is eight, for example. Each head is responsible for learning to attend to different parts of the sequence, so you can (partially) interpret transformer-based models by inspecting their attention masks.
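
A compact numpy sketch of this head structure, with per-head triples $(W_i^Q, W_i^K, W_i^V)$, eight heads, concatenation and an output projection; shapes and random weights are illustrative, and real implementations fuse the per-head matrices into single tensors:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 64, 8
d_head = d_model // n_heads
X = rng.normal(size=(n, d_model))              # one sequence of 6 token embeddings

heads, attn_masks = [], []
for i in range(n_heads):                       # H_i = (W_i^Q, W_i^K, W_i^V)
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(d_head))     # this head's attention pattern, (n, n)
    attn_masks.append(A)                       # inspect these to interpret the model
    heads.append(A @ V)                        # (n, d_head)

W_O = rng.normal(size=(d_model, d_model))      # output projection after concatenation
mha_out = np.concatenate(heads, axis=-1) @ W_O # (n, d_model)
print(mha_out.shape, attn_masks[0].shape)      # (6, 64) (6, 6)
```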

A word embedding $x_i$ is not fed directly to MHA. What I mean is that MHA accepts three matrices $Q$ (query), $K$ (key) and $V$ (value), which get projected by $W^Q$, $W^K$ and $W^V$, respectively, when computing the attention scores (i.e. the scaled dot-product attention). According to the paper (see section 3.2.1) and here, the key dimension $d_k$ and the value dimension $d_v$ can differ, simply because these can be different sequences of (word) embeddings. For example, in a question-answering model $Q$ can be the embeddings of the question, while $K$ and $V$ come from the expected answer. So learning different weight matrices $W^Q$, $W^K$, and $W^V$ allows the model to capture different relevant aspects of each input. Even when $Q = K = V$, i.e. in so-called self-attention, the matrices are still fundamental.

In general, consider that queries attend to keys to yield the attention probabilities (via dot product and softmax), which are used to weight the values. So they basically say which timestep (and thus which word embedding in the sequence) is relevant in the values (the representation learned by the projection). In other words, the general meaning is which words (in $V$) are relevant to attend to at time $t$: something that goes beyond a simple similarity (which may occur in the self-attention case).
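
To make the cross-attention case concrete, here is a minimal numpy sketch where the queries come from one sequence (e.g. a question) and the keys/values from another (e.g. the answer), with different lengths and a value dimension $d_v$ that may differ from $d_k$; all names and shapes are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
d_model, d_k, d_v = 32, 8, 8
n_q, n_kv = 4, 9                                  # e.g. 4 question tokens, 9 answer tokens

X_question = rng.normal(size=(n_q, d_model))      # embeddings the queries come from
X_answer   = rng.normal(size=(n_kv, d_model))     # embeddings the keys/values come from

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))             # queries and keys share d_k
W_V = rng.normal(size=(d_model, d_v))             # values may use a different d_v

Q, K, V = X_question @ W_Q, X_answer @ W_K, X_answer @ W_V
attn = softmax(Q @ K.T / np.sqrt(d_k))            # (n_q, n_kv): for each query position,
                                                  # which answer tokens are relevant
out = attn @ V                                    # weighted sum of value vectors, (n_q, d_v)
print(attn.shape, out.shape)                      # (4, 9) (4, 8)
```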

Luca Anzalone