10

The seminal Attention Is All You Need paper (Google Brain team, 2017) introduces Transformers and implements the attention mechanism with "queries, keys, values", in an analogy to a retrieval system.

I understand the whole process of multi-head attention and such (i.e., what is done with the $Q$, $K$, $V$ values and why), but I'm confused about how these values are computed in the first place. AFAICT, the paper seems to leave that out completely.

Both Figure 2 of the paper and the equations explaining attention and multi-head attention start with $Q$, $K$, $V$ already given:

[Image: the paper's attention and multi-head attention equations, with the definition of $head_i$]

The answers regarding the origin of $Q$, $K$, $V$ I've found so far haven't satisfied me:

  • In this similar question, the accepted answer says "The proposed multihead attention alone doesn't say much about how the queries, keys, and values are obtained, they can come from different sources depending on the application scenario.". If this is the case, then why isn't the computation of $Q$, $K$, $V$ made clearer in the paper, at the very least for the task of language translation, for which they show numerical results and so obviously did compute $Q$, $K$, $V$ in some way?

  • I also see some answers (e.g. this one on the same question) which say that $Q$, $K$, $V$ are the result of multiplying the input embedding with some matrices $W$. This is also what is shown in the popular blog post The Illustrated Transformer:

[Image: diagram from The Illustrated Transformer showing $Q$, $K$, $V$ obtained by multiplying the input embedding with $W^Q$, $W^K$, $W^V$]

These "projection" matrices ($W^Q$, $W^K$, $W^V$) do seem to appear in the definition of $head_i$ (see top figure), but according to that equation, these matrices are multiplied by $Q$, $K$, $V$ (which still appear out of thin air, so the problem of their definition remains), and so the resulting products can't also be $Q$, $K$, $V$.

How are the $Q$, $K$, $V$ values computed?

Soltius

2 Answers

6

(OP auto-answer) After digging further into this and reading more papers on attention, and with help from Chillston in the comments, I think I've narrowed it down to an issue of confusing notation. If anyone thinks this is not the right answer, please don't hesitate to submit another one, which I'll mark as correct if I think it's better.


$Q$, $K$ and $V$ values are defined in the paper, and they do come from multiplication with learnt projection matrices. Those matrices are $W^Q_i$, $W^K_i$ and $W^V_i$, defined in section 3.2.2 of the paper.

The confusion stems from the fact that the notation used in the multi-head attention equation and in Figure 2 (right) of the paper is wrong/confusing. The equation would be clearer if it read:

$$\mathrm{MultiHead}(X, Y) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O \quad \text{where} \quad \mathrm{head}_i = \mathrm{Attention}(YW^Q_i,\ XW^K_i,\ XW^V_i)$$

And Figure 2 (right) could be modified accordingly:

[Image: Figure 2 (right) redrawn with $X$ and $Y$ as the inputs to the linear projection layers]

In this new notation, $X$ and $Y$ are the inputs to the current attention unit.

  • For self-attention, we'd have $X = Y$, both being the previous encoder/decoder block's output (or the word embeddings, for the first encoder block).
  • For cross-attention, $X$ would be the output of the last encoder block and $Y$ the output of the previous decoder block.
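To make the data flow concrete, here is a minimal NumPy sketch of one attention head under this notation. All sizes are made up for illustration, and the $W$ matrices are random stand-ins for what would be learned projections; with $X = Y$ this is self-attention, with distinct $X$ and $Y$ it is cross-attention:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: d_model = 8, per-head dimension d_k = 4.
d_model, d_k = 8, 4
rng = np.random.default_rng(0)

X = rng.normal(size=(7, d_model))  # source of keys/values (7 tokens)
Y = rng.normal(size=(5, d_model))  # source of queries (5 tokens)

# Stand-ins for the learned projection matrices W^Q_i, W^K_i, W^V_i.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# The "true" queries/keys/values: inputs times projections.
Q = Y @ W_Q
K = X @ W_K
V = X @ W_V

head = softmax(Q @ K.T / np.sqrt(d_k)) @ V
print(head.shape)  # (5, 4): one output row per query token
```

Note that only the query side determines the output length: attention returns one mix of the values per query.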

Technically, the way it's written in the paper could be considered correct, but you need to realize that $Q$, $K$, $V$ refer to different tensors depending on where they appear:

  • in the multihead($Q,K,V$) equation, where they represent inputs, i.e. what they call $V$ is $X$ in my suggested re-writing;
  • in the attention($Q,K,V$) equation, where they represent the "true" queries/keys/values, meaning inputs multiplied by projection matrices, i.e. what they call $V$ is $XW^V_i$ in my suggested re-writing.
Soltius
1

As I understand it (and I'm not an AI researcher, so any helpful comments from folks who know the topic better will be illuminating), the output $\mathbf{X}^l$ of layer $l \in 1 \ldots L$ is

[Image: equation for $\mathbf{X}^l$, combining the multi-head attention term and the feed-forward term]

where $a \in 1 \ldots A$ is the head number, $f$ is some function like ReLU or whatever, and the $\mathbf{b}$s are biases ($M$ is the attention mask and $d_E$ is the size of the embedding). The first part corresponds to @Soltius's correction (and the second part is the FFN). (And $\underset{\mathsf{vocab}}{\mathsf{softmax}}\left(\mathbf{X}^L\mathbf{W}_E^{-1}\right)$ is what's used in calculating the cost.)
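A minimal NumPy sketch of that layer computation, under my stated assumptions (multi-head self-attention over $A$ heads concatenated and projected, then a ReLU feed-forward network with biases; I'm omitting the residual connections and layer normalization that the paper also includes, and all parameter names and sizes here are made up):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_layer(X, p):
    """One toy layer: multi-head self-attention, then a ReLU FFN.
    (No residuals or layer norm, unlike the actual paper.)"""
    d_k = p["Wq"][0].shape[1]
    heads = [
        softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d_k)) @ (X @ Wv)
        for Wq, Wk, Wv in zip(p["Wq"], p["Wk"], p["Wv"])
    ]
    attn = np.concatenate(heads, axis=-1) @ p["Wo"] + p["b_attn"]
    # Position-wise feed-forward network with ReLU (the "f" above).
    return np.maximum(attn @ p["W1"] + p["b1"], 0) @ p["W2"] + p["b2"]

# Toy sizes: T = 4 tokens, d_model = 8, A = 2 heads, d_k = 4, FFN width 16.
rng = np.random.default_rng(1)
T, d_model, A, d_k, d_ff = 4, 8, 2, 4, 16
p = {
    "Wq": [rng.normal(size=(d_model, d_k)) for _ in range(A)],
    "Wk": [rng.normal(size=(d_model, d_k)) for _ in range(A)],
    "Wv": [rng.normal(size=(d_model, d_k)) for _ in range(A)],
    "Wo": rng.normal(size=(A * d_k, d_model)),
    "b_attn": np.zeros(d_model),
    "W1": rng.normal(size=(d_model, d_ff)),
    "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)),
    "b2": np.zeros(d_model),
}
X = rng.normal(size=(T, d_model))
print(encoder_layer(X, p).shape)  # (4, 8): same shape in and out, so layers stack
```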

orome