
In the "Attention Is All You Need" paper, on page 4, equation 1 describes the (scaled dot-product) attention mechanism of the Transformer architecture:

$$\text{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$$

Everything is fine up to here.
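
For concreteness, here is a minimal NumPy sketch of this equation (the toy shapes and helper names are my own, purely for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ V   # (n_queries, d_v)

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                     # toy sizes
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
print(attention(Q, K, V).shape)           # (4, 8)
```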

Then they introduce multi-head attention, which is described by the following equation:

$$\begin{aligned} \text{MultiHead}(Q, K, V) &= \text{Concat}\left(\text{head}_{1}, \ldots, \text{head}_{h}\right) W^{O} \\ \text{where head}_{i} &= \text{Attention}\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned}$$
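
Continuing the sketch above (this reuses `attention` and `rng` from it; again, all shapes are toy values I chose for illustration), multi-head attention is just per-head projections, per-head attention, a concatenation, and an output projection:

```python
def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: lists with one (d_model x d_k) projection matrix per head;
    # W_O: output projection of shape (h * d_k, d_model) here, since d_v = d_k.
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(len(W_Q))]
    return np.concatenate(heads, axis=-1) @ W_O

d_model, h = 16, 4
d_k = d_model // h                        # the paper sets d_k = d_v = d_model / h
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
X = rng.normal(size=(6, d_model))         # 6 tokens of dimension d_model
print(multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O).shape)  # (6, 16)
```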

Once multi-head attention is motivated at the end of page 4, they state that for a single head (the $i$th head), the query $Q$ and key $K$ inputs are first linearly projected by $W_i^Q$ and $W_i^K$, and only then is the dot product computed. Let's write the projections as $Q_i^p = Q W_i^Q$ and $K_i^p = K W_i^K$.

Therefore, by simple linear algebra, the dot product of the projected query and key becomes

$$Q_i^p {K_i^p}^T = Q W_i^Q {W_i^K}^T K^T = Q W_i K^T,$$

where

$$W_i = W_i^Q {W_i^K}^T$$
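
This identity is easy to check numerically with random matrices (a quick NumPy sanity check; the shapes are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 16, 4
Q   = rng.normal(size=(n, d_model))
K   = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(d_model, d_k))     # W_i^Q
W_K = rng.normal(size=(d_model, d_k))     # W_i^K

lhs = (Q @ W_Q) @ (K @ W_K).T             # projected query times transposed projected key
W   = W_Q @ W_K.T                         # combined (d_model x d_model) matrix W_i
rhs = Q @ W @ K.T
print(np.allclose(lhs, rhs))              # True
```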

Here, $W_i$ is the product of the query projection matrix and the transposed key projection matrix, and it is a matrix of shape $d_{model} \times d_{model}$. Why did the authors not define a single $W_i$ directly, instead of the pair $W_i^Q$ and $W_i^K$, which together have $2 \times d_{model} \times d_{k}$ elements? In deep learning applications, I think keeping the two matrices separate would be very inefficient.

Is there something I am missing, i.e., some reason why the two matrices $W_i^Q$ and $W_i^K$ have to be kept separate?

2 Answers


I'll use the notation from the paper you cited; other readers should refer to the paper (it is widely available) for the definitions. The utility of using $W_i^Q$ and $W_i^K$, rather than a single $W_i$, is that they add fewer parameters to the architecture. $W_i$ has dimension $d_{model} \times d_{model}$, so it would add $d_{model}^2$ parameters per head. $W_i^Q$ and $W_i^K$ each have dimension $d_{model} \times d_k$, with $d_k = \frac{d_{model}}{h}$, so the pair adds only $2\frac{d_{model}^2}{h}$ parameters per head, while their product (with the transpose) still has the correct dimensions for matrix multiplication with $Q$ and $K$.

We do use $h$ attention heads, which then brings our number of parameters back up, but the multiple heads let the model attend to different pieces of information in our data.
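
To put numbers on this, take the paper's base configuration ($d_{model} = 512$, $h = 8$, so $d_k = 64$):

```python
d_model, h = 512, 8                      # base configuration from the paper
d_k = d_model // h                       # 64

pair_per_head = 2 * d_model * d_k        # W_i^Q and W_i^K together: 65,536
full_per_head = d_model * d_model        # a single d_model x d_model W_i: 262,144

print(h * pair_per_head)                 # 524,288 parameters over all heads
print(h * full_per_head)                 # 2,097,152 parameters over all heads
```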

BioBroo

In practice, the matrices $W^Q, W^K, W^V$ (each of size $d_{model} \times d_{model}$) are dropped entirely, and Transformer implementations just learn a single set of matrices $\{ W_i^{Q*}, W_i^{K*}, W_i^{V*} \}$ (each of size $d_{model} \times \frac{d_{model}}{h}$) for each head, where

$$\begin{aligned} W_i^{Q*} &= W^Q W_i^Q \\ W_i^{K*} &= W^K W_i^K \\ W_i^{V*} &= W^V W_i^V \end{aligned}$$

so that:

$$\begin{aligned} Q_i(x) &= x W_i^{Q*} = x W^Q W_i^{Q} = Q W_i^Q \\ K_i(x) &= x W_i^{K*} = x W^K W_i^{K} = K W_i^K \\ V_i(x) &= x W_i^{V*} = x W^V W_i^{V} = V W_i^V \\ \text{head}_i(x) &= \operatorname{softmax}\left(\frac{Q_i(x) K_i(x)^T}{\sqrt{d_k}}\right) V_i(x). \end{aligned}$$
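
As a shape-only NumPy sketch of this scheme (my own illustration, not the actual Tensor2Tensor or BERT code): the per-head fused matrices $W_i^{Q*}$ can be stored side by side as one $d_{model} \times d_{model}$ block, so a single matrix product projects the input for every head at once, and each head's projection is just a column slice of the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
d_k = d_model // h

# All h fused matrices W_i^{Q*} (each d_model x d_k) stacked column-wise into one
# d_model x (h * d_k) = d_model x d_model block.
W_Q_star = rng.normal(size=(d_model, h * d_k))

x = rng.normal(size=(n, d_model))         # token representations
q_all = x @ W_Q_star                      # (n, h * d_k), one matmul for all heads
q_heads = q_all.reshape(n, h, d_k)        # head i lives in q_heads[:, i, :]

# Slicing the stacked matrix recovers the per-head projection x @ W_i^{Q*}.
print(np.allclose(q_heads[:, 0, :], x @ W_Q_star[:, :d_k]))   # True
```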

I can confirm this by inspecting the original Transformer implementation in Tensor2Tensor, and also the BERT code, which uses the encoder part of the Transformer.

user3667125