
In section 3.2.1 of "Attention Is All You Need", the claim is made that:

Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

It is not clear to me why dot-product attention would be faster. Additive attention is nearly identical computation-wise; the main difference is computing $Q + K$ instead of $Q K^T$ as in dot-product attention. Since $Q K^T$ requires at least as many addition operations as $Q + K$, how can it possibly be faster?
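For concreteness, here is a rough NumPy sketch of the two score computations as I understand them (the shapes and variable names are made up for illustration):

```python
import numpy as np

n, m, d = 32, 32, 64          # number of queries, number of keys, model dimension (arbitrary)
Q = np.random.randn(n, d)     # projected queries
K = np.random.randn(m, d)     # projected keys
v = np.random.randn(d)        # additive attention's output projection vector

# Dot-product scores: a single (n, d) x (d, m) matrix multiplication.
scores_dot = Q @ K.T                                       # shape (n, m)

# Additive scores: broadcast Q + K over every pair, apply tanh, then project,
# materialising an (n, m, d) intermediate tensor along the way.
scores_add = np.tanh(Q[:, None, :] + K[None, :, :]) @ v    # shape (n, m)
```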

user3180

3 Answers

The additive attention method that the researchers are comparing against corresponds to a neural network with 3 layers; it is not actually a straight addition. Computing it involves multiplying the input vectors by one weight matrix, then by another, and only then computing something like a softmax. A smart implementation of the dot product, by contrast, does not need to break out a full matrix-multiplication routine for each pair; it is basically a tight, easily parallelized loop.
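As a rough back-of-the-envelope sketch (the sizes are arbitrary, and the additive network's hidden width is assumed equal to the key dimension), the arithmetic cost of the two is comparable, but the additive form has to materialise a large per-pair intermediate, whereas the dot-product form reduces to a single matmul over the score matrix:

```python
# Back-of-the-envelope multiply-add and memory counts for scoring
# n queries against m keys (illustrative sizes, not from either paper).
n, m, d, d_h = 1024, 1024, 512, 512   # d_h: hidden width of the additive network / key dim

# Additive: two input projections, then a tanh and an output projection
# applied to every (query, key) pair, materialising an (n, m, d_h) tensor.
additive_madds = (n + m) * d * d_h + n * m * d_h
additive_workspace = n * m * d_h                # per-pair hidden activations

# Dot-product: the same two input projections, then one (n, d_h) x (d_h, m) matmul.
dot_madds = (n + m) * d * d_h + n * m * d_h
dot_workspace = n * m                           # just the score matrix

print(additive_madds == dot_madds)              # True: similar theoretical complexity
print(additive_workspace // dot_workspace)      # 512: far more memory traffic for additive
```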

John Doucette

In additive attention (as described in the paper by Bahdanau et al. (2014)), the alignment model $a$ is represented by a feedforward neural network. If you look in their appendix, they actually implement this as

$$ e_{ij} = V_a^T \tanh \left(W_a s_{i-1} + U_a h_j\right) = V_a^T \tanh \left(Q + K\right).$$

In contrast, in dot-product attention (as described in the paper by Vaswani et al. (2017)), the alignment model is implemented as

$$ e_{ij} = W_Q s_{i-1} \left(W_K h_j\right)^T = Q K^T.$$

The computational advantage is that the dot-product alignment model has only two weight matrices and only needs matrix multiplication, for which highly-optimized code exists.
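A minimal NumPy sketch of the two alignment models exactly as written above, for a single pair $(s_{i-1}, h_j)$ (the dimensions are arbitrary):

```python
import numpy as np

d_s, d_h, d_a, d_k = 64, 64, 32, 32   # illustrative dimensions
s = np.random.randn(d_s)              # decoder state s_{i-1}
h = np.random.randn(d_h)              # encoder annotation h_j

# Additive (Bahdanau et al., 2014): three parameter tensors and a tanh.
W_a = np.random.randn(d_a, d_s)
U_a = np.random.randn(d_a, d_h)
v_a = np.random.randn(d_a)            # written V_a above
e_additive = v_a @ np.tanh(W_a @ s + U_a @ h)

# Dot-product (Vaswani et al., 2017): two projections and a single dot product.
W_Q = np.random.randn(d_k, d_s)
W_K = np.random.randn(d_k, d_h)
e_dot = (W_Q @ s) @ (W_K @ h)
```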


The two are similar in theoretical complexity.

But matrix multiplication is better optimized. According to this paper, non-matmul ops in a transformer are 15x slower than matmul operations.

Making Deep Learning Go Brrrr From First Principles discusses similar topics.
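As a rough illustration of that gap on a CPU (arbitrary sizes; the broadcasted add + tanh + reduction stands in for the additive score, and the exact ratio will vary by machine):

```python
import time
import numpy as np

n, d = 256, 256
Q = np.random.randn(n, d).astype(np.float32)
K = np.random.randn(n, d).astype(np.float32)

def bench(fn, iters=20):
    fn()                                   # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# One BLAS matmul call vs. broadcasting + tanh + reduction over an (n, n, d) tensor
# (the reduction stands in for the v_a projection of additive attention).
t_matmul = bench(lambda: Q @ K.T)
t_additive = bench(lambda: np.tanh(Q[:, None, :] + K[None, :, :]).sum(axis=-1))
print(f"matmul: {t_matmul*1e3:.2f} ms  |  broadcast+tanh: {t_additive*1e3:.2f} ms")
```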

Eric