
As far as I understand, the attention mechanism in an LLM is used to mitigate the vanishing gradient problem.

When using activation functions like the sigmoid, deep neural networks can end up with gradients that are very close to zero (because backpropagation multiplies many derivative factors that lie in the near-zero tails of the sigmoid), which in a way reduces the usefulness of these models.

However, couldn't you simply use a different activation function like ReLU to avoid this issue?
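
Purely to illustrate the premise (this snippet is not from the original post): in backpropagation the gradient picks up one activation-derivative factor per layer. The sigmoid derivative never exceeds 0.25, so the product over many layers collapses toward zero, whereas ReLU contributes a factor of exactly 1 on every active unit. A minimal NumPy sketch with made-up depth and pre-activations:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                         # never exceeds 0.25

depth = 50
x = np.random.randn(depth)                       # one pre-activation per layer (made up)

sig_factor = np.prod(sigmoid_grad(x))            # product of sigmoid derivatives along the chain
relu_grad = (x > 0).astype(float)                # ReLU derivative: exactly 1 where active, 0 otherwise
relu_factor = np.prod(relu_grad[relu_grad > 0])  # active units contribute factors of exactly 1

print(f"sigmoid chain factor over {depth} layers: {sig_factor:.2e}")   # vanishingly small
print(f"ReLU chain factor over active units:      {relu_factor:.2e}")  # 1.00e+00
```

(ReLU has its own failure mode, dead units with zero gradient, but it does not shrink the gradient gradually the way the sigmoid tails do.)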

A. Darwin

1 Answer


The attention module can be seen in many ways. Indeed, you can see it simply as a way to construct a set of weights on the fly (in plain words, that is what it does): $$A = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right), \qquad \text{output} = A \cdot V$$ You can also see it as a way to aggregate a sequence of arbitrary length (a dense layer does not have this property, as it requires a fixed-size input).
Such mechanisms can be found in the literature as "hypernetworks" (you use one network to estimate the weights of another network).
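
To make the formula above concrete, here is a minimal NumPy sketch of the computation; the shapes, the softmax helper, and the random inputs are all made up for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))          # (n, n) attention weights, rows sum to 1
    return A @ V                               # (n, d_v) weighted aggregation of V

n, d, d_v = 5, 8, 8                            # sequence length and head dimensions (arbitrary)
Q, K, V = np.random.randn(n, d), np.random.randn(n, d), np.random.randn(n, d_v)
print(attention(Q, K, V).shape)                # (5, 8); nothing here depends on a fixed n
```

Note that nothing in `attention()` assumes a fixed sequence length, which is exactly the length-independence mentioned above.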

Nothing stops you from using a different activation function; however, you will lose the interpretation of that matrix as an attention weighting, i.e. a probability distribution over the sequence.

What you gain instead is scalability. The $O(n^2)$ complexity comes from the row-wise normalization of the softmax, which forces you to compute the full $n \times n$ attention matrix. If you use another activation, such as ReLU, or even a linear map, you can go back to the beloved $O(n)$ complexity (sketched below).
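
To spell that out with a sketch (under the assumption of an elementwise feature map such as ReLU, and ignoring the extra normalization that practical linear-attention variants add): without the row-wise softmax, matrix multiplication can be re-associated so the $n \times n$ matrix never needs to be formed.

```python
import numpy as np

n, d = 1000, 64                                # made-up sequence length and head dimension
Q, K, V = np.random.randn(n, d), np.random.randn(n, d), np.random.randn(n, d)

# With softmax, the (n, n) matrix must be formed row by row: O(n^2) cost.
# Without it, matrix multiplication is associative:
#   (phi(Q) phi(K)^T) V  ==  phi(Q) (phi(K)^T V)
# and the right-hand side costs O(n d^2), i.e. linear in the sequence length n.
phi = lambda x: np.maximum(x, 0.0)             # elementwise ReLU feature map (illustrative choice)
left  = (phi(Q) @ phi(K).T) @ V                # O(n^2 d): materializes an (n, n) matrix
right = phi(Q) @ (phi(K).T @ V)                # O(n d^2): only a (d, d) intermediate
print(np.allclose(left, right))                # True: same result, very different cost
```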

Alberto