
Why don't people use nonlinear activation functions after projecting the query key value in attention?

It seems like doing this would add much-needed nonlinearity; otherwise, we're just doing linear transformations.

This observation applies to the transformer, additive attention, etc.

nbro
user3180

2 Answers


It seems like doing this would lead to much-needed nonlinearity, otherwise, we're just doing linear transformations.

Attention is broadly defined as the following operation (the $\text{softmax}$ is sometimes replaced by $\tanh$):

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

where $Q$, $K$ and $V$ are matrices that are some functions of the inputs. There are three nonlinear operations here:

  1. The product $QK^T$ is nonlinear: it multiplies two functions of the inputs. For example, in the case of self-attention, $Q = XW_Q$ and $K = XW_K$ are two linear transforms of the same $X$, so $QK^T = X \left(W_Q W_K^T\right) X^T$ is a quadratic function of the inputs.
  2. The $\text{softmax}(x_i) = e^{x_i} / \sum_n e^{x_n}$ function is obviously nonlinear ($\tanh$ as well).
  3. The final $\text{softmax}(\dots)\,V$ product is also nonlinear, for the same reason as (1).

So it is pretty clear that attention is definitely not just a linear transformation: there are quite a few nonlinearities in the attention block.
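As a quick illustration, here is a minimal NumPy sketch (the weight matrices and dimensions are made up for the example) checking that self-attention is not a linear map of its input $X$: scaling $X$ by 2 does not scale the output by 2.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Q, K, V are linear projections of the same input X
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # quadratic in X (point 1)
    return softmax(scores) @ V                # softmax and the product with V add more (points 2 and 3)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # 4 tokens, model dimension 8
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))

# A linear map f would satisfy f(2X) == 2 f(X); attention does not.
print(np.allclose(self_attention(2 * X, W_Q, W_K, W_V),
                  2 * self_attention(X, W_Q, W_K, W_V)))  # False
```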


This observation applies to the transformer, additive attention, etc.

Let's see what happens next with the outputs of the attention layers:

In the transformer model, the outputs of the multi-head self-attention are fed into a feed-forward network inside each block:

[Cutout of Figure 1 from the Transformer paper: the attention sublayer feeding into the feed-forward sublayer]

"Feed-forward" means that the inputs are multiplied by a weight matrix and then a nonlinear activation function is applied.

The additive attention approach directly applies a $\tanh$ and another $\text{softmax}$ to the outputs of what one would call the attention block:

$$e_{ij} = v_a^T \tanh\left(W_as_{i-1} + U_a h_j\right)$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$$
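Here is a minimal NumPy sketch of these two equations (shapes and names are illustrative; $H$ stacks the encoder states $h_j$ as rows):

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j) for every encoder state h_j
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a         # shape: (num_positions,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                   # softmax over positions
    return alpha @ H                                       # context vector c_i

rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 6, 16, 16, 32                           # illustrative sizes
H = rng.normal(size=(n, d_h))                              # encoder hidden states h_j
s_prev = rng.normal(size=d_s)                              # previous decoder state s_{i-1}
W_a, U_a, v_a = (rng.normal(size=(d_a, d_s)),
                 rng.normal(size=(d_a, d_h)),
                 rng.normal(size=d_a))

print(additive_attention(s_prev, H, W_a, U_a, v_a).shape)  # (16,)
```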


To summarize: I don't think the premise of the question is correct. Nonlinearities are present inside the attention block itself, and further nonlinearities are typically applied after the attention is computed.

Kostya

Here are a couple of reasons why nonlinear activation functions aren't typically used after projecting the query, key, and value vectors in attention mechanisms like those found in transformers:

  1. Redundancy: The attention mechanism itself already introduces non-linearity through the softmax function. Softmax takes the attention scores (dot products of the projected queries and keys) and squashes them into probabilities, creating a non-linear relationship between the input and output (see the quick check after this list).

  2. Later layers handle non-linearity: The transformer architecture addresses the need for non-linearity in later stages. Following the multi-head attention layer, there's a fully-connected feed-forward network (MLP) with one or more hidden layers. These hidden layers typically use ReLU or similar non-linear activation functions, allowing the network to learn complex relationships between the attention outputs and the final prediction.
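As a quick numerical check of point (1): softmax is not a linear function of the scores, so doubling the scores does not double the attention weights (illustrative NumPy snippet with made-up values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

scores = np.array([1.0, 2.0, 4.0])   # example attention scores q . k_j

# Linearity would require softmax(2 * x) == 2 * softmax(x); it does not hold.
print(softmax(scores))        # approx. [0.042 0.114 0.844]
print(softmax(2 * scores))    # approx. [0.002 0.018 0.980], not 2 * softmax(scores)
```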