
Why don't people use nonlinear activation functions after projecting the query key value in attention?

It seems like doing this would add much-needed nonlinearity; otherwise, we're just doing linear transformations.

This observation applies to the transformer, additive attention, etc.

nbro
user3180

2 Answers


It seems like doing this would lead to much-needed nonlinearity, otherwise, we're just doing linear transformations.

Attention is broadly defined as the following operation (the $\text{softmax}$ is sometimes replaced by $\tanh$):

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

where $Q$, $K$ and $V$ are matrices that are some functions of the inputs. There are three nonlinear operations here:

  1. The product $QK^T$ is nonlinear: it multiplies two functions of the inputs. For example, in the case of self-attention, $Q = XW_Q$ and $K = XW_K$ are two linear transforms of the same $X$, so $QK^T = X \left(W_Q W_K^T\right) X^T$ is a quadratic function of the inputs.
  2. The $\text{softmax}(x_i) = e^{x_i} / \sum_n e^{x_n}$ function is obviously nonlinear ($\tanh$ as well).
  3. The final $\text{softmax}(\dots)\,V$ product is also nonlinear, for the same reason as (1).

So it is pretty clear that attention is definitely not just a linear transformation: there are quite a few nonlinearities in the attention block.
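As a quick illustration, here is a minimal NumPy sketch (the weight matrices and dimensions are made up for the example) checking that self-attention is not a linear map of its input $X$: scaling $X$ by 2 does not scale the output by 2.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Q, K, V are linear projections of the same input X
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # quadratic in X (point 1)
    return softmax(scores) @ V                # softmax and the product with V add more (points 2 and 3)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # 4 tokens, model dimension 8
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))

# A linear map f would satisfy f(2X) == 2 f(X); attention does not.
print(np.allclose(self_attention(2 * X, W_Q, W_K, W_V),
                  2 * self_attention(X, W_Q, W_K, W_V)))  # False
```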


This observation applies to the transformer, additive attention, etc.

Let's see what happens next with the outputs of the attention layers:

In the transformer model, the outputs of the multi-head self-attention are fed into a feed-forward network inside each block:

[Cutout of Figure 1 from the Transformer paper: the attention sublayer feeding into the feed-forward sublayer]

"Feed-forward" means that the inputs are multiplied by a weight matrix and then a nonlinear activation function is applied.

The additive attention approach directly applies a $\tanh$ and another $\text{softmax}$ to the outputs of what one would call the attention block:

$$e_{ij} = v_a^T \tanh\left(W_as_{i-1} + U_a h_j\right)$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$$
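Here is a minimal NumPy sketch of these two equations (shapes and names are illustrative; $H$ stacks the encoder states $h_j$ as rows):

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j) for every encoder state h_j
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a         # shape: (num_positions,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                   # softmax over positions
    return alpha @ H                                       # context vector c_i

rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 6, 16, 16, 32                           # illustrative sizes
H = rng.normal(size=(n, d_h))                              # encoder hidden states h_j
s_prev = rng.normal(size=d_s)                              # previous decoder state s_{i-1}
W_a, U_a, v_a = (rng.normal(size=(d_a, d_s)),
                 rng.normal(size=(d_a, d_h)),
                 rng.normal(size=d_a))

print(additive_attention(s_prev, H, W_a, U_a, v_a).shape)  # (16,)
```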


To summarize: I don't think the premise of the question is correct. Nonlinearities are present inside the attention block itself, and further nonlinearities are typically applied after the attention is computed.

Kostya

Here are a couple of reasons why nonlinear activation functions aren't typically used after projecting the query, key, and value vectors in attention mechanisms like those found in transformers:

  1. Redundancy: The attention mechanism itself already introduces non-linearity through the softmax function. Softmax takes the attention scores (dot products of the projected queries and keys) and squashes them into probabilities, creating a non-linear relationship between the input and output (see the quick check after this list).

  2. Later layers handle non-linearity: The transformer architecture addresses the need for non-linearity in later stages. Following the multi-head attention layer, there's a fully-connected feed-forward network (MLP) with one or more hidden layers. These hidden layers typically use ReLU or similar non-linear activation functions, allowing the network to learn complex relationships between the attention outputs and the final prediction.
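As a quick numerical check of point (1): softmax is not a linear function of the scores, so doubling the scores does not double the attention weights (illustrative NumPy snippet with made-up values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

scores = np.array([1.0, 2.0, 4.0])   # example attention scores q . k_j

# Linearity would require softmax(2 * x) == 2 * softmax(x); it does not hold.
print(softmax(scores))        # approx. [0.042 0.114 0.844]
print(softmax(2 * scores))    # approx. [0.002 0.018 0.980], not 2 * softmax(scores)
```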