
My understanding is that all tokens are passed to a transformer at once, and positional encodings help it understand their order in the sequence. The cosine type of positional encoding helps capture short-term and long-term dependencies between the tokens (due to the different frequencies of the sine/cosine functions).

First doubt: why doesn't word ordering implicitly emerge from the multi-head attention mechanism? My thought was that, after training a transformer on tons of data, the mechanism would eventually be able to look at an unordered set of words and infer some order. Another thing: isn't order implicit in the way the rows of the Q, K, V and embedding matrices are ordered, i.e. the 1st row is for the 1st token, the i-th row for the i-th token, etc.?

Or is it that, by using positional encodings, we're able to take some of that "learning load" off the attention mechanism, leaving it free to focus on inter-token dependencies? This seems like a very hand-wavy explanation, hence I thought I should clarify with you all.

Second doubt: even if positional encoding is needed, why can't we just use the normalized position of the token in the sequence? e.g. "My name is Bob" can become [0, 1, 2, 3], and this vector can be normalized to get the positional encoding. Multi-head self-attention is surely more than capable of capturing short-, medium- and long-term token dependencies. So why go to the effort of using cosine encodings?

user9343456

2 Answers


Your intuition is partly right: causal language models like the GPTs can learn sequence order implicitly (the causal mask breaks permutation invariance), and positional encodings (PE) then mainly reduce the learning burden further by injecting sequence order explicitly. However, the attention mechanism itself is permutation-invariant, so for bidirectional masked language models like BERT there is no implicit position learning without PE.

QKV attention is equivariant with respect to re-ordering the queries (rows of $\mathbf{Q}$), and invariant to re-ordering of the key-value pairs in $\mathbf{K}, \mathbf{V}$... The permutation invariance and equivariance properties of standard QKV attention do not hold for the masked variant.
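Here is a minimal sketch of that property (illustrative code, not from the quoted source; the shapes and weights are arbitrary). Permuting the input rows of an unmasked attention layer just permutes its output rows, so nothing in the output encodes where a token was; a causal mask breaks that symmetry:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
x = torch.randn(5, d)                               # 5 token embeddings, no positional encoding
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))  # the same projections are shared by every token

def attention(x, mask=None):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / d ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

perm = torch.tensor([4, 2, 0, 3, 1])                # an arbitrary re-ordering of the tokens

# Unmasked attention: permuting the inputs just permutes the outputs (equivariance)
print(torch.allclose(attention(x)[perm], attention(x[perm]), atol=1e-6))                  # True

# Causal (masked) attention: the property no longer holds
causal = torch.tril(torch.ones(5, 5))
print(torch.allclose(attention(x, causal)[perm], attention(x[perm], causal), atol=1e-6))  # False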

As for the disadvantage of your normalised PE, which is just a simple linear function of position: sine and cosine functions encode positions across a range of frequencies, allowing the model to capture both short-term and long-term periodic dependencies across all the dimensions of the PE. Some NLP tasks require understanding periodic or hierarchical, recursive dependencies, such as word repetitions in prose vs. poems. Finally, relative position is often more important than absolute position; sinusoidal PEs directly encode relative positional information, while a linear PE only provides absolute position information without further processing.
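To make that last point concrete, here is the standard identity behind it (a worked equation, not taken from the answer above). For each frequency $\omega_i = 10000^{-2i/d_{\text{model}}}$, the pair of sinusoidal dimensions at position $p+k$ is a rotation of the pair at position $p$, and the rotation matrix depends only on the offset $k$:

$$
\begin{pmatrix} \sin\big((p+k)\,\omega_i\big) \\ \cos\big((p+k)\,\omega_i\big) \end{pmatrix}
=
\begin{pmatrix} \cos(k\,\omega_i) & \sin(k\,\omega_i) \\ -\sin(k\,\omega_i) & \cos(k\,\omega_i) \end{pmatrix}
\begin{pmatrix} \sin(p\,\omega_i) \\ \cos(p\,\omega_i) \end{pmatrix}
$$

So a layer can learn to attend "k tokens back" with a single position-independent linear map, something a purely linear (normalised) position signal does not offer.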

cinch

First question:

  • In a transformer model, the QKV projection is applied token-wise, i.e. the same projection matrices are applied to each token (otherwise a different number of weights would be required for different-length sequences). Therefore, the projection itself encodes no information about token position.

  • If we disregard the sequence of tokens altogether, then this is called a bag-of-words approach. Whilst this works well for some simpler classifiers, it would not work well for next-token prediction, as word order can be very important.

For a basic example:

  • I would like 3 oranges and 4 bananas.
  • I would like 4 oranges and 3 bananas.

These would be treated as equivalent, even though we would like to be able to distinguish between the two.
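A minimal sketch of why an order-insensitive representation can't tell these apart (random embeddings and a toy whitespace tokenisation, purely for illustration, not from the answer):

import torch

torch.manual_seed(0)

# Hypothetical vocabulary with random embeddings
vocab = {w: i for i, w in enumerate("i would like 3 4 oranges and bananas .".split())}
emb = torch.nn.Embedding(len(vocab), 8)

def mean_pool(sentence):
    # Order-insensitive "bag of words" representation: mean of the token embeddings
    ids = torch.tensor([vocab[w] for w in sentence.lower().split()])
    return emb(ids).mean(dim=0)

a = mean_pool("I would like 3 oranges and 4 bananas .")
b = mean_pool("I would like 4 oranges and 3 bananas .")
print(torch.allclose(a, b, atol=1e-6))   # True: same multiset of tokens, same vector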

Second question:

  • A normalized token position has issues with variable-length sequences: the same normalised difference corresponds to different token offsets depending on the sequence length, so it would not be possible to learn local features.

For example:

  • I went to the river bank.
  • I am going swimming in the river. Later, I will go to the bank.

The two tokens river and bank have very different meanings depending on whether they are next to each other or separated by a few tokens.

With normalised token positions, tokens that are next to each other in a short input and tokens that are far away from each other in a long input would have the same positional encoding difference, and hence the same embedding dot product.
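A quick numeric sketch of that issue (hypothetical sequence lengths, not from the answer): with positions normalised to [0, 1], the gap between adjacent tokens shrinks as the sequence grows, so the same encoding difference can mean "neighbours" in one input and "far apart" in another:

import torch

def normalised_positions(seq_len):
    # Positions scaled to the range [0, 1], as proposed in the question
    return torch.arange(seq_len).float() / (seq_len - 1)

short_seq = normalised_positions(5)      # e.g. "My name is Bob ."
long_seq = normalised_positions(101)

print(short_seq[1] - short_seq[0])   # tensor(0.2500) -> adjacent tokens in a short sequence
print(long_seq[26] - long_seq[1])    # tensor(0.2500) -> 25 tokens apart in a long sequence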

Using cosine positional encodings, a constant divisor term is used for all sequence lengths. This means that the dot product between the encodings of token n and token n+k is the same for any sequence length, so the positional contribution to the attention weight is the same.

import torch

def get_positional_encoding(n, d_model):
    # Position of the token as a scalar tensor
    position = torch.tensor(n).float()

    # Frequencies (the constant divisor term) for each sin/cos pair of dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(torch.log(torch.tensor(10000.0)) / d_model))

    # Create empty positional encoding
    pe = torch.zeros(d_model)

    # Fill with alternating sin and cos
    pe[0::2] = torch.sin(position * div_term)
    pe[1::2] = torch.cos(position * div_term)

    return pe


>>> get_positional_encoding(105, 16).dot(get_positional_encoding(106, 16))
tensor(7.4852)
>>> get_positional_encoding(5, 16).dot(get_positional_encoding(6, 16))
tensor(7.4852)
>>> get_positional_encoding(5, 16).dot(get_positional_encoding(10, 16))
tensor(6.1370)

Jex__y