
I was watching this video on implementing attention in a transformer. The author sets the query, key, and value projection biases to False and says, "Typically, people don't use biases for these."

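The pattern in question looks roughly like this (my own paraphrase of the idea, not the video's exact code; the class and variable names are made up):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    """Minimal sketch of one attention head with bias-free Q/K/V projections."""
    def __init__(self, embed_dim: int, head_dim: int):
        super().__init__()
        # bias=False on the query/key/value projections, as in the video.
        self.query = nn.Linear(embed_dim, head_dim, bias=False)
        self.key = nn.Linear(embed_dim, head_dim, bias=False)
        self.value = nn.Linear(embed_dim, head_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Scaled dot-product attention over the sequence dimension.
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(k.size(-1)), dim=-1)
        return attn @ v
```
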
Even in PyTorch's own nn.MultiheadAttention, the flag for key/value biases defaults to False:

add_bias_kv: If specified, adds bias to the key and value sequences at dim=0. Default: False.

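For reference, as far as I can tell the constructor defaults in recent PyTorch versions look like this, where bias is a separate flag for the input/output projections (the dimensions here are arbitrary):

```python
import torch.nn as nn

# Defaults: the input/output projections do carry a bias (bias=True),
# while add_bias_kv (the flag quoted above) defaults to False.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, bias=True, add_bias_kv=False)
```
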
What is the reason behind that?


1 Answer


For certain layers, such as the linear projections in transformers and convolutional layers, a bias term is largely redundant and only adds overhead to the model.

The reason is that these layers are typically followed by a normalization layer, such as Batch Normalization or Layer Normalization. The normalization re-centers the activations to mean 0 (and rescales them to standard deviation 1), so a constant shift added by the preceding layer is undone, and the normalization layer's own learnable shift can take over that role.

Therefore, it is common practice to omit the bias term in transformer projections and convolutional layers that are followed by a normalization layer.
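
A quick way to see this for the convolutional case, where the cancellation is exact: the conv bias adds a constant per output channel, and Batch Normalization's per-channel mean subtraction removes it again. A minimal sketch (shapes are made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
bn = nn.BatchNorm2d(8)                 # training mode: normalizes with batch stats
x = torch.randn(4, 3, 16, 16)

with torch.no_grad():
    with_bias = bn(conv(x))
    conv.bias.zero_()                  # drop the bias and run the same input again
    without_bias = bn(conv(x))

# The per-channel bias is constant over (batch, H, W), so the mean subtraction
# cancels it exactly and the normalized outputs match.
print(torch.allclose(with_bias, without_bias, atol=1e-5))  # True
```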
