
I was watching this video on implementing attention in a transformer. The author sets the query, key, and value projection biases to False and says, "Typically, people don't use biases for these."

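The pattern in question looks roughly like this (my own paraphrase of the idea, not the video's exact code; the class and variable names are made up):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    """Minimal sketch of one attention head with bias-free Q/K/V projections."""
    def __init__(self, embed_dim: int, head_dim: int):
        super().__init__()
        # bias=False on the query/key/value projections, as in the video.
        self.query = nn.Linear(embed_dim, head_dim, bias=False)
        self.key = nn.Linear(embed_dim, head_dim, bias=False)
        self.value = nn.Linear(embed_dim, head_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Scaled dot-product attention over the sequence dimension.
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(k.size(-1)), dim=-1)
        return attn @ v
```
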
Even in PyTorch's own nn.MultiheadAttention, the flag for key/value biases defaults to False:

add_bias_kv: If specified, adds bias to the key and value sequences at dim=0. Default: False.

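For reference, as far as I can tell the constructor defaults in recent PyTorch versions look like this, where bias is a separate flag for the input/output projections (the dimensions here are arbitrary):

```python
import torch.nn as nn

# Defaults: the input/output projections do carry a bias (bias=True),
# while add_bias_kv (the flag quoted above) defaults to False.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, bias=True, add_bias_kv=False)
```
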
What is the reason behind that?


1 Answer


For certain layers, such as the linear projections in transformers and convolutional layers, a bias term is largely redundant and only adds overhead to the model.

The reason is that these layers are typically followed by a normalization layer, such as Batch Normalization or Layer Normalization. The normalization re-centers the activations to mean 0 (and rescales them to standard deviation 1), so a constant shift added by the preceding layer is undone, and the normalization layer's own learnable shift can take over that role.

Therefore, it is common practice to omit the bias term in transformer projections and convolutional layers that are followed by a normalization layer.
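
A quick way to see this for the convolutional case, where the cancellation is exact: the conv bias adds a constant per output channel, and Batch Normalization's per-channel mean subtraction removes it again. A minimal sketch (shapes are made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
bn = nn.BatchNorm2d(8)                 # training mode: normalizes with batch stats
x = torch.randn(4, 3, 16, 16)

with torch.no_grad():
    with_bias = bn(conv(x))
    conv.bias.zero_()                  # drop the bias and run the same input again
    without_bias = bn(conv(x))

# The per-channel bias is constant over (batch, H, W), so the mean subtraction
# cancels it exactly and the normalized outputs match.
print(torch.allclose(with_bias, without_bias, atol=1e-5))  # True
```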
