
In PyTorch, transformer (BERT) models have an intermediate dense layer between the attention and output layers, whereas the BERT and Transformer papers only describe the attention output feeding directly into the encoder's fully connected output layer, right after the residual connection is added.

Why is there an intermediate layer within an encoder block?

For example,

encoder.layer.11.attention.self.query.weight
encoder.layer.11.attention.self.query.bias
encoder.layer.11.attention.self.key.weight
encoder.layer.11.attention.self.key.bias
encoder.layer.11.attention.self.value.weight
encoder.layer.11.attention.self.value.bias
encoder.layer.11.attention.output.dense.weight
encoder.layer.11.attention.output.dense.bias
encoder.layer.11.attention.output.LayerNorm.weight
encoder.layer.11.attention.output.LayerNorm.bias
encoder.layer.11.intermediate.dense.weight
encoder.layer.11.intermediate.dense.bias

encoder.layer.11.output.dense.weight
encoder.layer.11.output.dense.bias
encoder.layer.11.output.LayerNorm.weight
encoder.layer.11.output.LayerNorm.bias

I am confused by this third dense layer (the intermediate dense layer) sitting between the attention output dense layer and the encoder output dense layer.
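
For reference, the parameter names above can be listed with something like the following sketch (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, both of which are just illustrative choices):

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Print the parameter names (and shapes) of the last encoder block.
for name, param in model.named_parameters():
    if name.startswith("encoder.layer.11."):
        print(name, tuple(param.shape))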

3 Answers


The feedforward layer is an important part of the Transformer architecture.

In addition to the self-attention layer, which aggregates information from the whole sequence and transforms each token according to attention scores computed from the queries and keys, the Transformer architecture has a feedforward layer. This is usually a 2-layer MLP that processes each token separately: $$ y = W_2 \, \sigma(W_1 x + b_1) + b_2 $$

where $W_1, W_2$ are the weights, $b_1, b_2$ are the biases, and $\sigma$ is a nonlinearity (ReLU, GELU, etc.).


This is essentially a pointwise (per-token) nonlinear transformation of the sequence.

I suspect that intermediate here corresponds to $W_1, b_1$ and output corresponds to $W_2, b_2$.
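
A minimal PyTorch sketch of this two-layer MLP, with the intermediate/output naming chosen to mirror the parameter names in the question (the 768/3072 sizes are the BERT-base defaults, used here purely for illustration):

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.intermediate = nn.Linear(hidden_size, intermediate_size)  # W_1, b_1
        self.output = nn.Linear(intermediate_size, hidden_size)        # W_2, b_2
        self.act = nn.GELU()                                           # sigma

    def forward(self, x):
        # x: (batch, seq_len, hidden_size); the MLP is applied to each token independently
        return self.output(self.act(self.intermediate(x)))

y = FeedForward()(torch.randn(2, 16, 768))  # output shape: (2, 16, 768)

(The real BERT block additionally applies dropout, a residual connection and LayerNorm around this MLP; the sketch only shows the $W_1 / W_2$ part discussed here.)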


In the documentation of BERT on Hugging Face you will find that:

intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

So the intermediate layer is the feed-forward layer.
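
A quick way to confirm this (assuming the transformers library; the defaults below are those of BERT-base):

from transformers import BertConfig

config = BertConfig()            # default BERT-base configuration
print(config.hidden_size)        # 768
print(config.intermediate_size)  # 3072 -> width of the feed-forward layer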


So I came across this myself when trying to explore why models require intermediate layers in general. BERT is fundamentally different from the previous transformer architectures. The intermediate layer acts as a way to ensure the different multi-head attention outputs are able to utilize the pre-trained information. For example, if you examine the block directly, you will notice it upscales and then downscales the representation, in my experience something like 768 dimensions up to 3072 and back down to 768 (I am approximating the dimensions from memory). This does not correspond to the K/Q/V weights in the attention layer.

It is still an open research question why this scaling up and back down of the embedded information is needed to allow for fine-tuning of the models. When the intermediate layers are removed, there is a considerable drop in transfer-learning performance, so the pre-training experience is effectively lost. You will also notice that more recent models have this intermediate layer, in contrast to the original Transformer paper.
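
As a sketch of the up- and down-projection described above (again assuming the transformers library and bert-base-uncased, used only as an example):

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
block = model.encoder.layer[11]

# nn.Linear stores weights as (out_features, in_features)
print(block.intermediate.dense.weight.shape)  # torch.Size([3072, 768]) -> 768 up to 3072
print(block.output.dense.weight.shape)        # torch.Size([768, 3072]) -> 3072 back down to 768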