
I am reading the BERT paper, *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*.

While looking at the attention mechanism, I don't understand why each BERT encoder layer has an intermediate layer, sitting between the attention block and the output layer, whose output is larger ($4H$, where $H$ is the hidden size). I thought it might be related to the layer normalization, but, looking at the code, I'm not certain.
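
To make my question concrete, here is my rough, simplified reading of that part of the encoder as a small PyTorch sketch (the module names and the BERT-base sizes are my own, not the official implementation):

```python
import torch
from torch import nn

H = 768               # hidden size of BERT-base (my assumption for concreteness)
INTERMEDIATE = 4 * H  # 3072, the "bigger output" I am asking about

class FeedForward(nn.Module):
    """Simplified version of the block that follows self-attention in each encoder layer."""
    def __init__(self):
        super().__init__()
        self.intermediate = nn.Linear(H, INTERMEDIATE)  # expands H -> 4*H
        self.activation = nn.GELU()
        self.output = nn.Linear(INTERMEDIATE, H)        # projects 4*H back down to H
        self.layer_norm = nn.LayerNorm(H)

    def forward(self, attention_output):
        x = self.activation(self.intermediate(attention_output))
        x = self.output(x)
        # residual connection + layer norm, applied after the projection back to H
        return self.layer_norm(x + attention_output)

hidden_states = torch.randn(2, 16, H)   # (batch, sequence, hidden)
out = FeedForward()(hidden_states)      # shape stays (2, 16, H)
```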


1 Answer


The paper *Undivided Attention: Are Intermediate Layers Necessary for BERT?* should answer your question.

In the abstract, they write

All BERT-based architectures have a self-attention block followed by a block of intermediate layers as the basic building component. However, a strong justification for the inclusion of these intermediate layers remains missing in the literature.

In the conclusion, they write

In this work we proposed a modification to the BERT architecture focusing on reducing the number of intermediate layers in the network. With the modified BERT$_\text{BASE}$ network we show that the network complexity can be significantly decreased while preserving accuracy on fine-tuning tasks.
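
To give a sense of why these intermediate layers dominate the complexity, here is a back-of-the-envelope count using BERT-base numbers ($H = 768$, intermediate size $4H = 3072$, 12 layers); these are my own figures, not numbers from the paper:

$$\underbrace{2 \times H \times 4H}_{\text{intermediate + output weights}} = 2 \times 768 \times 3072 \approx 4.7\text{M parameters per layer} \;\Rightarrow\; \approx 56\text{M over 12 layers}.$$

That is roughly half of BERT-base's ~110M parameters, which is why shrinking or removing these intermediate layers can significantly reduce the network complexity the authors mention.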
