The BERT paper only mentions that the position embeddings are learned, which is different from what was done in ELMo. What exactly are the segment and position embeddings in BERT, and how are they obtained?

ELMo paper - https://arxiv.org/pdf/1802.05365.pdf

BERT paper - https://arxiv.org/pdf/1810.04805.pdf


2 Answers


Sentences (for tasks such as NLI that take two sentences as input) are differentiated in two ways in BERT:

  • First, a [SEP] token is placed between them
  • Second, a learned embedding $E_A$ is added to every token of the first sentence, and another learned embedding $E_B$ to every token of the second one

That is, there are just two possible "segment embeddings": $E_A$ and $E_B$.

Positional embeddings are learned vectors, one for every possible position from 0 to 512 − 1. Unlike recurrent neural networks, Transformers do not process the input sequentially, so some information about the order of the tokens must be injected; if you leave it out, the output becomes permutation-invariant.
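
Below is a minimal PyTorch sketch (not the authors' code) of how the three embedding tables are combined into the input representation. The vocabulary size, hidden size, and maximum length follow BERT-base, but the token ids in the example are made up for illustration.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, max_len, num_segments = 30522, 768, 512, 2

token_emb    = nn.Embedding(vocab_size, hidden_size)    # one vector per wordpiece
segment_emb  = nn.Embedding(num_segments, hidden_size)  # just two rows: E_A and E_B
position_emb = nn.Embedding(max_len, hidden_size)       # one vector per position 0..511

# Illustrative ids for "[CLS] sentence A [SEP] sentence B [SEP]"
token_ids    = torch.tensor([[101, 2023, 2003, 102, 2008, 2205, 102]])
segment_ids  = torch.tensor([[0,   0,    0,    0,   1,    1,    1]])  # 0 = E_A, 1 = E_B
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)           # 0, 1, 2, ...

# The three embeddings are simply summed element-wise.
x = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(x.shape)  # torch.Size([1, 7, 768])
```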


These embeddings are nothing more than ordinary embedding vectors, just like token embeddings.

You simply initialize them randomly and then train them with gradient descent, exactly as you do with token embeddings.
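
A minimal PyTorch sketch of that idea, where the loss is just a dummy stand-in for a real training objective:

```python
import torch
import torch.nn as nn

# A learned position embedding is just a randomly initialised parameter
# matrix that the optimiser updates like any other weight.
position_emb = nn.Embedding(512, 768)                    # random init by default
optimizer = torch.optim.SGD(position_emb.parameters(), lr=0.01)

positions = torch.arange(8).unsqueeze(0)                 # positions 0..7
loss = position_emb(positions).sum()                     # dummy stand-in loss
loss.backward()
optimizer.step()                                         # gradients update the embedding table
```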
