I am currently trying to understand transformers.

To start, I read Attention Is All You Need and also this tutorial.

What I'm wondering about is the word embedding used in the model. Is word2vec or GloVe used, or are the word embeddings trained from scratch?

In the tutorial linked above, the transformer is implemented from scratch and nn.Embedding from PyTorch is used for the embeddings. I looked up this function but didn't fully understand it, so I tend to think that the embeddings are trained from scratch. Is that right?

Bert Gayus

3 Answers

I found a good answer in the blog post The Transformer: Attention Is All You Need:

we learn a “word embedding” which is a smaller real-valued vector representation of the word that carries some information about the word. We can do this using nn.Embedding in Pytorch, or, more generally speaking, by multiplying our one-hot vector with a learned weight matrix W.

There are two options for dealing with the Pytorch nn.Embedding weight matrix. One option is to initialize it with pre-trained embeddings and keep it fixed, in which case it’s really just a lookup table. Another option is to initialize it randomly, or with pre-trained embeddings, but keep it trainable. In that case the word representations will get refined and modified throughout training because the weight matrix will get refined and modified throughout training.

The Transformer uses a random initialization of the weight matrix and refines these weights during training – i.e. it learns its own word embeddings.
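
To make this concrete, here is a minimal PyTorch sketch of the two options (the sizes are made up for illustration). The last lines also check that an embedding lookup is the same as multiplying a one-hot vector by the weight matrix W:

    import torch
    import torch.nn as nn

    vocab_size, d_model = 10, 4                        # toy sizes, for illustration only

    # Option 1: initialize with pre-trained vectors and keep them fixed (a pure lookup table).
    pretrained = torch.randn(vocab_size, d_model)      # stand-in for word2vec/GloVe vectors
    frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

    # Option 2: random initialization, trainable -- this is what the Transformer does.
    trainable_emb = nn.Embedding(vocab_size, d_model)  # weight has requires_grad=True

    # Looking up token id 3 equals multiplying its one-hot vector by the weight matrix W.
    token = torch.tensor([3])
    one_hot = nn.functional.one_hot(token, vocab_size).float()
    assert torch.allclose(trainable_emb(token), one_hot @ trainable_emb.weight)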

nbro
Bert Gayus

No, neither Word2Vec nor GloVe is used, as Transformers are a newer class of models. Word2Vec and GloVe are based on static word embeddings, while Transformers are based on dynamic (contextual) word embeddings.

The embeddings are trained from scratch.
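
As a minimal sketch of what "trained from scratch" means in practice (the model below is a toy, not the architecture from the paper): the embedding matrix is just another parameter of the model and receives gradients like every other weight.

    import torch
    import torch.nn as nn

    class TinyTransformer(nn.Module):
        def __init__(self, vocab_size=100, d_model=16):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)   # randomly initialized, trainable
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=1)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, token_ids):
            return self.out(self.encoder(self.embed(token_ids)))

    model = TinyTransformer()
    tokens = torch.randint(0, 100, (2, 5))        # dummy batch of token ids
    model(tokens).sum().backward()                # fake loss, just to produce gradients
    print(model.embed.weight.grad is not None)    # True: the embeddings are being learned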

Brian O'Donnell

As "initial" word embeddings (those without any positional or context information for each word or sub word) are used from the very beginning It seems to me that someone has to provide a trained embedding for each word at the very beginning.

Raul Alvarez