
Encoder:

Input: [A, B, C, D] (word embeddings)
Output: [C1, C2, C3, C4] (contextual representations)
The encoder processes the input sequence [A, B, C, D] and generates contextual representations [C1, C2, C3, C4]. The specific calculations involved in the encoder, such as self-attention and feed-forward layers, are not shown in this example.

Decoder:

Input: [C1, C2, C3, C4] (contextual representations)
Output: [A', B', C', D'] (word embeddings)
The decoder takes the contextual representations [C1, C2, C3, C4] as input and generates word embeddings [A', B', C', D'] for the output sequence. The decoder generates one word embedding at a time, conditioned on the previously generated embeddings and the contextual representations.

Let's illustrate the decoding process step by step:

Step 1:

Input: [C1] (contextual representation of the <start> token)
Output: [A'] (word embedding for the first output word)

Step 2:

Input: [A', C2] (contextual representation and word embedding generated so far)
Output: [B'] (word embedding for the second output word)

Step 3:

Input: [A', B', C3] (contextual representations and word embeddings generated so far)
Output: [C'] (word embedding for the third output word)

Step 4:

Input: [A', B', C', C4] (contextual representations and word embeddings generated so far)
Output: [D'] (word embedding for the fourth output word)

1 Answer

The encoder-decoder architecture is typically employed to work with two different sequences: the input sequence and the output sequence. Whether both components are needed depends on the task: for machine translation, the transformer encoder processes the text in the source language, and the decoder generates the corresponding text in the target language. A generative model such as GPT, by contrast, can consist of only the decoder component.

I will focus on encoder-decoder architectures.

In the original implementation, once the encoder obtains the contextual representations [C1, C2, C3, C4], these are considered fixed inputs for the decoder. In the simplest case, the decoder should produce a sequence of indices corresponding to words in the vocabulary as its output. Therefore, the target sequence should be a sequence of indices/words, not word embeddings.
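As a minimal illustration, with an assumed toy vocabulary (the words and indices below are made up, not taken from any real tokenizer), the target sequence is simply a list of vocabulary indices:

```python
# Toy vocabulary, chosen only for illustration.
vocab = {"<start>": 0, "<end>": 1, "the": 2, "cat": 3, "eats": 4}

# The decoder's target is a sequence of indices/words, not embeddings.
target_words = ["the", "cat", "eats", "<end>"]
target_indices = [vocab[w] for w in target_words]
print(target_indices)  # [2, 3, 4, 1]
```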

The decoder obtains two inputs at each step: the encoder output (fixed) and the partially generated sequence. I will use E(x) to refer to the word embedding of the word x. The step-by-step generation during inference proceeds as follows (a code sketch of the whole loop follows the steps):

Step 1:

Inputs:
[C1, C2, C3, C4] (encoder output)
[E(<start>)] (partial decoder output)

Outputs: ["the"]

Step 2:

Inputs:
[C1, C2, C3, C4] (encoder output)
[E(<start>), E(the)] (partial decoder output)

Outputs: [*, "cat"]

Step 3:

Inputs:
[C1, C2, C3, C4] (encoder output)
[E(<start>), E(the), E(cat)] (partial decoder output)

Outputs: [*, *, "eats"]

Step 4:

Inputs:
[C1, C2, C3, C4] (encoder output)
[E(<start>), E(the), E(cat), E(eats)] (partial decoder output)

Outputs: [*, *, *, "<end>"]
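
Below is a minimal PyTorch sketch of this loop. The dimensions, the untrained modules, and the random memory tensor standing in for [C1, C2, C3, C4] are illustrative assumptions, not the original implementation; with a trained model, memory would come from the encoder.

```python
import torch
import torch.nn as nn

# Illustrative toy setup (untrained); sizes and modules are assumptions.
d_model, vocab_size = 16, 5
vocab = {"<start>": 0, "<end>": 1, "the": 2, "cat": 3, "eats": 4}

embed = nn.Embedding(vocab_size, d_model)            # E(x)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=1,
)
to_logits = nn.Linear(d_model, vocab_size)           # decoder state -> vocabulary scores

memory = torch.randn(1, 4, d_model)                  # stands in for [C1, C2, C3, C4] (fixed)
generated = [vocab["<start>"]]                       # partial decoder output

for _ in range(10):                                  # step 1, step 2, ...
    tgt = embed(torch.tensor([generated]))           # [E(<start>), E(the), ...]
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
    out = decoder(tgt, memory, tgt_mask=tgt_mask)    # masked self-attention + cross-attention
    next_word = to_logits(out[0, -1]).argmax().item()  # greedy choice of the next word
    generated.append(next_word)
    if next_word == vocab["<end>"]:
        break
```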

At each step, the decoder can perform masked self-attention (to attend to the previous tokens of its partial output) and cross-attention over the entire encoder output sequence.
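
The mask used by masked self-attention is the usual causal mask over the partial output (the same one `generate_square_subsequent_mask` builds in the sketch above); for four generated tokens it looks like this:

```python
import torch

# Causal mask for 4 decoder positions: position i may attend only to positions <= i.
# The -inf entries are zeroed out by the attention softmax.
tgt_mask = torch.triu(torch.full((4, 4), float("-inf")), diagonal=1)
print(tgt_mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])

# Cross-attention uses no such mask: every decoder position can attend
# to the entire encoder output [C1, C2, C3, C4].
```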

The output of a single step isn't actually a word; instead, it is a probability distribution over the vocabulary. This distribution can be used to sample a word according to the probabilities, or we can simply choose the word corresponding to the argmax of the distribution. More advanced techniques such as beam search can also be used to generate better sequences. Once a word is chosen, it is appended to the partial decoder output and used in the subsequent step. The asterisks correspond to the decoder outputs at previous timesteps, which are no longer relevant at the current step.
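
For completeness, a small sketch of the argmax-vs-sampling choice at one step; the scores are made up, whereas in practice they would come from the decoder's final linear layer:

```python
import torch

# Made-up scores for the 5-word toy vocabulary at one decoding step.
logits = torch.tensor([0.1, 0.2, 2.5, 0.3, 0.4])

probs = torch.softmax(logits, dim=-1)                     # distribution over the vocabulary
greedy = probs.argmax().item()                            # pick the most likely word
sampled = torch.multinomial(probs, num_samples=1).item()  # sample according to the probabilities
```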