Encoder:
Input: [A, B, C, D] (word embeddings)
Output: [C1, C2, C3, C4] (contextual representations)
The encoder processes the input sequence [A, B, C, D] and generates contextual representations [C1, C2, C3, C4]. The internal computations of the encoder, such as self-attention and the feed-forward layers, are omitted in this example.
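To make the "[A, B, C, D] in, [C1, C2, C3, C4] out" step concrete, here is a minimal single-head self-attention sketch in NumPy. The embedding dimension, the random (untrained) weight matrices, and the variable names are all illustrative assumptions, not part of the original example; the point is only the shapes: four input embeddings produce four contextual representations, each a weighted mix of all inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (illustrative choice)

# Toy word embeddings for the input sequence [A, B, C, D]
X = rng.normal(size=(4, d))

# One self-attention head with random, untrained weights (shapes only)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)                     # (4, 4) attention scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys

C = weights @ V  # contextual representations [C1, C2, C3, C4], shape (4, d)
print(C.shape)
```

Each row of `C` corresponds to one input position but depends on every input embedding, which is what makes the representations "contextual".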
Decoder:
Input: [C1, C2, C3, C4] (contextual representations)
Output: [A', B', C', D'] (word embeddings)
The decoder takes the contextual representations [C1, C2, C3, C4] as input and generates word embeddings [A', B', C', D'] for the output sequence. The decoder generates one word embedding at a time, conditioned on the previously generated embeddings and the contextual representations.
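The autoregressive loop described above can be sketched as follows. The `decoder_step` function is a hypothetical stand-in for a real decoder layer (its mixing rule is invented purely for illustration), and the zero vector for the <start> token is likewise an assumption; what the sketch shows is the control flow: one embedding is produced per step, conditioned on the encoder outputs and everything generated so far.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
C = rng.normal(size=(4, d))  # encoder outputs [C1..C4] (stand-in values)
start = np.zeros(d)          # embedding for the <start> token (assumption)

def decoder_step(prev_embeddings, context):
    """Hypothetical stand-in for a decoder layer: combines the
    encoder context with the most recently generated embedding."""
    query = prev_embeddings[-1]
    return context.mean(axis=0) + 0.1 * query  # invented mixing rule

generated = [start]
for _ in range(4):  # produce A', B', C', D' one at a time
    nxt = decoder_step(np.stack(generated), C)
    generated.append(nxt)

outputs = np.stack(generated[1:])  # [A', B', C', D']
print(outputs.shape)
```

Note that in a full Transformer the conditioning on `C` happens through cross-attention at every step, not through a fixed mixing rule.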
Let's illustrate the decoding process step by step:
Step 1:
Input: [<start>, C1] (the <start> token plus the first contextual representation)
Output: [A'] (word embedding for the first output word)
Step 2:
Input: [A', C2] (the word embedding generated so far and the next contextual representation)
Output: [B'] (word embedding for the second output word)
Step 3:
Input: [A', B', C3] (the word embeddings generated so far and the next contextual representation)
Output: [C'] (word embedding for the third output word)
Step 4:
Input: [A', B', C', C4] (the word embeddings generated so far and the next contextual representation)
Output: [D'] (word embedding for the fourth output word)
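The reason each step can only use the embeddings generated so far is the decoder's causal (look-ahead) mask. Here is a small sketch of that mask for the four positions above; the uniform scores are an assumption made so the masking effect is easy to read off, not trained attention values.

```python
import numpy as np

T = 4  # four decoding steps, as in the example above
mask = np.triu(np.ones((T, T)), k=1).astype(bool)  # True above the diagonal

scores = np.zeros((T, T))   # uniform raw scores (assumption, for clarity)
scores[mask] = -np.inf      # block attention to future positions

weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax per row
print(np.round(weights, 2))
```

Row i of `weights` is nonzero only for positions 0..i: step 1 can attend only to itself, while step 4 attends over all four positions, matching the step-by-step inputs listed above.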