12

I'm trying to implement a Transformer model using this tutorial. In the decoder block of the Transformer model, a mask is passed to "pad and mask future tokens in the input received by the decoder". This mask is added to the attention weights.

import tensorflow as tf

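def create_look_ahead_mask(size):
    # band_part(..., -1, 0) keeps the lower triangle (including the diagonal),
    # so 1 - that leaves 1s only at future positions (strict upper triangle).
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask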

Now my question is: how is this step (adding the mask to the attention weights) equivalent to revealing the words to the model one by one? I simply can't grasp the intuition behind its role. Most tutorials don't even mention this step, as if it were obvious. Please help me understand. Thanks.

Uchiha Madara

3 Answers

13

The Transformer model presented in this tutorial is an auto-regressive Transformer, which means that the prediction of the next token depends only on the tokens before it.

So in order to predict the next token, you have to make sure that only the previous tokens are attended to. (Otherwise the model would be cheating, because it would already know what comes next.)

So the attention mask looks like this, where 1 marks a position that is hidden and 0 a position that may be attended to:
[0, 1, 1, 1, 1]
[0, 0, 1, 1, 1]
[0, 0, 0, 1, 1]
[0, 0, 0, 0, 1]
[0, 0, 0, 0, 0]
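
The matrix above is what create_look_ahead_mask(5) from the question evaluates to. As a rough sketch in plain Python (a stand-in for the TensorFlow call so it runs without TensorFlow; the function name look_ahead_mask is made up for this example), entry (i, j) is 1 exactly when column j lies after row i:

```python
def look_ahead_mask(size):
    """Return a size x size list of lists: 1 where column j is after row i, else 0."""
    return [[1 if j > i else 0 for j in range(size)] for i in range(size)]


mask = look_ahead_mask(5)
for row in mask:
    print(row)
```

Row i therefore hides every position to its right and leaves positions 0..i visible.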

For example, if you are translating English to Spanish:
Input: How are you ?
Target: < start > Como estas ? < end >
Then the decoder will predict something like this:
< start > (this is given to the decoder as the initial token)
< start > Como
< start > Como estas
< start > Como estas ?
< start > Como estas ? < end >

Now compare this step-by-step sequence of predictions with the attention mask given above: row i of the mask leaves exactly the first i + 1 tokens visible. It should make sense now.
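
To make that comparison concrete, here is a small sketch (plain Python; the token strings simply mirror the listing above and the variable names are illustrative) that reads each mask row and prints the tokens it leaves visible:

```python
target = ["< start >", "Como", "estas", "?", "< end >"]
n = len(target)

# Same look-ahead pattern as the mask above: 1 = hidden, 0 = visible.
mask = [[1 if j > i else 0 for j in range(n)] for i in range(n)]

visible = []
for i, row in enumerate(mask):
    # Keep only the tokens whose mask entry is 0 in this row.
    seen = [tok for tok, m in zip(target, row) if m == 0]
    visible.append(seen)
    print(f"row {i}: {' '.join(seen)}")
```

Each printed row matches one line of the step-by-step listing, which is why a single masked pass over the whole target behaves like revealing the words one at a time.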

Leo
6

We feed the target sequence into the Transformer decoder while training the model, so it would be easy for the model to "peek ahead" and learn what the next word is. To ensure that this doesn't happen, we apply an additive mask after the dot product between Query and Key. The mask is a triangular matrix with 0s in the lower triangle and a very large negative number in the upper triangle: the original paper, "Attention Is All You Need", describes setting these entries to negative infinity, and implementations often use a finite stand-in such as -1e9. When the mask is added to the attention scores, the scores in the upper triangle become extremely low. When this matrix is passed through the softmax function, these values become (effectively) 0, which means that the word at timestep t does not attend to the words after it. To put this in matrix form,

[8.1, 0.04, 5.2, 4.2]
[0.5, 9.2, 2.33, 0.7]
[0.2, 0.4, 6.11, 1.0]
[3.1, 2.1, 2.19, 8.1]

Let the above matrix A be the result of the dot product between query and key. A[0][0] contains the attention score of the first query word against the first key word, A[0][1] contains the score of the first query word against the second key word, and so on. After adding the mask and applying softmax, every entry above the diagonal becomes 0. Showing the surviving entries with their original scores rather than the normalized softmax weights (each softmax row would actually sum to 1), the pattern is:

[8.1, 0.0, 0.0, 0.0]
[0.5, 9.2, 0.0, 0.0]
[0.2, 0.4, 6.11, 0.0]
[3.1, 2.1, 2.19, 8.1]

This forces each word to attend only to itself and the words before it. You can check out the Transformer lecture available in CS224n for full details.
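
For the actual softmax weights, here is a minimal sketch in plain Python using the matrix A from above. It assumes -1e9 as the finite stand-in for negative infinity and skips the scaling by the square root of the key dimension; the helper names are made up for illustration.

```python
import math

NEG = -1e9  # assumed large negative stand-in for negative infinity

A = [
    [8.1, 0.04, 5.2, 4.2],
    [0.5, 9.2, 2.33, 0.7],
    [0.2, 0.4, 6.11, 1.0],
    [3.1, 2.1, 2.19, 8.1],
]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


n = len(A)
# Additive mask: leave entries on or below the diagonal, push the rest to ~ -inf.
masked = [[A[i][j] + (NEG if j > i else 0.0) for j in range(n)] for i in range(n)]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 3) for w in row])
```

The first row comes out as [1.0, 0.0, 0.0, 0.0]: the first word can only attend to itself. Every row sums to 1, and all weight above the diagonal is 0.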

6

The mask is needed to prevent the decoder from "peeking ahead" at the ground truth during training, when it uses its Attention mechanism.

Encoder:

  • Both at runtime and during training:

    The encoder always runs in a single iteration, because it processes all embeddings separately, but in parallel. This helps us save time.


Decoder:

  • runtime:

    Here the decoder will run in several non-parallel iterations, generating one "output" embedding at each iteration. Its output can then be used as input at the next iteration.

  • training:

    Here the decoder can do all of it in a single iteration, because it simply receives the "ground truth" from us. Because we know these "truth" embeddings beforehand, they can be stored in a matrix as rows and submitted to the decoder to be processed separately, but in parallel.

    As you can see, during training the actual predictions of the decoder are not used to build up the target sequence (as an LSTM would do). Instead, what is essentially used here is a standard procedure called "teacher forcing".

    As others said, the mask is needed to prevent the decoder from "peeking ahead" at ground truth during training, when using its Attention mechanism.
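
The claim that one masked pass during training matches the iteration-by-iteration behaviour at runtime can be checked on a toy case. The sketch below is plain Python under loud assumptions: a single attention layer, no learned projections (queries, keys and values are all the raw embeddings), and made-up 2-dimensional "ground truth" embeddings. The function names are illustrative and not from the tutorial.

```python
import math


def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values, causal):
    """Scaled dot-product attention; with causal=True, query i ignores keys j > i."""
    d = len(keys[0])
    out = []
    for i, q in enumerate(queries):
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        out.append([sum(w[j] * values[j][c] for j in range(len(values)))
                    for c in range(d)])
    return out


# Toy "ground truth" embeddings, one row per target token (made-up numbers).
x = [[1.0, 0.0], [0.5, 1.0], [-1.0, 0.3], [0.2, -0.7]]

# Training style: the whole sequence in one masked pass.
parallel = attend(x, x, x, causal=True)

# Runtime style: grow the prefix one token at a time, keep the newest output.
stepwise = []
for t in range(1, len(x) + 1):
    prefix = x[:t]
    stepwise.append(attend(prefix, prefix, prefix, causal=False)[-1])

max_diff = max(abs(a - b) for p, s in zip(parallel, stepwise) for a, b in zip(p, s))
print(max_diff)
```

The difference is zero up to floating-point error: with the mask, row t of the parallel pass sees exactly what the decoder would have seen at iteration t. At real runtime the prefix would hold the decoder's own earlier outputs instead of the ground truth.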

As a reminder, in a Transformer the embeddings are never concatenated at the input. Instead, each word flows through the encoder and decoder separately, but simultaneously.

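Also, notice that the mask contains negative infinities, not zeros. This is due to how the Softmax works in Attention: it exponentiates each score, so a score of 0 would still receive a positive weight (exp(0) = 1), whereas a score of negative infinity receives a weight of exactly 0.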
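
A quick way to see the difference is to compare the two options on one row of scores. This is only a sketch in plain Python with made-up scores, where the third position plays the role of a "future" token:

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Raw scores are [2.0, 1.0, 3.0]; the last one belongs to a future position.
zeroed = softmax([2.0, 1.0, 0.0])            # score overwritten with 0
masked = softmax([2.0, 1.0, float("-inf")])  # score pushed to negative infinity

print([round(w, 3) for w in zeroed])
print([round(w, 3) for w in masked])
```

With a zero, the future position still receives some attention weight. With negative infinity its weight is exactly 0 and the remaining weights are renormalized over the visible positions.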

We always first run the encoder, which always takes 1 iteration. The encoder then sits patiently on the side, as the decoder uses its values as needed.

Kari