During training, the whole target sequence is fed into the transformer decoder at once, so without any restriction the model could simply "peek ahead" and see what the next word is. To prevent this, we apply an additive mask to the result of the dot product between Query and Key, before the softmax. The mask is a triangular matrix with 0's on and below the diagonal and a very large negative number above the diagonal. The original paper, "Attention Is All You Need", describes this as setting those entries to negative infinity, while many implementations use a large negative constant such as -1e9 instead. When the mask is added to the attention scores, the scores in the upper triangle become extremely negative. When this matrix is passed through the softmax function, those entries become (effectively) 0, which means the word at timestep t does not attend to any word after timestep t. To put it in matrix format,
[8.1, 0.04, 5.2, 4.2]
[0.5, 9.2, 2.33, 0.7]
[0.2, 0.4, 6.11, 1.0]
[3.1, 2.1, 2.19, 8.1]
Let the above matrix A be the result of the dot product between query and key. A[0][0] contains the attention score of the first word's query against the first word's key, A[0][1] contains the score of the first word's query against the second word's key, and so on. After adding the mask and performing softmax on A, every entry above the diagonal becomes 0. Keeping the original scores in the surviving positions for readability (the actual softmax output is normalized so that each row sums to 1), the result would look like,
[8.1, 0.0, 0.0, 0.0]
[0.5, 9.2, 0.0, 0.0]
[0.2, 0.4, 6.11, 0.0]
[3.1, 2.1, 2.19, 8.1]
This forces each position in the decoder to attend only to itself and the words before it. You can check out the Transformer lecture available in CS224n for full details.
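If it helps to see the numbers, here is a minimal NumPy sketch of the masking step applied to the example matrix A. The helper names `causal_mask` and `softmax` are made up for this illustration and are not from any particular library. The sketch uses -1e9 as the large negative value, and leaves out the scaling by the square root of the key dimension that a real attention layer would apply.

```python
import numpy as np

# Raw query-key scores from the example (rows = queries, columns = keys).
A = np.array([
    [8.1, 0.04, 5.2, 4.2],
    [0.5, 9.2, 2.33, 0.7],
    [0.2, 0.4, 6.11, 1.0],
    [3.1, 2.1, 2.19, 8.1],
])

def causal_mask(n, neg=-1e9):
    # Hypothetical helper: 0 on and below the diagonal,
    # a large negative value strictly above it.
    return np.triu(np.full((n, n), neg), k=1)

def softmax(x):
    # Row-wise softmax; subtracting the row max keeps exp() stable.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

mask = causal_mask(A.shape[0])
weights = softmax(A + mask)  # mask is added before the softmax

print(np.round(weights, 4))
```

Every row of `weights` sums to 1 and every entry above the diagonal is 0. The first row comes out as [1, 0, 0, 0], since the first word has nothing earlier to attend to.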