
I have been looking for the answer in other questions, but none of them tackled this. I want to ask how the padding mask is taken into account in the attention formula.

The attention formula taking into account a causal mask is: $Attention(Q, K, V) = softmax(\frac{QK^{T} + CausalMask}{\sqrt{d_{k}}})V$

But how do we add the padding mask? The aim of a padding mask is to mask out the padding positions, since they exist only to make batching feasible. I don't see how this mask enters the attention formula.

Does it make sense to do an element-wise multiplication of the attention matrix with a tensor of ones of shape (batch size, sequence length, $d_{model}$), where for every sentence $s$ in the batch and every position $p$ that holds a padding token, tensor[s, p, :] is set to zeros?

Thank you in advance for your help!

Daviiid

2 Answers


Entries of an attention mask are typically either $0$ or $-\infty$.

So, adding such a mask gives either the original entry of $QK^T$ or $-\infty$.

The issue with element-wise multiplication by a binary matrix is that the resulting $0$ values still contribute to the softmax.

$$softmax(z)_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$$

$e^0$ is $1$, so an element-wise product before the softmax is not really a mask. Multiplying after the softmax doesn't work either, since the output is then no longer a probability distribution. To ensure the masked elements do not contribute at all, you need them to be $-\infty$, which mask addition does.
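To see this concretely, here is a minimal sketch (using PyTorch, which the answer does not specify) comparing the two kinds of masking on a single score vector:

```python
import torch

scores = torch.tensor([2.0, 1.0, 3.0])

# "Masking" the last entry by multiplying with 0: the entry becomes 0,
# but e^0 = 1, so it still receives a nonzero softmax weight.
mult = torch.softmax(scores * torch.tensor([1.0, 1.0, 0.0]), dim=-1)

# Additive mask with -inf: e^(-inf) = 0, so its weight is exactly 0 and
# the remaining weights still sum to 1.
add = torch.softmax(scores + torch.tensor([0.0, 0.0, float("-inf")]), dim=-1)

print(mult)  # last weight is about 0.09, not 0 -- not really masked
print(add)   # last weight is exactly 0
```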

If you know that certain positions of the input are padded, that is, they are all zero vectors, then the padding tokens are already ignored by the matmul (since they are zero). The issue is that the entries of $QK^T$ in the columns corresponding to the pad tokens are zero, not $-\infty$, so the softmax still gives them nonzero weight.

To ignore pad tokens, you create a mask with $-\infty$ along the columns corresponding to the padded positions and add it to $QK^T$ before the softmax. So it's the same attention formula, just with a different value of the mask.
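A sketch of that construction in batched form (PyTorch with made-up shapes and random tensors; the answer itself gives no code):

```python
import torch

batch, T, d_k = 2, 4, 8
q = torch.randn(batch, T, d_k)   # Q in the formula
k = torch.randn(batch, T, d_k)   # K in the formula
v = torch.randn(batch, T, d_k)   # V in the formula

# True marks a pad position; here the second sequence ends with two pads.
pad = torch.tensor([[False, False, False, False],
                    [False, False, True,  True]])

scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, T, T)

# -inf along the columns of the pad positions, broadcast over all rows.
mask = torch.zeros_like(scores).masked_fill(pad[:, None, :], float("-inf"))

weights = torch.softmax(scores + mask, dim=-1)  # pad columns get weight 0
out = weights @ v
```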

Venna Banana

For a given sequence $x$ you calculate the attention scores using the formula:

$$A = \frac{xQ K^{T} x^{T}}{\sqrt{d_k}},$$ where $Q, K$ are the query and key matrices of the attention layer.

The result is a square matrix of size $T \times T$, where $T$ is the length of the sequence $x$. The entry $A_{i,j}$ gives the attention score between $x_{i}$ and $x_{j}$ (note that in general $A_{i,j} \neq A_{j,i}$). So row $i$ gives you the attention scores for token $x_i$, i.e., how much it should attend to each other token of the sequence. You actually want to use these scores to perform a weighted average over the value encodings given by $xV$ ($V$ is the value matrix of the attention layer). But the scores can be arbitrary real numbers, and you want positive weights that sum to $1$. That is why we apply a softmax to convert the scores into attention weights.
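As a rough illustration of this pipeline for a single sequence (PyTorch, with random tensors just for the shapes; none of these names come from the answer):

```python
import torch

T, d_model, d_k = 5, 16, 8
x = torch.randn(T, d_model)      # one sequence of T token encodings

Q = torch.randn(d_model, d_k)    # query matrix
K = torch.randn(d_model, d_k)    # key matrix
V = torch.randn(d_model, d_k)    # value matrix

A = (x @ Q) @ (x @ K).T / d_k ** 0.5   # (T, T) attention scores
weights = torch.softmax(A, dim=-1)     # each row is a distribution over tokens
out = weights @ (x @ V)                # weighted average of the value encodings
```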

Now, if your sequence contains pad tokens, you don't want $x_i$ to attend to them. So you want to "remove" the attention between $x_i$ and the pad tokens. You could set the attention weights directly to $0$, but then the sum would no longer be $1$. Instead, you mask the attention scores before applying the softmax: you set the scores between $x_i$ and the pad tokens to a large negative number, so that the softmax produces attention weights of $0$ for them. You can set the scores to -float("inf"), but I think setting them to something like $-10^{9}$ is more than enough.
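A small self-contained sketch of just the masking step (again PyTorch; the score matrix is random here, standing in for the $A$ computed above):

```python
import torch

T = 5
A = torch.randn(T, T)  # stand-in for the (T, T) attention scores

# Suppose the last two tokens are padding: put a large negative score in
# their columns so the softmax sends the corresponding weights to ~0.
pad = torch.tensor([False, False, False, True, True])
A_masked = A.masked_fill(pad[None, :], -1e9)

weights = torch.softmax(A_masked, dim=-1)
print(weights[:, pad].max())  # ~0: no token attends to the pad positions
```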

For a concrete example, see this GitHub repo. Also, here is an extensive blog post that I wrote about the Transformer; you might like it.

pi-tau