
I have a couple of questions about token compression in Native Sparse Attention (https://arxiv.org/pdf/2502.11089).

  1. When we compute the attention of $q_t$ and $\tilde K^{cmp}_t$, is $\tilde K^{cmp}_t$ a single $\varphi(\cdot)$ output from formula (7), or the tensor formed by all the $\varphi(\cdot)$ outputs?
  2. Also, does the length of $\tilde K^{cmp}_t$ in formula (7) change with $t$?
HIH

1 Answer


The Native Sparse Attention (NSA) paper states around formula (7) that:

By aggregating sequential blocks of keys or values into block-level representations, we obtain compressed keys and values that capture the information of the entire block... where $l$ is the block length, $d$ is the sliding stride between adjacent blocks, and $\varphi$ is a learnable MLP with intra-block position encoding to map keys in a block to a single compressed key. $\tilde K^{cmp}_t \in \mathbb{R}^{d_k \times \lfloor \frac{t-l}{d} \rfloor}$ is tensor composed by compression keys... Compressed representations capture coarser-grained higher-level semantic information and reduce computational burden of attention

Therefore $\tilde K^{cmp}_t$ is a tensor of shape $(d_k, \lfloor \frac{t-l}{d} \rfloor)$ formed by concatenating all the $\varphi$-compressed key vectors along the second dimension (columns), i.e. it is the collection of all the $\varphi(\cdot)$ outputs, not a single one of them. And since the number of compressed keys is $\lfloor \frac{t-l}{d} \rfloor$, the number of blocks, and hence the number of columns in $\tilde K^{cmp}_t$, grows as the sequence length $t$ increases.
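
To make the block indexing concrete, here is a minimal PyTorch sketch under stated assumptions: the helper `compress_keys` is hypothetical, and the paper's $\varphi$ (an MLP with intra-block position encoding) is replaced by a plain flatten-plus-linear stand-in. It only illustrates that $\tilde K^{cmp}_t$ stacks one compressed key per block and that its number of columns grows with $t$.

```python
import torch

d_k = 64         # key dimension
l, d = 32, 16    # block length l and sliding stride d (the paper adopts d < l)

# Stand-in for phi: maps an (l, d_k) block of keys to a single compressed key of size d_k.
# The paper uses a learnable MLP with intra-block position encoding; this is only a placeholder.
phi = torch.nn.Sequential(
    torch.nn.Flatten(start_dim=-2),   # (l, d_k) -> (l * d_k,)
    torch.nn.Linear(l * d_k, d_k),
)

def compress_keys(k_prefix: torch.Tensor) -> torch.Tensor:
    """k_prefix: keys k_{1:t} of shape (t, d_k).
    Returns a sketch of K~cmp_t with shape (d_k, floor((t - l) / d)): one column per block."""
    t = k_prefix.shape[0]
    num_blocks = max((t - l) // d, 0)
    if num_blocks == 0:
        return k_prefix.new_zeros(d_k, 0)
    blocks = [k_prefix[i * d : i * d + l] for i in range(num_blocks)]   # overlapping blocks, stride d
    return torch.stack([phi(b) for b in blocks], dim=1)                 # compressed keys as columns

keys = torch.randn(200, d_k)
print(compress_keys(keys[:100]).shape)  # torch.Size([64, 4])  = floor((100-32)/16) columns
print(compress_keys(keys[:200]).shape)  # torch.Size([64, 10]) -> column count grows with t
```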

cinch