
I have a couple of questions about token compression in Native Sparse Attention (https://arxiv.org/pdf/2502.11089).

  1. When we compute the attention of $q_t$ and $\tilde K^{cmp}_t$, is $\tilde K^{cmp}_t$ a single $\varphi(\cdot)$ output from formula (7), or the tensor formed by all the $\varphi(\cdot)$ outputs?
  2. Also, does the length of $\tilde K^{cmp}_t$ in formula (7) change with $t$?
HIH

1 Answer


The Native Sparse Attention (NSA) paper states around formula (7) that:

By aggregating sequential blocks of keys or values into block-level representations, we obtain compressed keys and values that capture the information of the entire block... where $l$ is the block length, $d$ is the sliding stride between adjacent blocks, and $\varphi$ is a learnable MLP with intra-block position encoding to map keys in a block to a single compressed key. $\tilde K^{cmp}_t \in \mathbb{R}^{d_k \times \lfloor \frac{t-l}{d} \rfloor}$ is tensor composed by compression keys... Compressed representations capture coarser-grained higher-level semantic information and reduce computational burden of attention

Therefore $\tilde K^{cmp}_t$ is a tensor of shape $(d_k, \lfloor \frac{t-l}{d} \rfloor)$ formed by concatenating all the $\varphi$-compressed key vectors along the second dimension (columns), i.e. it is the collection of all the $\varphi(\cdot)$ outputs, not a single one of them. And since the number of compressed keys is $\lfloor \frac{t-l}{d} \rfloor$, the number of blocks, and hence the number of columns in $\tilde K^{cmp}_t$, grows as the sequence length $t$ increases.
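
To make the block indexing concrete, here is a minimal PyTorch sketch under stated assumptions: the helper `compress_keys` is hypothetical, and the paper's $\varphi$ (an MLP with intra-block position encoding) is replaced by a plain flatten-plus-linear stand-in. It only illustrates that $\tilde K^{cmp}_t$ stacks one compressed key per block and that its number of columns grows with $t$.

```python
import torch

d_k = 64         # key dimension
l, d = 32, 16    # block length l and sliding stride d (the paper adopts d < l)

# Stand-in for phi: maps an (l, d_k) block of keys to a single compressed key of size d_k.
# The paper uses a learnable MLP with intra-block position encoding; this is only a placeholder.
phi = torch.nn.Sequential(
    torch.nn.Flatten(start_dim=-2),   # (l, d_k) -> (l * d_k,)
    torch.nn.Linear(l * d_k, d_k),
)

def compress_keys(k_prefix: torch.Tensor) -> torch.Tensor:
    """k_prefix: keys k_{1:t} of shape (t, d_k).
    Returns a sketch of K~cmp_t with shape (d_k, floor((t - l) / d)): one column per block."""
    t = k_prefix.shape[0]
    num_blocks = max((t - l) // d, 0)
    if num_blocks == 0:
        return k_prefix.new_zeros(d_k, 0)
    blocks = [k_prefix[i * d : i * d + l] for i in range(num_blocks)]   # overlapping blocks, stride d
    return torch.stack([phi(b) for b in blocks], dim=1)                 # compressed keys as columns

keys = torch.randn(200, d_k)
print(compress_keys(keys[:100]).shape)  # torch.Size([64, 4])  = floor((100-32)/16) columns
print(compress_keys(keys[:200]).shape)  # torch.Size([64, 10]) -> column count grows with t
```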

cinch