
For a transformer decoder, how exactly are K, Q, and V computed at each decoding step?

Assume my input prompt is "today is a" (the start of "today is a good day").

At t = 0 (generation step 0): K, Q, and V are the projections of the sequence "today is a". Then say the next token generated is "good".

At t = 1 (generation step 1): which one is true?

  1. K, Q, and V are the projections of the sequence ("today is a good")
  2. K and Q are the projections of the sequence ("today is a"), and V is the projection of the sequence ("good")?
1 Answer


(This type of) autoregressive LLM always works by predicting the next token from all of the previous tokens. First you run the model with input "today is a" and the prediction is "good". Then you run the model with input "today is a good" and the prediction is "day", and so on. Each token is predicted by running the entire model from start to finish on its full input so far, so option 1 is true: at t = 1, K, Q, and V are all projections of the sequence "today is a good". (In practice, KV caching lets the model reuse the K and V it already computed for "today is a" rather than recomputing them, but they still correspond to the full sequence.)
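
To make this concrete, here is a minimal sketch of one self-attention step in PyTorch. It assumes a single head with random toy projection weights (`W_q`, `W_k`, `W_v` are hypothetical, not from any real model) and ignores positional encodings, layer norm, and the rest of the block; the point is only that at every generation step Q, K, and V are all projections of the entire sequence so far:

```python
import torch

torch.manual_seed(0)

d_model = 8  # toy embedding size

# Toy projection matrices for a single attention head (hypothetical values)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

def self_attention(x):
    # x: (seq_len, d_model) -- embeddings of ALL tokens seen so far
    Q = x @ W_q  # queries for every position
    K = x @ W_k  # keys for every position
    V = x @ W_v  # values for every position
    scores = (Q @ K.T) / d_model ** 0.5
    # Causal mask: each position attends only to itself and earlier positions
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

# t = 0: the model sees the prompt "today is a" (3 token embeddings)
x0 = torch.randn(3, d_model)
out0 = self_attention(x0)  # K, Q, V all come from "today is a"

# t = 1: the generated token "good" is appended; K, Q, and V are now
# projections of the FULL sequence "today is a good" (option 1)
x1 = torch.cat([x0, torch.randn(1, d_model)], dim=0)
out1 = self_attention(x1)
print(out0.shape, out1.shape)  # torch.Size([3, 8]) torch.Size([4, 8])
```

Note that the rows of K and V computed for "today is a" at t = 1 are identical to the ones computed at t = 0, which is exactly why real implementations cache them instead of recomputing.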