According to AI and Memory Wall, serving GPT models "involves repeated matrix-vector multiplications", but I don't understand why.
Let's suppose I am the sole user of an LLM server, so the batch size is 1.
However, the linear layers in each attention block (i.e., the q, k, v, and output projections) take an input of shape [batch, n_seq, n_embd] and produce an output of shape [batch, n_seq, n_embd], where both n_seq and n_embd are typically greater than 1000. The multiplication [1, n_seq, n_embd] x [n_embd, n_embd] -> [1, n_seq, n_embd] is surely a GEMM rather than a GEMV, isn't it?
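
To make the shapes concrete, here is a minimal PyTorch sketch of just the q projection at batch size 1 (the sizes 2048 are placeholders I picked for illustration, matching the "typically greater than 1000" claim):

```python
import torch

# Placeholder sizes for illustration; real GPT models have n_embd >= 768
# and contexts well past 1000 tokens.
batch, n_seq, n_embd = 1, 2048, 2048

x = torch.randn(batch, n_seq, n_embd)   # activations entering the attention block
w_q = torch.randn(n_embd, n_embd)       # weight matrix of the q projection

# [1, n_seq, n_embd] @ [n_embd, n_embd] -> [1, n_seq, n_embd]
q = x @ w_q
print(q.shape)  # torch.Size([1, 2048, 2048]) -- a matrix-matrix product (GEMM)
```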
Bonus question: If GPT inference indeed involves GEMMs, why is it still memory-bound rather than compute-bound?
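
For context, this is the back-of-the-envelope arithmetic intensity I get for that single projection, which is what makes the memory-bound claim confusing to me (fp16 and the sizes are my own assumptions, and I'm assuming ideal reuse with each operand read or written exactly once):

```python
# Rough arithmetic intensity of one [1, n_seq, n_embd] x [n_embd, n_embd] projection,
# assuming fp16 weights/activations and ideal reuse (each operand touched once).
n_seq, n_embd, bytes_per_elem = 2048, 2048, 2

flops = 2 * n_seq * n_embd * n_embd            # one multiply-add per output element per k
bytes_moved = bytes_per_elem * (
    n_seq * n_embd      # read input activations
    + n_embd * n_embd   # read weight matrix
    + n_seq * n_embd    # write output
)
print(flops / bytes_moved)  # ~683 FLOPs per byte, which looks compute-bound to me
```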