According to AI and Memory Wall, serving GPT models "involves repeated matrix-vector multiplications", but I don't understand why.
Let's suppose I am the sole user of an LLM server, so the batch size is 1.
However, the linear layers in each attention block (i.e., the q, k, v, and output projections) take an input of shape [batch, n_seq, n_embd] and produce an output of shape [batch, n_seq, n_embd], where both n_seq and n_embd are typically greater than 1000. The multiplication [1, n_seq, n_embd] x [n_embd, n_embd] -> [1, n_seq, n_embd] is surely a GEMM rather than a GEMV, isn't it?
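
To make the shapes concrete, here is a minimal PyTorch sketch of just the q projection at batch size 1 (the sizes 2048 are placeholders I picked for illustration, matching the "typically greater than 1000" claim):

```python
import torch

# Placeholder sizes for illustration; real GPT models have n_embd >= 768
# and contexts well past 1000 tokens.
batch, n_seq, n_embd = 1, 2048, 2048

x = torch.randn(batch, n_seq, n_embd)   # activations entering the attention block
w_q = torch.randn(n_embd, n_embd)       # weight matrix of the q projection

# [1, n_seq, n_embd] @ [n_embd, n_embd] -> [1, n_seq, n_embd]
q = x @ w_q
print(q.shape)  # torch.Size([1, 2048, 2048]) -- a matrix-matrix product (GEMM)
```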
Bonus question: If GPT inference indeed involves GEMMs, why is it still memory-bound rather than compute-bound?
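
For context, this is the back-of-the-envelope arithmetic intensity I get for that single projection, which is what makes the memory-bound claim confusing to me (fp16 and the sizes are my own assumptions, and I'm assuming ideal reuse with each operand read or written exactly once):

```python
# Rough arithmetic intensity of one [1, n_seq, n_embd] x [n_embd, n_embd] projection,
# assuming fp16 weights/activations and ideal reuse (each operand touched once).
n_seq, n_embd, bytes_per_elem = 2048, 2048, 2

flops = 2 * n_seq * n_embd * n_embd            # one multiply-add per output element per k
bytes_moved = bytes_per_elem * (
    n_seq * n_embd      # read input activations
    + n_embd * n_embd   # read weight matrix
    + n_seq * n_embd    # write output
)
print(flops / bytes_moved)  # ~683 FLOPs per byte, which looks compute-bound to me
```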