LLMs' maximum amount of output tokens is usually capped to 4096. Is there a reason for this?
1 Answer
This number is a little dated and not a magic constant: both the latest chatgpt-4o-latest and gpt-4o-mini models already support a higher maximum of $16{,}384$ output tokens, according to the OpenAI API documentation.
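For reference, here is a minimal sketch of requesting that larger cap through the OpenAI Python SDK. The model name and the $16{,}384$ limit come from the paragraph above; the exact parameter name (`max_tokens` vs. the newer `max_completion_tokens`) varies across model generations, so treat this as illustrative rather than definitive:

```python
# Minimal sketch: requesting the documented output-token cap via the OpenAI SDK.
# Assumes OPENAI_API_KEY is set in the environment; check the current API docs,
# since limits and parameter names change between model versions.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the transformer architecture."}],
    max_tokens=16384,  # request the full documented output cap
)
print(response.choices[0].message.content)
```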
In general, because the self-attention mechanism requires each position in the sequence to attend to every other position, it needs $O(n^2)$ memory for a sequence of length $n$. For the old cap this means $4096^2 = 16{,}777{,}216$ attention-matrix elements per head per layer, which already demanded substantial memory and so acted as a practical threshold for many older GPT models given the hardware they ran on. In addition, $4096$ tokens is close to the limit of what humans comfortably read and comprehend in a single response; most real-world use cases don't need longer outputs, and generation quality tends to degrade over very long outputs anyway.
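To make the quadratic cost concrete, here is a quick back-of-envelope calculation. The head and layer counts are assumptions chosen for illustration (roughly GPT-3-scale), not figures from the question:

```python
# Back-of-envelope arithmetic for the O(n^2) attention memory discussed above.
n = 4096                 # sequence length (the old output cap)
elements = n * n         # attention-matrix entries per head: 16,777,216
bytes_fp16 = 2           # half-precision storage per element

heads, layers = 32, 32   # assumed model dimensions, roughly GPT-3-class
per_head = elements * bytes_fp16
total = per_head * heads * layers

print(f"{elements:,} elements per head")                    # 16,777,216
print(f"{per_head / 2**20:.0f} MiB per head")               # 32 MiB
print(f"{total / 2**30:.0f} GiB across all heads/layers")   # 32 GiB
```

Even in half precision, materializing the full attention matrices at $n = 4096$ runs into tens of gigabytes under these assumed dimensions, which illustrates why the cap lined up with the hardware of the time.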