LLMs' maximum amount of output tokens is usually capped to 4096. Is there a reason for this?
1 Answer
This number is a little dated and not a magic constant: both the latest chatgpt-4o-latest and gpt-4o-mini models already support a higher maximum of $16{,}384$ output tokens, according to the OpenAI API documentation.
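For reference, here is a minimal sketch of requesting that larger cap through the OpenAI Python SDK. The model name and the $16{,}384$ limit come from the paragraph above; the exact parameter name (`max_tokens` vs. the newer `max_completion_tokens`) varies across model generations, so treat this as illustrative rather than definitive:

```python
# Minimal sketch: requesting the documented output-token cap via the OpenAI SDK.
# Assumes OPENAI_API_KEY is set in the environment; check the current API docs,
# since limits and parameter names change between model versions.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the transformer architecture."}],
    max_tokens=16384,  # request the full documented output cap
)
print(response.choices[0].message.content)
```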
In general, because the self-attention mechanism requires each position in the sequence to attend to every other position, it needs $O(n^2)$ memory for a sequence of length $n$. For the old cap this means $4096^2 = 16{,}777{,}216$ attention-matrix elements per head per layer, which already demanded substantial memory and so acted as a practical threshold for many older GPT models given the hardware they ran on. In addition, $4096$ tokens is close to the limit of what humans comfortably read and comprehend in a single response; most real-world use cases don't need longer outputs, and generation quality tends to degrade over very long outputs anyway.
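To make the quadratic cost concrete, here is a quick back-of-envelope calculation. The head and layer counts are assumptions chosen for illustration (roughly GPT-3-scale), not figures from the question:

```python
# Back-of-envelope arithmetic for the O(n^2) attention memory discussed above.
n = 4096                 # sequence length (the old output cap)
elements = n * n         # attention-matrix entries per head: 16,777,216
bytes_fp16 = 2           # half-precision storage per element

heads, layers = 32, 32   # assumed model dimensions, roughly GPT-3-class
per_head = elements * bytes_fp16
total = per_head * heads * layers

print(f"{elements:,} elements per head")                    # 16,777,216
print(f"{per_head / 2**20:.0f} MiB per head")               # 32 MiB
print(f"{total / 2**30:.0f} GiB across all heads/layers")   # 32 GiB
```

Even in half precision, materializing the full attention matrices at $n = 4096$ runs into tens of gigabytes under these assumed dimensions, which illustrates why the cap lined up with the hardware of the time.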