
LLMs' maximum amount of output tokens is usually capped to 4096. Is there a reason for this?

1 Answer


This number is somewhat dated and not a magic constant: both the latest chatgpt-4o-latest and gpt-4o-mini models have a higher output token limit of $16,384$ according to the OpenAI API documentation.
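As an illustration, here is a minimal sketch of requesting that larger output budget explicitly (it assumes the official `openai` Python SDK with an API key in the environment; the prompt is made up, and newer SDK versions may prefer `max_completion_tokens` over `max_tokens`):

```python
# Minimal sketch: asking for the full documented output budget of
# gpt-4o-mini via the OpenAI chat completions API.
# Assumes: `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a long story."}],  # hypothetical prompt
    max_tokens=16_384,  # per-request cap on output tokens, up to the model's limit
)

print(response.choices[0].message.content)
```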

In general, since the self-attention mechanism requires each position of the output sequence to attend to all other positions, it needs $O(n^2)$ memory for sequence length $n$. For the said cap this means $4096^2 = 16,777,216$ attention matrix elements, which already requires a lot of memory, so it served as a heuristic threshold for many older GPT models given the hardware resources of their time. Also, $4096$ tokens are close to human readability and comprehension limits, and most real-world use cases don't need longer outputs, whose quality tends to degrade in very long generations.
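A quick back-of-envelope calculation makes the quadratic growth concrete (the fp16 element size and the per-head, per-layer framing are illustrative assumptions; real memory use is multiplied further by batch size, heads, and layers):

```python
# Back-of-envelope sketch: memory for a single n x n attention score
# matrix at a given sequence length, assuming fp16 (2 bytes/element).
def attention_matrix_bytes(n: int, bytes_per_element: int = 2) -> int:
    """O(n^2) elements for one attention matrix of sequence length n."""
    return n * n * bytes_per_element

for n in (4096, 16_384):
    size = attention_matrix_bytes(n)
    print(f"n={n:>6}: {n * n:,} elements ~ {size / 2**20:.0f} MiB per head/layer")

# n=  4096: 16,777,216 elements ~ 32 MiB per head/layer
# n= 16384: 268,435,456 elements ~ 512 MiB per head/layer
```

Going from a $4096$ to a $16,384$ cap thus multiplies the attention matrix size by $16$, which is why raising the limit tracked hardware improvements rather than a software switch.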

cinch