Let's focus on the case of decoder-only transformers, where I am using Algorithm 10 from "Formal Algorithms for Transformers" by Mary Phuong and Marcus Hutter as a reference.
https://i.sstatic.net/ZWC9o.png
Previously I thought that the maximum context length was very much built into the transformer, for example as the dimension of some weight matrix. After studying this algorithm I am surprised, because it seems more like an artificial restriction! Since long contexts are a topic of active research, I would like to know whether I am misunderstanding something.
The way I see it, if I had access to the weights of GPT-2 right now, I could almost execute it on any number of tokens I like right away (given sufficient memory). The MHA algorithm is simply carried out over a longer sequence, and nothing in the computation itself depends on a maximum length (see the sketch after the list below). There are only two issues, which are the points where the context window $l_{max}$ appears:
- The positional encoding has only $l_{max}$ positions
- During training the weights were never optimized to attend over more than $l_{max}$ tokens.
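To make this concrete, here is a minimal NumPy sketch of masked single-head self-attention (my own illustration, not the paper's pseudocode). The only places the sequence length enters are the shape of the input and the causal mask, both of which are determined on the fly:

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head masked self-attention over a sequence of arbitrary length.

    X: (l, d_model) token representations; l can be anything.
    W_q, W_k, W_v: (d_model, d_head) projection matrices.
    Nothing here refers to a maximum context length l_max.
    """
    l, _ = X.shape
    Q, K, V = X @ W_q, X @ W_k, X @ W_v               # (l, d_head) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (l, l) attention logits
    mask = np.triu(np.ones((l, l), dtype=bool), k=1)  # causal mask, built from l
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # (l, d_head)

# Runs the same for l = 10 or l = 1000 (memory permitting):
d_model, d_head = 16, 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(causal_self_attention(rng.normal(size=(10, d_model)), W_q, W_k, W_v).shape)
print(causal_self_attention(rng.normal(size=(1000, d_model)), W_q, W_k, W_v).shape)
```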
But these issues seem rather easy to resolve:
Use some positional encoding which has infinitely many positions. The first encoding vectors are nicely spread out while the later ones lie closer to each other, due to the nature of fitting an infinite sequence of vectors into more or less a unit ball/sphere. But this is not an issue: it is natural for the positional encoding to become vaguer the further a token lies in the past.
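One concrete example of an encoding that is defined for arbitrarily many positions is the sinusoidal encoding from the original transformer paper; it does not compress distant positions in quite the way I describe above, but it shows that nothing forces a finite table of $l_{max}$ rows (sketch below, my own code):

```python
import numpy as np

def sinusoidal_positional_encoding(positions, d_model):
    """Sinusoidal encoding from 'Attention Is All You Need'.

    Defined for any integer position, so there is no finite table of
    l_max rows baked into the model.
    positions: 1-D array of positions (can go far beyond training lengths).
    Returns: (len(positions), d_model) array.
    """
    positions = np.asarray(positions, dtype=np.float64)[:, None]  # (l, 1)
    dims = np.arange(0, d_model, 2, dtype=np.float64)[None, :]    # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)        # (l, d_model/2)
    pe = np.empty((positions.shape[0], d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Positions far past any training-time l_max are still well-defined:
print(sinusoidal_positional_encoding([0, 1023, 10**6], d_model=8))
```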
Train 50% of the time on context lengths around $l_{max}$, 25% on context lengths around $2 l_{max}$, 12.5% on context lengths around $4 l_{max}$, and so on.
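As a sketch of what I mean by this schedule (the cap on doublings and the exact distribution are arbitrary choices of mine):

```python
import numpy as np

def sample_training_length(l_max, rng, max_doublings=6):
    """Sample a context length for one training batch under the proposed
    schedule: ~50% at l_max, ~25% at 2*l_max, ~12.5% at 4*l_max, ...
    """
    k = min(rng.geometric(p=0.5) - 1, max_doublings)  # k = 0 w.p. 1/2, 1 w.p. 1/4, ...
    return l_max * 2**k

rng = np.random.default_rng(0)
lengths = [sample_training_length(1024, rng) for _ in range(10_000)]
print({l: lengths.count(l) for l in sorted(set(lengths))})
```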
I can imagine the following issues appearing:
A) Memory becomes larger than what is available on a single "unit" (a GPU?), so you have to start moving data back and forth to execute your transformer. That is terribly inefficient during training and also at inference, so training on such large context windows becomes pointless in practice (see the rough calculation after this list).
B) Perhaps the transformer just doesn't learn well with this procedure for some reason.
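To put a number on issue A: naively materializing the $l \times l$ attention matrix scales quadratically in the sequence length. A back-of-the-envelope calculation (my own, assuming fp16 entries and no memory-saving attention tricks):

```python
# Rough cost of materializing the (l x l) attention matrix naively,
# per head and per layer, at fp16 (2 bytes per entry).
def attention_matrix_bytes(l, bytes_per_entry=2):
    return l * l * bytes_per_entry

for l in (1_024, 32_768, 1_000_000):
    gib = attention_matrix_bytes(l) / 2**30
    print(f"l = {l:>9,}: {gib:,.3f} GiB per head per layer")
# l = 1,000,000 gives roughly 1.8 TiB for a single head of a single layer,
# which is why long-context work avoids materializing the full matrix.
```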
These still seem like rather "soft" issues, though. As far as I can tell, I could, in theory, use the architecture of GPT-2 (with a modified positional encoding) to create LLMs with a context window of 1,000,000 tokens. So, am I missing something?
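For concreteness, here is where $l_{max}$ actually shows up in GPT-2's weights, as far as I can tell: a learned positional-embedding table with $l_{max}$ rows. This sketch assumes the Hugging Face `transformers` library and that I recall the `wte`/`wpe` attribute names correctly:

```python
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
print(model.wte.weight.shape)  # token embeddings:    (vocab_size, d_model)
print(model.wpe.weight.shape)  # position embeddings: (n_positions = 1024, d_model)
# Everything else (attention, MLPs, layer norms) is length-agnostic; only this
# 1024-row table, plus the training distribution, ties the model to l_max.
```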
Thank you!
