
From what I understand, Transformer encoders and decoders take a fixed number of tokens as input, e.g., 512 tokens. In NLP, for instance, different sentences have different numbers of tokens, and the way to deal with that is to truncate the longer sentences and pad the shorter ones. As an additional input, a padding mask must be supplied to the Transformer so that its attention is focused only on the relevant tokens.
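To make that concrete, here is a rough sketch of what I mean by padding plus a padding mask (the helper name `pad_batch` and the "True means ignore" mask convention are just my own illustration, following PyTorch's key-padding-mask style):

```python
import torch

def pad_batch(token_id_lists, pad_id=0, max_len=512):
    """Truncate to max_len, pad shorter sequences with pad_id,
    and build a key padding mask (True = position to ignore)."""
    token_id_lists = [ids[:max_len] for ids in token_id_lists]
    batch_len = max(len(ids) for ids in token_id_lists)
    input_ids = torch.full((len(token_id_lists), batch_len), pad_id, dtype=torch.long)
    padding_mask = torch.ones(len(token_id_lists), batch_len, dtype=torch.bool)
    for i, ids in enumerate(token_id_lists):
        input_ids[i, : len(ids)] = torch.tensor(ids)
        padding_mask[i, : len(ids)] = False  # real tokens are attended to
    return input_ids, padding_mask

ids, mask = pad_batch([[5, 8, 13], [2, 7]], pad_id=0)
# ids  -> tensor([[ 5,  8, 13], [ 2,  7,  0]])
# mask -> tensor([[False, False, False], [False, False,  True]])
```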

My question is: Is there something in the architecture that forces the Transformer to take a fixed number of tokens as input (rather than adapting dynamically to the actual input length, like RNNs do)?

For comparison, I think of fully convolutional networks or RNNs with variable input lengths. They are agnostic to the actual input dimension because they perform pointwise operations on the different patches. When applying an RNN to an n-token sentence, you compute the same block n times, and when applying it to a k-token sentence you apply it k times, so the architecture does not require padding or truncating (at least not in theory; I am not referring to implementation considerations here). In Transformers, embedding the tokens, computing attention, and the feed-forward layers can all be performed on sequences of different lengths, since the weights are applied per token, right? So why do we still truncate and pad to a fixed size? Or is it feasible but simply not done in practice for other reasons?
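As a quick sanity check of the "weights are applied per token" claim, a stock PyTorch encoder layer does run on different sequence lengths without complaint (the dimensions below are arbitrary):

```python
import torch
import torch.nn as nn

# The same encoder layer runs on sequences of different lengths;
# none of its weights depend on the sequence length.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

short = torch.randn(1, 7, 64)    # 7 tokens
long_ = torch.randn(1, 300, 64)  # 300 tokens
print(layer(short).shape)  # torch.Size([1, 7, 64])
print(layer(long_).shape)  # torch.Size([1, 300, 64])
```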

I must be missing something...

I'll ask it differently to make my question clearer: say I have an already-trained Transformer model, trained on fixed-size inputs of 512 tokens (truncated and padded). At inference time, if I want to process a single, shorter sentence, do I have to pad it or not?

Thanks

A. Maman

2 Answers


Edit, to reflect the edits to the question: if you trained your Transformer on length = n, then yes, you need to pad inputs to length = n. This is not a requirement of the mathematical architecture; it's a requirement of the implementation.
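As a concrete illustration (assuming a Hugging Face style tokenizer; the model name is just an example), padding a single short sentence up to the trained length at inference looks like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# One short sentence, padded/truncated to the fixed length the model
# was trained with; attention_mask marks the real (non-pad) tokens.
enc = tokenizer(
    "A short sentence at inference time.",
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(enc["input_ids"].shape)       # torch.Size([1, 512])
print(enc["attention_mask"].sum())  # number of non-pad tokens
```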


There seem to be two separate ideas in your question:

  1. Why do transformers have a fixed input length?
  2. Why do transformers have input length limitations?

I am not sure which one you are asking, so I will answer both.


1) Saying transformers have a fixed input length is misleading.

Transformers accept variable length inputs just like RNNs. You can think of padding/truncating as an extra embedding step if you want.

We don't need padding in RNNs because they process inputs (sentences) one element (token/word) at a time.

Transformers process the whole input at once. If you are passing in several sentences as a batch, you have to do something to make their lengths equal, hence padding and truncating.
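For example, a batch is one rectangular tensor, so sequences of different lengths cannot be stacked as-is (a minimal PyTorch illustration):

```python
import torch

a = torch.tensor([5, 8, 13])  # 3-token sentence
b = torch.tensor([2, 7])      # 2-token sentence

try:
    torch.stack([a, b])  # a batch must be rectangular
except RuntimeError as e:
    print(e)  # stack expects each tensor to be equal size, ...
```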


2) Transformers often limit input length to 512 or 1024 because of performance issues.

If you are wondering why we don't let our Transformers accept inputs as long as possible, the answer is that there are computational and performance limitations: self-attention scales quadratically with sequence length in both time and memory.
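A back-of-the-envelope calculation (assuming float32 scores and counting only the n x n attention matrix, per head and per layer) shows how fast this grows:

```python
# Memory for the n x n attention score matrix alone, per head, in float32.
for n in (512, 1024, 4096, 16384):
    scores_mb = n * n * 4 / 1e6
    print(f"n = {n:>5}: {scores_mb:>8.1f} MB per head per layer")
# Doubling the sequence length quadruples the cost.
```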

This is where the max_length parameter of a Transformer comes in. If max_length is 512 and your input has 1,000 tokens, the Transformer will throw an error (unless you truncate first), because it can only handle inputs up to 512 tokens long.
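One common source of that error, assuming the model uses learned positional embeddings with 512 entries (which is an assumption about the specific model), is simply that positions beyond 512 don't exist in the embedding table; a toy illustration:

```python
import torch
import torch.nn as nn

max_length = 512
pos_embedding = nn.Embedding(max_length, 64)  # one learned vector per position

positions = torch.arange(1000)  # a 1,000-token input
try:
    pos_embedding(positions)    # positions 512..999 don't exist
except IndexError as e:
    print(e)  # index out of range
```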

waxalas

To add something to pip.pip's answer (thumbs up, because it's totally on point): consider that Transformers can't be fully convolutional, since, as the name suggests, a fully convolutional model performs only convolutions, while Transformers include dense layers, which expect a fixed input dimension.

Although it is possible to overcome the fixed-dimensionality constraint imposed by the dense layers, for example by using pyramid pooling, that would only add complexity to the training regime, and there's no guarantee that performance would improve.
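For reference, here is a rough sketch of the pyramid-pooling idea (toy dimensions and my own helper name): it squeezes a variable-length sequence into a fixed-size vector before any dense layer that needs one:

```python
import torch
import torch.nn.functional as F

def pyramid_pool(x, bin_sizes=(1, 2, 4)):
    """x: (batch, seq_len, d_model) with arbitrary seq_len.
    Returns a fixed-size vector of length d_model * sum(bin_sizes)."""
    x = x.transpose(1, 2)  # (batch, d_model, seq_len) so pooling runs over time
    pooled = [F.adaptive_max_pool1d(x, bins).flatten(1) for bins in bin_sizes]
    return torch.cat(pooled, dim=1)

print(pyramid_pool(torch.randn(1, 37, 64)).shape)   # torch.Size([1, 448])
print(pyramid_pool(torch.randn(1, 512, 64)).shape)  # torch.Size([1, 448])
```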

Lastly, from a linguistic perspective, 512 or 1024 tokens are already quite a lot for learning most long-range dependencies (which, let's recall, were the main reason Transformers were introduced in place of RNNs). So the game of making Transformers input-size independent is not really worth the effort.

Edoardo Guerriero