Does ViT handle arbitrary sequence lengths using masking, the same way the normal Transformer does?

The ViT paper doesn't mention anything about it, so I assume it uses masking like the normal Transformer.

1 Answer

Yes, ViT can handle sequences of arbitrary length, but with some caveats.

In the paper "Training data-efficient image transformers & distillation through attention", the authors train models at a resolution of 224x224 (1 + 14x14 tokens) and then fine-tune at 384x384 (1 + 24x24 tokens).

The weights that produce the queries, keys, and values, as well as the feed-forward layers, operate on each token independently and are therefore agnostic to the sequence length.
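
To make this concrete, here is a minimal PyTorch sketch (the width of 192 and the single-head attention are illustrative assumptions, not DeiT's exact implementation) showing that the very same projection and feed-forward weights accept both sequence lengths:

    import torch
    import torch.nn as nn

    dim = 192  # illustrative width (DeiT-tiny happens to use 192)
    qkv = nn.Linear(dim, 3 * dim)  # per-token projection to queries, keys, values
    ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    # The same weights work for any sequence length:
    for seq_len in (1 + 14 * 14, 1 + 24 * 24):  # 224x224 and 384x384 inputs
        tokens = torch.randn(1, seq_len, dim)
        q, k, v = qkv(tokens).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * dim ** -0.5, dim=-1)
        out = ffn(attn @ v)
        print(out.shape)  # (1, 197, 192), then (1, 577, 192)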

However, the sequence length does matter for the positional embeddings, which contain a dedicated weight vector for each token position.

To make this construction work for inputs of other sizes, one needs to transform the positional embeddings accordingly. Bicubic interpolation of the positional embeddings, as used in DeiT, works pretty well. One can use simpler bilinear or nearest-neighbor interpolation, but this seems to harm accuracy.
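
Here is a hedged sketch of that interpolation in PyTorch (the function name resize_pos_embed and the shapes are my own assumptions for illustration; see the official DeiT repository for the exact code). The class-token embedding is set aside, the patch embeddings are reshaped into their 2D grid, bicubically resized, and flattened back:

    import torch
    import torch.nn.functional as F

    def resize_pos_embed(pos_embed, new_grid):
        # pos_embed: (1, 1 + old_grid**2, dim), class-token embedding first
        cls_embed, patch_embed = pos_embed[:, :1], pos_embed[:, 1:]
        dim = patch_embed.shape[-1]
        old_grid = int(patch_embed.shape[1] ** 0.5)

        # Reshape the flat patch embeddings back into a 2D grid: (1, dim, H, W)
        patch_embed = patch_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)

        # Bicubic interpolation to the new grid size
        patch_embed = F.interpolate(patch_embed, size=(new_grid, new_grid),
                                    mode="bicubic", align_corners=False)

        # Flatten back to a sequence and re-attach the class-token embedding
        patch_embed = patch_embed.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
        return torch.cat([cls_embed, patch_embed], dim=1)

    # 224x224 with 16x16 patches -> 14x14 grid; 384x384 -> 24x24 grid
    pos_224 = torch.randn(1, 1 + 14 * 14, 192)
    print(resize_pos_embed(pos_224, new_grid=24).shape)  # torch.Size([1, 577, 192])

Only this embedding table is resized before fine-tuning; everything else in the network is reused unchanged.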

In my own experience, taking DeiT-tiny, which has a base accuracy of 72.2%, and fine-tuning it this way for several epochs gave an accuracy of 77.0%.