Does ViT handle arbitrary sequence lengths using masking, the same way the normal Transformer does?

The ViT paper doesn't mention anything about it, so I assume it uses masking like the normal Transformer.

1 Answer

Yes, ViT can handle sequences of arbitrary length, but with some caveats.

In the paper "Training data-efficient image transformers & distillation through attention", the authors train models at a resolution of 224x224 (1 + 14x14 tokens) and then fine-tune at 384x384 (1 + 24x24 tokens).

The weights that produce the queries, keys, and values, as well as the feed-forward layers, operate on each token independently and are therefore agnostic to the sequence length.
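
To make this concrete, here is a minimal PyTorch sketch (the width of 192 and the single-head attention are illustrative assumptions, not DeiT's exact implementation) showing that the very same projection and feed-forward weights accept both sequence lengths:

    import torch
    import torch.nn as nn

    dim = 192  # illustrative width (DeiT-tiny happens to use 192)
    qkv = nn.Linear(dim, 3 * dim)  # per-token projection to queries, keys, values
    ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    # The same weights work for any sequence length:
    for seq_len in (1 + 14 * 14, 1 + 24 * 24):  # 224x224 and 384x384 inputs
        tokens = torch.randn(1, seq_len, dim)
        q, k, v = qkv(tokens).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * dim ** -0.5, dim=-1)
        out = ffn(attn @ v)
        print(out.shape)  # (1, 197, 192), then (1, 577, 192)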

However, the sequence length does matter for the positional embeddings, which contain a dedicated weight vector for each token position.

To make this construction work for inputs of other sizes, one needs to transform the positional embeddings accordingly. Bicubic interpolation of the positional embeddings, as used in DeiT, works pretty well. One can use simpler bilinear or nearest-neighbor interpolation, but this seems to harm accuracy.
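
Here is a hedged sketch of that interpolation in PyTorch (the function name resize_pos_embed and the shapes are my own assumptions for illustration; see the official DeiT repository for the exact code). The class-token embedding is set aside, the patch embeddings are reshaped into their 2D grid, bicubically resized, and flattened back:

    import torch
    import torch.nn.functional as F

    def resize_pos_embed(pos_embed, new_grid):
        # pos_embed: (1, 1 + old_grid**2, dim), class-token embedding first
        cls_embed, patch_embed = pos_embed[:, :1], pos_embed[:, 1:]
        dim = patch_embed.shape[-1]
        old_grid = int(patch_embed.shape[1] ** 0.5)

        # Reshape the flat patch embeddings back into a 2D grid: (1, dim, H, W)
        patch_embed = patch_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)

        # Bicubic interpolation to the new grid size
        patch_embed = F.interpolate(patch_embed, size=(new_grid, new_grid),
                                    mode="bicubic", align_corners=False)

        # Flatten back to a sequence and re-attach the class-token embedding
        patch_embed = patch_embed.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
        return torch.cat([cls_embed, patch_embed], dim=1)

    # 224x224 with 16x16 patches -> 14x14 grid; 384x384 -> 24x24 grid
    pos_224 = torch.randn(1, 1 + 14 * 14, 192)
    print(resize_pos_embed(pos_224, new_grid=24).shape)  # torch.Size([1, 577, 192])

Only this embedding table is resized before fine-tuning; everything else in the network is reused unchanged.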

In my own experience, taking DeiT-tiny, which has a base accuracy of 72.2%, and fine-tuning it this way for several epochs gave an accuracy of 77.0%.