Transformer-based detectors such as RT-DETR expect fixed input sizes because of rigid patch tokenization and the positional embeddings tied to it. If those embeddings are learned for a fixed grid (say, 10×10 patches from your 640×640 images), the model has learned to attend to positions under that particular grid structure. Feed it an image of a very different size and the learned positional encodings no longer align with the image layout; the transformer then misinterprets spatial relationships, leading to erroneous detections. In addition, models like RT-DETR use a decoder with a fixed set of object queries that attends to a sequence of encoder tokens whose length is determined by the input size used during training. In short, transformers expect a fixed-length token sequence, or at least positional embeddings that match the input token count.
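Here is a minimal sketch of where that mismatch shows up, assuming a hypothetical ViT-style patch stem with 64-pixel patches so that 640×640 yields the 10×10 grid above: the learned positional table has one row per training-time token, so a different resolution changes the token count and the addition no longer lines up.

```python
import torch
import torch.nn as nn

# Hypothetical ViT-style patch stem with a learned absolute positional table.
# The table is sized for the training grid, so any other resolution breaks it.
class PatchEmbed(nn.Module):
    def __init__(self, img_size=640, patch_size=64, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2          # 10 * 10 = 100 tokens
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)      # (B, N, dim)
        return tokens + self.pos_embed                        # requires N == 100

embed = PatchEmbed()
embed(torch.randn(1, 3, 640, 640))       # OK: 100 tokens match the learned table
try:
    embed(torch.randn(1, 3, 512, 512))    # 8 * 8 = 64 tokens -> shape mismatch
except RuntimeError as e:
    print("positional embedding mismatch:", e)
```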
By contrast, convolutional layers slide the same filters across the image, which naturally accommodates varying input sizes. Because the weights are shared across spatial positions, a CNN trained on 640×640 images can generalize to other resolutions as long as the learned filters (edges, textures, and so on) remain valid at those scales. CNNs also often include adaptive pooling layers, which map feature maps of any spatial size to a fixed output and let the network handle different input resolutions gracefully, as sketched below.
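A rough sketch of why this works (a toy network, not a real detector backbone): the convolutions run at any spatial size, and adaptive average pooling collapses whatever resolution comes out of them to a fixed shape before the head.

```python
import torch
import torch.nn as nn

# Toy CNN: convolutions slide over any input size; AdaptiveAvgPool2d collapses
# the resulting feature map to 1x1, so the linear head never sees a shape change.
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),      # (B, 64, 1, 1) regardless of input resolution
    nn.Flatten(),
    nn.Linear(64, 10),
)

for size in (640, 512, 800):       # different resolutions, same weights
    out = cnn(torch.randn(1, 3, size, size))
    print(size, out.shape)          # always torch.Size([1, 10])
```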
To enable ViTs to handle varying resolutions, use relative rather than absolute positional embeddings (as in Swin Transformer), or hierarchical ViTs such as HVT that mimic CNN feature pyramids. Both scale better with input size than simply interpolating or rescaling the learned positional embeddings, although interpolation remains a common fallback (sketched below).
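For reference, the interpolation baseline mentioned above looks roughly like this; the grid sizes are illustrative assumptions, and the idea is similar in spirit to the positional-embedding resizing helpers shipped by libraries like timm.

```python
import torch
import torch.nn.functional as F

# Resample a learned absolute positional-embedding grid to a new patch-grid
# size with bicubic interpolation. Grid sizes here are illustrative.
def resize_pos_embed(pos_embed, old_grid=(10, 10), new_grid=(8, 8)):
    # pos_embed: (1, old_h*old_w, dim) -> (1, new_h*new_w, dim)
    _, n, dim = pos_embed.shape
    grid = pos_embed.reshape(1, *old_grid, dim).permute(0, 3, 1, 2)   # (1, dim, H, W)
    grid = F.interpolate(grid, size=new_grid, mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], dim)

pos = torch.randn(1, 100, 256)                 # trained for a 10x10 patch grid
print(resize_pos_embed(pos).shape)             # torch.Size([1, 64, 256]) for 8x8
```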
From the Swin Transformer paper (Liu et al., 2021): "Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val)."
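A minimal sketch of the windowing bookkeeping that summary describes, under assumed sizes: partition the feature map into non-overlapping local windows (attention would run inside each one), then cyclically shift the map before the next block so information crosses window borders. The attention computation itself is omitted.

```python
import torch

# Split a (B, H, W, C) feature map into non-overlapping windows of tokens,
# then cyclically shift the map so the next block mixes across window borders.
def window_partition(x, window=7):
    # (B, H, W, C) -> (num_windows * B, window * window, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)

x = torch.randn(1, 56, 56, 96)                        # ViT-like feature map
local = window_partition(x)                            # 64 windows of 49 tokens
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))  # shift for the next block
local_shifted = window_partition(shifted)
print(local.shape, local_shifted.shape)                # (64, 49, 96) twice
```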
And from the HVT paper (Pan et al., 2021): "However, the routine of the current ViT model is to maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length and hence reduces the computational cost, analogous to the feature maps downsampling in Convolutional Neural Networks (CNNs). It brings a great benefit that we can increase the model capacity by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity due to the reduced sequence length... To demonstrate the improved scalability of our HVT, we conduct extensive experiments on the image classification task. With comparable FLOPs, our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets."
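A rough sketch of that hierarchical-pooling idea (not the authors' HVT code; sizes and the single reused layer are illustrative): pool along the token dimension between transformer stages so the sequence shrinks, analogous to downsampling feature maps in a CNN.

```python
import torch
import torch.nn as nn

# Between stages, 1D max pooling over the token dimension roughly halves the
# sequence length, so later attention layers operate on fewer tokens.
dim = 192
block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)  # one layer reused for brevity
pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

tokens = torch.randn(1, 196, dim)                          # e.g. a 14x14 patch grid
for stage in range(3):
    tokens = block(tokens)                                  # attention at this length
    tokens = pool(tokens.transpose(1, 2)).transpose(1, 2)   # (1, N, dim) -> (1, ~N/2, dim)
    print("stage", stage, "tokens:", tokens.shape[1])       # 98 -> 49 -> 25
```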