
The DETR paper states that its transformer has 17.8M parameters, given a hidden dimension of 256, 6 encoder blocks, and 6 decoder blocks.

But if I use the approximate formula $28 \times N \times d^2$ derived in this post for a stack of $N$ encoder blocks and $N$ decoder blocks with hidden size $d$, I get only about 11M.
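
Concretely, $28 \times 6 \times 256^2 = 11{,}010{,}048 \approx 11$M, well short of 17.8M.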

Can anyone resolve the discrepancy?

1 Answer


Indeed, the approximate formula in your reference gives about $28\,d_{model}^2$ parameters per encoder-decoder block pair, where $d_{model}=256$ for DETR. While the parameters of layer normalization, biases, and the other small terms inside the MHA+FFN blocks can safely be ignored, the parameter counts of the input embeddings, positional embeddings, output heads, and any model-specific modules of DETR's transformer cannot: depending on the embedding scheme adopted, each of these contributes parameters on a non-negligible scale of roughly $d_{model}^2$, and they are all transformer components. The author of your referenced article also points this out:

"I will not consider the input embedding layer with positional encoding and final output layer (linear + softmax) as Transformer components, focusing only on Encoder and Decoder blocks. I do so since these components are specific to the task and embedding approach, while both Encoder and Decoder stacks formed the basis of many other architectures later."
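
As a rough sanity check on the $28\,d_{model}^2$ per-pair figure, here is a minimal PyTorch sketch, not DETR's actual implementation, that counts the parameters of one stock encoder layer and one stock decoder layer at $d_{model}=256$. It assumes the $d_{ff}=4\,d_{model}$ feed-forward width that the rule of thumb is derived from; the head count is set to 8 but does not affect the total.

```python
# Minimal sketch: parameter count of stock PyTorch encoder/decoder layers
# at d_model = 256, assuming d_ff = 4 * d_model (the setting behind the
# 28 * d^2 rule of thumb). These are NOT DETR's exact layers or hyperparameters.
import torch.nn as nn

d_model, n_heads, n_layers = 256, 8, 6   # head count does not change the count
d_ff = 4 * d_model                       # assumption used by the formula

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

enc = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=d_ff)
dec = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=d_ff)

per_pair = n_params(enc) + n_params(dec)
print(per_pair)              # ~1.84M, close to 28 * 256**2 = 1,835,008
print(n_layers * per_pair)   # ~11.1M for the 6+6 stack
```

Six such pairs come to roughly 11.1M, matching the estimate in the question, so the remaining gap to 17.8M has to come from components and settings not captured by these stock blocks.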

cinch