I've been trying to implement a YOLO-like model for object detection. I came up with the following approach:
- Images of shape (B, 3, 224, 224) are fed to a (possibly pretrained) ResNet backbone.
- Each cell of the resulting (B, 512, 7, 7) feature map is mapped to a confidence score (the estimated probability that the cell contains an object), a shift of the bounding-box center within the cell, and the width and height of that box (a simplified sketch of this head follows the list).
- At inference, the outputs are post-processed with Soft-NMS to remove duplicates, which arise because each cell's confidence is predicted independently of its neighbours, so several adjacent cells can fire on the same object.
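To make this concrete, here is a simplified sketch of the backbone + head (the ResNet-18 choice, the 1x1 conv head, and the sigmoid activations are illustrative rather than my exact code):

```python
import torch
import torch.nn as nn
import torchvision

class SingleBoxDetector(nn.Module):
    """Sketch: ResNet-18 backbone -> (B, 512, 7, 7) -> 5 values per cell."""

    def __init__(self, pretrained=True):
        super().__init__()
        resnet = torchvision.models.resnet18(
            weights=torchvision.models.ResNet18_Weights.DEFAULT if pretrained else None
        )
        # Keep everything up to the last conv stage; for 224x224 input this yields (B, 512, 7, 7).
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # Per cell: 1 confidence logit + (dx, dy, w, h).
        self.head = nn.Conv2d(512, 5, kernel_size=1)

    def forward(self, x):
        feats = self.backbone(x)            # (B, 512, 7, 7)
        out = self.head(feats)              # (B, 5, 7, 7)
        conf_logit = out[:, 0]              # (B, 7, 7)
        dxdy = torch.sigmoid(out[:, 1:3])   # center shift within the cell, in [0, 1]
        wh = torch.sigmoid(out[:, 3:5])     # width/height as a fraction of the image
        return conf_logit, dxdy, wh
```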
I predict a single bounding box per cell and no classes, since my dataset has at most one object per image and a single class. For the loss I used CE for the confidence and a mix of MSE and SIoU for the boxes.
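The loss looks roughly like this (a sketch, not my exact code: I'm using torchvision's `complete_box_iou_loss` here as a stand-in for SIoU, which torchvision doesn't ship, and BCE for the single per-cell confidence):

```python
import torch
import torch.nn.functional as F
from torchvision.ops import complete_box_iou_loss  # stand-in for SIoU in this sketch

def detection_loss(conf_logit, pred_xyxy, target_conf, target_xyxy, obj_mask,
                   w_box=5.0, w_iou=1.0):
    """conf_logit, target_conf: (B, 7, 7), targets are float 0/1;
    pred_xyxy, target_xyxy: (B, 7, 7, 4) boxes decoded to (x1, y1, x2, y2) image coords;
    obj_mask: (B, 7, 7) bool, True for the cell responsible for the object's center."""
    # Confidence: binary cross-entropy over all cells.
    conf_loss = F.binary_cross_entropy_with_logits(conf_logit, target_conf)
    # Box regression terms only for cells that contain an object.
    if obj_mask.any():
        p, t = pred_xyxy[obj_mask], target_xyxy[obj_mask]
        box_loss = w_box * F.mse_loss(p, t) + w_iou * complete_box_iou_loss(p, t, reduction="mean")
    else:
        box_loss = conf_logit.sum() * 0.0  # keep the graph connected when the batch has no objects
    return conf_loss + box_loss
```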
However, although the model found the objects and located their centers nicely, it simply refused to learn meaningful widths and heights, almost always spitting out the mean bounding box regardless of the actual object size.
To fix that, I added anchor boxes obtained by K-Means clustering of the ground-truth boxes in the dataset, and predicted multiple boxes per cell, assigning each ground-truth box to the anchor with the best IoU. After that, the model started working correctly, even in cases where the ground-truth box was not similar to any of the anchors.
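For reference, the anchors come from YOLOv2-style K-Means over the ground-truth (w, h) pairs with 1 - IoU as the distance, and widths/heights are now decoded relative to the matched anchor, roughly like this (function names are illustrative):

```python
import numpy as np
import torch

def wh_iou(wh, anchors):
    """IoU between boxes given only by (w, h), as if centered at the same point.
    wh: (N, 2), anchors: (K, 2) -> (N, K)."""
    inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(wh[:, None, 1], anchors[None, :, 1])
    union = wh[:, 0:1] * wh[:, 1:2] + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iters=100, seed=0):
    """K-Means on ground-truth (w, h) pairs with 1 - IoU as the distance (YOLOv2-style)."""
    wh = np.asarray(wh, dtype=np.float64)
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, anchors), axis=1)  # nearest anchor = highest IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = np.median(wh[assign == j], axis=0)
    return anchors

def decode_wh(t_wh, anchors_wh):
    """Raw offsets t_wh (..., K, 2) become box sizes anchor * exp(t), so a zero output
    already corresponds to the anchor's shape and the net only learns a correction.
    anchors_wh: torch tensor of shape (K, 2)."""
    return anchors_wh * torch.exp(t_wh)
```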
Why does that happen? I understand the idea of anchor boxes (we predict corrections to a predefined set of anchors instead of regressing the box directly), but I just don't see why it works better. It seems like we are predicting the same information, just in a more convoluted way!