I'm fine-tuning two different CNNs for an image classification task:
- The first CNN uses a ResNet101 backbone, and the second uses a MobileNetV2 backbone. Both are pre-trained on ImageNet.
- I use the same classification head for both models: two dense layers with 1024 neurons each, then a Dropout(0.3) layer, and finally a softmax output layer with 2 neurons (for the 2 classes).
- I use the same dataset (~500 images).
- No image augmentation (for experimentation purposes).
- I train both models for the same number of epochs (20).
- I use the same optimizer and learning rate (5e-6).
- I freeze the entire backbone and only train the head.
- The code is written in TensorFlow (tf.keras).
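
For concreteness, here is a minimal sketch of my setup in tf.keras. Some details are simplified or assumed (224×224 inputs, ReLU activations in the dense layers, Adam, global average pooling, and sparse categorical cross-entropy; my actual code differs only in minor ways):

```python
import tensorflow as tf

NUM_CLASSES = 2  # two classes, as described above


def build_model(backbone_fn, preprocess_fn):
    """Frozen ImageNet backbone + the dense head described above.

    backbone_fn: a tf.keras.applications constructor (e.g. ResNet101).
    preprocess_fn: the matching preprocess_input for that backbone.
    """
    inputs = tf.keras.Input(shape=(224, 224, 3))
    # Each tf.keras.applications backbone expects its own preprocessing.
    x = preprocess_fn(inputs)
    backbone = backbone_fn(include_top=False, weights="imagenet", pooling="avg")
    backbone.trainable = False       # freeze the entire backbone
    x = backbone(x, training=False)  # keep BatchNorm in inference mode
    x = tf.keras.layers.Dense(1024, activation="relu")(x)
    x = tf.keras.layers.Dense(1024, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=5e-6),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


resnet_model = build_model(
    tf.keras.applications.ResNet101,
    tf.keras.applications.resnet.preprocess_input,
)
mobilenet_model = build_model(
    tf.keras.applications.MobileNetV2,
    tf.keras.applications.mobilenet_v2.preprocess_input,
)
```

Both models are then trained with the same `model.fit(...)` call on the same data.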
However, I noticed something odd: the ResNet101 model barely reaches 55% accuracy, while the MobileNetV2 model achieves around 90% under the exact same setup. Since ResNet101 is deeper and generally more powerful than MobileNetV2, I expected it to perform better, but the opposite happened.
My questions are:
1- Why does ResNet101 perform so poorly compared to MobileNetV2 in this setup?
2- Are there specific considerations when fine-tuning deeper networks like ResNet101 compared to lighter models like MobileNetV2?