
During transfer learning in computer vision, I've seen that the layers of the base model are typically frozen if the new images aren't too different from the data the base model was trained on.

However, on the NLP side, I see that the layers of the BERT model are never frozen during fine-tuning. What is the reason for this?

nbro
Bunny Rabbit

1 Answer


Corrections and other answers are welcome, but here are a few thoughts:

There are several approaches to choosing which weights get frozen (and there are other considerations as well; see, for example, Fig. 5 in "Galactica: A Large Language Model for Science").
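As a rough illustration, here is a minimal PyTorch / Hugging Face sketch of two such options (full fine-tuning vs. freezing the embeddings and the lower encoder layers). The model name and the choice of freezing the first 8 layers are just example assumptions, not a recommendation:

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Option 1: full fine-tuning -- every parameter stays trainable (the default).
for param in model.parameters():
    param.requires_grad = True

# Option 2: freeze the embeddings and the lower encoder layers,
# fine-tune only the top layers (plus whatever task head is added on top).
for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:8]:  # bert-base has 12 encoder layers
    for param in layer.parameters():
        param.requires_grad = False

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```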

Which of the approaches yields higher-quality results depends on the architecture (and hyperparameters) and dataset.

There are rules of thumb; for example, an old snapshot of the Hugging Face Transformers documentation said:

we are directly fine-tuning the whole model without taking any precaution. It actually works better this way for Transformers model

but this explanation was apparently removed from the newer version of that page. Maybe it turned out that such rules of thumb don't hold in general.

Quality of results is also not the only thing being optimized: some choices are made for memory or compute reasons. For example, when the first layers are frozen, their output features can be computed once per sample, cached, and reused for the rest of training; moreover, the gradient of the loss with respect to the weights of the frozen first block never needs to be computed.
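To make that caching idea concrete, here is a rough PyTorch sketch (the names `frozen_encoder`, `head`, and `dataloader` are placeholders, not a specific API): the frozen block's outputs are computed once under `torch.no_grad()` and reused at every epoch, and backpropagation only touches the small trainable head:

```python
import torch

@torch.no_grad()  # no gradient bookkeeping for the frozen part
def precompute_features(frozen_encoder, dataloader):
    """Run the frozen lower block once per sample and cache its outputs."""
    frozen_encoder.eval()
    cache = []
    for inputs, labels in dataloader:
        cache.append((frozen_encoder(inputs), labels))
    return cache

def train_head(head, cached_features, epochs, lr=1e-3):
    """Train only the small head on the cached (fixed) features."""
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, labels in cached_features:  # frozen features reused every epoch
            loss = loss_fn(head(features), labels)
            optimizer.zero_grad()
            loss.backward()  # gradients flow only through the head
            optimizer.step()
```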

root