
I've had it in my head that, generally speaking, it's better to freeze layers when fine-tuning an LLM, as per this quote from Hugging Face's PEFT article:

PEFT approaches only fine-tune a small number of (extra) model parameters while freezing most parameters of the pretrained LLMs, thereby greatly decreasing the computational and storage costs. This also overcomes the issues of catastrophic forgetting, a behaviour observed during the full finetuning of LLMs. PEFT approaches have also shown to be better than fine-tuning in the low-data regimes and generalize better to out-of-domain scenarios. It can be applied to various modalities, e.g., image classification and stable diffusion dreambooth.

I think what I might be confused by is what is meant by the "(extra)" part. It led me to try fine-tuning a BERT model in PyTorch by freezing all parameters except for the final classification head (the feed-forward layer responsible for sequence classification):

# Freeze every parameter in the pretrained model...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the classification head
for param in model.classifier.parameters():
    param.requires_grad = True

However, this gave me significantly worse evaluation metrics on my test set than fine-tuning the full model. This led me to the following questions:

  • My dataset of ~100K datapoints is not a "low-data regime" and therefore doesn't benefit from PEFT? But doesn't the quote say PEFT generalizes better to "out-of-domain scenarios"? How do I know whether the particular sequence classification I'm doing with BERT is out-of-domain? Because it isn't one of BERT's pretraining tasks (masked-language modelling / next-sentence prediction)?

  • Is this the cost of misinterpreting the "(extra)" model parameters part? I'm fine-tuning a small number of extant model parameters here, not extra ones (I try to illustrate what I think "extra" means in the sketch below).

I'm just confused here. The quote I've shown makes me believe my PEFT model should have outperformed regular fine-tuning.
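To make the "(extra)" part concrete, here is a toy sketch of what I now think it means: the pretrained weight stays frozen, and two small brand-new matrices (a LoRA-style low-rank adapter) are the only thing being trained. This is purely illustrative, not code I've run:

import torch
import torch.nn as nn

class LowRankAdapterLinear(nn.Module):
    # Wraps a pretrained nn.Linear: the extant weights are frozen,
    # and small *extra* matrices A and B are trained from scratch.
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # extant pretrained parameters: frozen
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))  # extra parameters

    def forward(self, x):
        # original output plus a low-rank correction (zero at initialization)
        return self.base(x) + x @ self.A @ self.B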

1 Answer


I have seen researchers use both approaches. Typically, freezing is useful when you have very few examples (fewer than ~100 in my experience). Otherwise, ask yourself:

  1. Do you have enough resources to fine-tune with an unfrozen model? (See the sketch after this list.)
  2. Is performance better with an unfrozen or a frozen model?
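As a rough way to see the resource difference, compare the trainable-parameter counts of the two setups. A minimal sketch, assuming a standard BertForSequenceClassification from transformers:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

def count_trainable(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

print("unfrozen:", count_trainable(model))  # full fine-tuning updates everything

# Freeze the encoder and keep only the classification head trainable
for p in model.parameters():
    p.requires_grad = False
for p in model.classifier.parameters():
    p.requires_grad = True

print("frozen:", count_trainable(model))  # only the head's parameters remain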

Many people have turned to LoRA to avoid this issue. https://github.com/microsoft/LoRA
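For example, with Hugging Face's peft package (one implementation of LoRA) you leave the base model frozen and train small extra low-rank matrices instead of unfreezing existing layers. A minimal sketch; the rank, alpha, and target_modules values here are illustrative, not tuned:

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # sequence classification task
    r=8,                                # rank of the extra low-rank matrices
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT attention projections
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices and the head are trainable
# `model` can then be trained exactly like the original model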

Thomas K