
I cannot find a definite answer to this question. Suppose I want to build a QA (question answering) system over a set of personal documents. It seems that RAG (retrieval-augmented generation) is the way to go for this task, but I do not understand why some flavour of fine-tuning would not work. LLMs are able to answer factual questions about the data they are trained on (as we see each time we query an LLM), so why shouldn't they be able to answer questions about data they are fine-tuned on as well? Is there a good reference or discussion of this issue, and some code showcasing this behavior?

Franck Dernoncourt
Thomas

1 Answer


From LIMA: Less Is More for Alignment (2023) by Chunting Zhou et al.:

These results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.

From Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (2024) by Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig:

When large language models are aligned via supervised fine-tuning, they may encounter new factual information that was not acquired through pre-training. It is often conjectured that this can teach the model the behavior of hallucinating factually incorrect responses, as the model is trained to generate facts that are not grounded in its pre-existing knowledge. In this work, we study the impact of such exposure to new knowledge on the capability of the fine-tuned model to utilize its pre-existing knowledge. Our results highlight the risk in introducing new factual knowledge through fine-tuning, and support the view that large language models mostly acquire factual knowledge through pre-training, whereas fine-tuning teaches them to use it more efficiently.
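To make the contrast concrete, here is a minimal RAG sketch (not taken from either paper): instead of trying to push new facts into the model's weights via fine-tuning, the relevant document chunks are retrieved at query time and placed in the prompt, so the model only has to read them. The example documents, the question, and the choice of TF-IDF retrieval (rather than dense embeddings and a vector store) are illustrative assumptions to keep it self-contained; the actual generation call to an LLM is omitted.

```python
# Minimal RAG sketch: retrieve the most relevant document chunks with TF-IDF,
# then place them in the prompt so the model answers from retrieved context
# rather than from knowledge it would have to acquire via fine-tuning.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative "personal documents" (assumed for this example).
documents = [
    "Alice joined the company in March 2021 as a data engineer.",
    "The Q3 report shows revenue grew 12% year over year.",
    "Our VPN requires the corporate certificate installed on each laptop.",
]

question = "When did Alice join the company?"

# Represent documents and question in the same TF-IDF space.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([question])

# Rank documents by cosine similarity to the question and keep the top k.
scores = cosine_similarity(query_vector, doc_vectors)[0]
top_k = scores.argsort()[::-1][:2]
context = "\n".join(documents[i] for i in top_k)

# The retrieved context is prepended to the question; any chat-completion
# API could consume this prompt (the generation step is omitted here).
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\n"
)
print(prompt)
```

In a real system you would swap TF-IDF for an embedding model plus a vector index and chunk the documents, but the division of labour is the same: retrieval supplies the new facts, and the model's pretraining and instruction tuning supply the ability to read them and produce an answer.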

Franck Dernoncourt