
Mathematically, I wouldn't expect LLMs to be able to reproduce source texts exactly unless the source text were the probable outcome given some prompt. However, I have now tested HuggingFaceH4/zephyr-7b-beta, TheBloke/Llama-2-7B-Chat-GGUF, and ChatGPT, and all of them accurately reproduce the opening of Moby Dick.

HuggingFaceH4/zephyr-7b-beta

Prompt:

Print the first page of Moby Dick.

Response:

I'm unable to print documents directly, but I can provide you with the text of the first page of moby dick by herman melville:

call me ishmael. some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen, and regulating the circulation.
...SNIP...

TheBloke/Llama-2-7B-Chat-GGUF

Prompt:

print the first page of moby dick

Response:

Certainly! Here is the first page of Moby Dick by Herman Melville:
Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen, and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin-makers' shops, and 
...SNIP...

I would have expected them to generate an approximate, probabilistic answer, but this appears to be exact. Is that simply because it was the likely output? Is there some alternate facility that is producing these results? I used h2ogpt for testing.
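
For reference, here is a minimal sketch of how the same test can be run directly with the transformers library, outside h2ogpt; the sampling settings and the substring check are just illustrative choices, and it assumes a machine that can load a 7B model (roughly 16 GB of GPU memory in bfloat16):

# Minimal sketch: send the same prompt to zephyr-7b-beta directly via transformers
# and check whether the continuation contains the real opening sentence.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Print the first page of Moby Dick."}]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

out = pipe(prompt, max_new_tokens=400, do_sample=True, temperature=0.7, top_p=0.95)
text = out[0]["generated_text"]

print("contains the real opening:",
      "never mind how long precisely" in text.lower())
print(text)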

Update

It seems like it's probably just that the text is mathematically likely, because if you ask other questions the model does just generate nonsense. For example: "What are the chapters in Moby Dick?" I haven't read Moby Dick in English, but I don't recall a chapter 90, "The Whale's Eye Cups".

Grant Curell

3 Answers


Google "call me ishmael. some years ago—never mind how long precisely—having" and you'll see a fair number of results. LLM training sets are likely to have several copies of it as well, hence your observation. Note that some LLM training datasets do try to deduplicate data, e.g. https://huggingface.co/datasets/uonlp/CulturaX

Franck Dernoncourt

LLMs are, information-theoretically, just very lossy compression of their entire corpora, and they are large enough for the "decompression" of some parts to be recognizable and reasonably faithful. I don't think there is a really good understanding of which parts are recoverable to this extent. Naively one might think it's only very common material that appears repeatedly in the corpus, but models have also been found to reproduce exact or near-exact text for single-instance personal information. I'm not familiar with the literature, but I suspect this is an exciting research area.

"The probable outcome given some prompt" is one way of looking at what is computationally being done with the model, but it's important to realize that this assumes a probability model, which was derived from the training corpus. It might help as an exercise to imagine doing the model with a naive Markov chain instead, where it's mentally tractable to see how this kind of text can be encoded reproducibly in the probability model.


When we create and train neural networks, the goal is to get them to model a general representation of the input data, so that they will produce the desired output for inputs that are generally similar to inputs that were in the training data.

But if your network has too many parameters, it doesn't need to generalize during the training stage. It might instead use those parameters to store representations of the input data and generate the desired output precisely - basically memorizing every input/output pair that's present in the training data. That won't necessarily happen, but it can happen. If you give it enough parameters and train it long enough, it probably will happen.

And then you feed it something that wasn't in the training set, and it produces garbage, because it didn't generalize at all; it just encoded the inputs and outputs that it was trained with.

This is one of the reasons that it's a good idea to train with one set of input/output pairs, and test with another set. If the network is generalizing, it will do a pretty good job with the test set even though those input/output pairs aren't part of the training set. But if the network is memorizing, it will do a pretty terrible job with the test set, because those input/output pairs were not memorized.
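
To see that train/test signature of memorization on something small, here is a sketch using scikit-learn regression trees as a stand-in for an over-parameterized network (the dataset and model choices are just illustrative): the unconstrained model scores essentially perfectly on its training data but noticeably worse on held-out data, while the constrained model scores similarly on both.

# Sketch: using a train/test split to detect memorization, with scikit-learn
# decision trees standing in for an over-parameterized network.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)    # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No depth limit: enough capacity to memorize every training point.
memorizer = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
# Constrained capacity: forced to generalize instead of memorize.
generalizer = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

for name, model in [("memorizer", memorizer), ("generalizer", generalizer)]:
    print(name,
          "train R^2:", round(model.score(X_train, y_train), 3),
          "test R^2:", round(model.score(X_test, y_test), 3))
# Typical output: the memorizer is ~1.0 on train but clearly lower on test;
# the constrained tree is similar on both.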

We generally* want the network to have the smallest number of parameters that will give acceptably good results. If you constrain the network to have a small number of parameters, it can't just encode the inputs and outputs verbatim - to perform well on the test set, it pretty much needs to encode a generalization of the training data, rather than the specifics of the training data.

In the context of the "predict the next word" approach used with LLMs, memorizing input/output pairs basically means that when prompted with a fragment of training data, it will regurgitate the remainder of that training data.

So, if an LLM reproduces lots of training data verbatim, that's probably a pretty good sign that the model has far too many parameters. A model with fewer parameters can probably perform just as well (probably better) by generalizing more and memorizing less.

* All generalizations are false, but some generalizations are useful.

NSFW