Mathematically, I wouldn't expect LLMs to reproduce source texts exactly unless the source text was the probable outcome given some prompt. However, I have now tested HuggingFaceH4/zephyr-7b-beta, TheBloke/Llama-2-7B-Chat-GGUF, and ChatGPT, and all of them reproduce the opening of Moby Dick verbatim.
HuggingFaceH4/zephyr-7b-beta
Prompt:
Print the first page of Moby Dick.
Response:
I'm unable to print documents directly, but I can provide you with the text of the first page of moby dick by herman melville:
call me ishmael. some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen, and regulating the circulation.
...SNIP...
TheBloke/Llama-2-7B-Chat-GGUF
Prompt:
print the first page of moby dick
Response:
Certainly! Here is the first page of Moby Dick by Herman Melville:
Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen, and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin-makers' shops, and
...SNIP...
I would have expected them to generate an approximate, probabilistically sampled answer, but the output appears to be exact. Is that simply because the original text is the most likely continuation? Or is there some other mechanism producing these results? I used h2ogpt for testing.
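One way to test the "most likely continuation" hypothesis directly is to score the famous opening token by token and see how often the real next token is the model's top-ranked prediction. Below is a minimal sketch, assuming the zephyr-7b-beta weights are available locally via the transformers library (with accelerate for device_map); the prompt string is just an illustrative choice, not the exact prompt the chat UI builds.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "The first page of Moby Dick by Herman Melville:\n\n"
continuation = (
    "Call me Ishmael. Some years ago—never mind how long precisely—having "
    "little or no money in my purse, and nothing particular to interest me "
    "on shore, I thought I would sail about a little and see the watery "
    "part of the world."
)

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    logits = model(full_ids).logits

# Position i of the logits predicts token i+1, so align predictions with targets.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
targets = full_ids[0, 1:]
start = prompt_ids.shape[1] - 1  # first position that predicts a continuation token

# Fraction of continuation tokens that are the model's single most likely choice.
top1 = log_probs[start:].argmax(dim=-1)
match = (top1 == targets[start:]).float().mean()
print(f"Continuation tokens ranked #1 by the model: {match:.2%}")
```

If nearly every token of the passage is the model's top choice, exact reproduction under greedy or low-temperature decoding is exactly what the math predicts, with no retrieval facility needed.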
Update
It seems like it's probably just that the text is mathematically likely, because if you ask other questions the model generates nonsense. For example: What are the chapters in Moby Dick? I haven't read Moby Dick in English, but I don't recall a chapter 90, "The Whale's Eye Cups".
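A quick way to make that contrast visible (again just a sketch, with the same assumptions as above; the prompts are only examples) is to sample each question several times with a nonzero temperature. A memorised passage should come back near-verbatim on every run, while the chapter list should drift between runs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompts = [
    "Print the first paragraph of Moby Dick.",
    "List the chapters of Moby Dick.",
]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    for _ in range(3):
        out = model.generate(**inputs, max_new_tokens=80,
                             do_sample=True, temperature=0.8)
        text = tokenizer.decode(out[0, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)
        # Stable output across samples suggests a strongly memorised sequence;
        # varying output suggests the model is just sampling plausible filler.
        print(f"{prompt!r} -> {text[:120]!r}")
```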