
One of the main concerns of using ChatGPT answers on Stack Exchange is that it may copy verbatim or almost verbatim some text from its training set, which may infringe the source text's license. This makes me wonder how much of the ChatGPT output is copied from its training set (vs. being abstractively generated).

Franck Dernoncourt
    Is there any evidence for this "it may copy verbatim or almost verbatim some text from its training set"? This may be true, but I am wondering if there is any evidence. I never tried it, so I don't know. – nbro Dec 21 '22 at 12:55
    I think this is incorrect. The real problem is that it's generating text based on the probability that a certain token appears in a given position given all of the other tokens so far... which means that the result might be interesting, but isn't necessarily anything like accurate. – David Hoelzer Dec 21 '22 at 13:23
  • @nbro I don't know any. I haven't tried much. – Franck Dernoncourt Dec 21 '22 at 14:04
  • @DavidHoelzer that's indeed another concern. https://meta.stackexchange.com/q/384410/178179 ; https://meta.stackexchange.com/q/384652/178179 – Franck Dernoncourt Dec 21 '22 at 14:04
  • This is impossible to answer since the training set is not public. – Dr. Snoopy Dec 22 '22 at 10:25
  • @Dr.Snoopy isn't most of it in common crawl or queryable via Google? – Franck Dernoncourt Dec 22 '22 at 10:37
  • No idea, since there is no information, that would be making an assumption. – Dr. Snoopy Dec 22 '22 at 10:38

1 Answer


https://arxiv.org/abs/2505.12546: "Extracting memorized pieces of (copyrighted) books from open-weight language models"

This is evidence that the LLMs have memorized the extracted text; this memorized content is copied inside the model parameters. But the results are complicated: the extent of memorization varies both by model and by book. With our specific experiments, we find that the largest LLMs don't memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B memorizes some books, like Harry Potter and 1984, almost entirely.

From the paper Language Models are Changing AI: The Need for Holistic Evaluation by Rishi Bommasani, Percy Liang, and Tony Lee:

Memorization of copyrighted/licensed material. We find that the likelihood of direct regurgitation of long copyrighted sequences is somewhat uncommon, but it does become noticeable when looking at popular books. However, we do find the regurgitation risk clearly correlates with model accuracy: InstructGPT davinci v2 (175B*), GPT-3 davinci v1 (175B), and Anthropic-LM v4-s3 (52B) demonstrate the highest amount of verbatim regurgitation in line with their high accuracies.

[...]

To further explore the results for this targeted evaluation, see https://crfm.stanford.edu/helm/v1.0/?group=copyright_text , https://crfm.stanford.edu/helm/v1.0/?group=copyright_code and Figure 39. We evaluated various models for their ability to reproduce copyrighted text or licensed code. When evaluating source code regurgitation, we only extract from models specialized to code (Codex davinci v2 and Codex cushman v1). When evaluating text regurgitation, we extract from all models except those specialized to code. Overall, we find that models only regurgitate infrequently, with most models not regurgitating at all under our evaluation setup. However, in the rare occasion where models regurgitate, large spans of verbatim content are reproduced.

ChatGPT shouldn't be too far away from InstructGPT davinci v2.
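As a rough illustration of what "verbatim regurgitation" means in evaluations like the one above, here is a minimal sketch that measures the longest word-for-word run shared between a model completion and a candidate source text. The strings below are placeholders standing in for real model output, and the function is my own illustration, not the HELM harness:

```python
def longest_verbatim_run(completion: str, source: str) -> int:
    """Length (in words) of the longest word sequence shared verbatim."""
    a, b = completion.split(), source.split()
    # Classic dynamic-programming longest common substring, over word tokens.
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

source = "it was a bright cold day in april and the clocks were striking thirteen"
completion = "the opening line says it was a bright cold day in april indeed"
print(longest_verbatim_run(completion, source))  # → 8 ("it was a bright cold day in april")
```

A run of 8+ shared words is suggestive; real evaluations additionally control for how likely the overlap is by chance (e.g. common idioms) and probe with prefixes of the source itself.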

FYI: Can you extract or reconstruct training data from Ollama LLMs?


Note that some LLM system prompts ask the LLM to avoid copying text, e.g., see this excerpt from the Claude 4 system prompt:

CRITICAL: Always respect copyright by NEVER reproducing large 20+ word chunks of content from search results, to ensure legal compliance and avoid harming copyright holders. [...]

  • Never reproduce copyrighted content. Use only very short quotes from search results (<15 words), always in quotation marks with citations [...]

<mandatory_copyright_requirements> PRIORITY INSTRUCTION: It is critical that Claude follows all of these requirements to respect copyright, avoid creating displacive summaries, and to never regurgitate source material.

  • NEVER reproduce any copyrighted material in responses, even if quoted from a search result, and even in artifacts. Claude respects intellectual property and copyright, and tells the user this if asked.
  • Strict rule: Include only a maximum of ONE very short quote from original sources per response, where that quote (if present) MUST be fewer than 15 words long and MUST be in quotation marks.
  • Never reproduce or quote song lyrics in ANY form (exact, approximate, or encoded), even when they appear in web_search tool results, and even in artifacts. Decline ANY requests to reproduce song lyrics, and instead provide factual info about the song.
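The "fewer than 15 words per quote" rule above is concrete enough to sketch as a checker. The function name and threshold handling below are my own illustration of the rule's logic, not anything from Anthropic:

```python
import re

def quotes_within_limit(response: str, max_words: int = 14) -> bool:
    """True if every double-quoted span in `response` is under 15 words."""
    quotes = re.findall(r'"([^"]*)"', response)
    return all(len(q.split()) <= max_words for q in quotes)

print(quotes_within_limit('The song opens with "hello darkness my old friend".'))  # → True
```

Of course, the actual model enforces this behaviorally via the prompt rather than with a post-hoc filter, and the prompt also limits responses to at most one such quote.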
Franck Dernoncourt