https://meta.stackexchange.com/q/388551/178179 mentions that SE will require some firms to pay for permission to train an AI model on the SE data dump (CC BY-SA licensed) and make commercial use of it without distributing the model under CC BY-SA.

This makes me wonder: Is it illegal for a firm to train an AI model on a CC BY-SA 4.0 corpus and make commercial use of it without distributing the model under CC BY-SA?

I found https://creativecommons.org/2021/03/04/should-cc-licensed-content-be-used-to-train-ai-it-depends/:

At CC, we believe that, as a matter of copyright law, the use of works to train AI should be considered non-infringing by default, assuming that access to the copyright works was lawful at the point of input.

Is that belief correct?

Regarding the share-alike clause in CC licenses specifically: from my understanding of https://creativecommons.org/faq/#artificial-intelligence-and-cc-licenses, it is legal for a firm to train an AI model on a CC BY-SA 4.0 corpus and make commercial use of it without distributing the model under CC BY-SA, except perhaps when the output is shared. This raises two further questions: Is the output of an LLM considered an adaptation or derivative work under copyright? Does the "output" in the flowchart below mean LLM output in the case of a trained LLM?

[Image: flowchart (not reproduced here)]

Franck Dernoncourt

2 Answers

The flowchart included in the question is trying to summarize a rather large amount of legal uncertainty into one image. It must be emphasized that each decision point represents an unsettled area of law. Nobody knows which path through that flowchart the law will take, or even if different forms or implementations of AI might take different paths. The short and disappointing answer to your question is that nobody knows what is or isn't legal yet.

To further elaborate on each decision point:

  • The first point is asking whether the training process requires a license at all. There are two possible reasons to think that it does not:
    • AI training is protected by fair use (see 17 USC 107). This is a case-by-case inquiry that would have to be decided by a judge.
    • AI training is nothing more than the collection of statistical information relating to a work, and does not involve "copying" the work within the meaning of 17 USC 106 (except for a de minimis period which is similar to the caching done by a web browser, and therefore subject to a fair use defense).
  • The second point is, I think, asking whether the model is subject to copyright protection under Feist v. Rural and related caselaw. Because the model is trained by a purely automated process, there's a case to be made that the model is not the product of human creativity, and is therefore unprotected by copyright altogether.
    • Dicta in Feist suggest that the person or entity directing the training might be able to obtain a "thin" copyright in the "selection or organization" of training data, but no court has ever addressed this to my knowledge.
    • This branch can also be read as asking whether the output of the model is copyrightable, when the model is run with some prompt or input. The Copyright Office seems to think the answer to that question is "no, because a human didn't create it."
  • The third decision point is, uniquely, not a legal question, but a practical question: Do you intend to distribute anything, or are you just using it for your own private entertainment? This determines whether you need to consult the rest of the flowchart or not.
  • The final decision point is whether the "output" (i.e. either the model itself, or its output) is a derivative work of the training input.
    • This would likely be decided on the basis of substantial similarity, which is a rather complicated area of law. To grossly oversimplify, the trier of fact would be shown both the training input and the allegedly infringing output, and asked to determine whether the two items have enough copyrightable elements in common that copying can reasonably be inferred.
Kevin

Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, Plaintiffs, v. Anthropic PBC, Defendant. No. C 24-05417:

This order grants summary judgment for Anthropic that the training use [pirated books to train LLMs] was a fair use. And, it grants that the print-to-digital format change was a fair use for a different reason. But it denies summary judgment for Anthropic that the pirated library copies must be treated as training copies.

Summary: training is OK, piracy is not. Since CC BY-SA 4.0 permits free copying, piracy is not an issue for data licensed under CC BY-SA 4.0.

Franck Dernoncourt