
Example: the Stack Exchange data dump has a new access agreement, which states:

I understand that this file is being provided to me for my own use and for projects that do not include training a large language model (LLM), and that should I distribute this file for the purpose of LLM training, Stack Overflow reserves the right to decline to allow me access to future downloads of this data dump.

Does that condition ("for my own use and for projects that do not include training a large language model (LLM), and that should I distribute this file for the purpose of LLM training") apply only to the file, or does it also apply to the content itself? In other words, if one changes the content format and creates a new file, is one allowed to train LLMs on the content, or share the content? Note that the content itself is under CC BY-SA (at least, if accessed by browsing the website).

For additional context, consider that according to the site Terms of Service

You agree that any and all content, including without limitation any and all text, graphics, logos, tools, photographs, images, illustrations, software or source code, audio and video, animations, and product feedback (collectively, “Content”) that you provide to the public Network (collectively, “Subscriber Content”), is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis pursuant to Creative Commons licensing terms (CC BY-SA 4.0), and you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit such Subscriber Content, even if such Subscriber Content has been contributed and subsequently removed by you as reasonably necessary to

all "Subscriber Content" is licensed under two different licenses:

  • a version of the Creative Commons license (CC BY-SA 4.0 for newer content, previous versions for older posts)
  • a secondary license that Stack Exchange grants to itself.

Based on the Creative Commons FAQ page, I find it hard to understand exactly what the company can or can't do to impose additional artificial restrictions on the distribution of the "Subscriber Content". More specifically, I don't understand whether the company can claim to be distributing the dumps under its own license - thus being freed from any restriction the CC license would impose on it - while the actual content is still under the Creative Commons license.

feetwet
Franck Dernoncourt

3 Answers


The complete sentence is relevant. It says:

[I understand that if I do X,] Stack Overflow reserves the right to decline to allow me access to future downloads of this data dump.

This term does not actually create conditions or put any obligations on the user of the data dump. It puts people on notice about how Stack Overflow might exercise its discretion in the future. In the context of the example you provided, X can be anything and it would not change the substantive rights of the parties, so the interpretation is a red herring.

X could "apply to" "this file", or X could "apply to" "the contents of this file"; neither interpretation would alter the meaning of this clause.[1]

Absent contractual or statutory obligations to the contrary, any provider of a file always reserves the right to decline others access to a file that the provider itself creates, even if they don't say so. If an entity creates a file, they can provide that file to anyone or no one.

This is a point recognized by Xander Henderson in two comments in the meta Q&A: 1, 2.


1. I have put "apply to" in scare quotes because in the context of the particular example, where the so-called condition is meaningless, it is likewise meaningless to consider whether the condition "applies" to something. In a different context, whether an obligation relates only to a particular file or whether it applies to the contents of the file will depend on all the normal principles of contractual interpretation, where the words are interpreted in their full context, including the surrounding language, purpose of the contract, and the background information that should have been known to the parties at the time of contractual formation. See generally Sattva Capital Corp. v. Creston Moly Corp., 2014 SCC 53 and several other Q&As on this site explaining this (1, 2, 3). Given that the full context includes a statutory copyright regime and a simultaneous open-source licence, unlike phoog, I cannot conclude that the phrase in question would "of course" have any particular meaning.

Jen

does the condition also apply to its content itself too?

Of course it does.

In other words, if one changes the content format and creates a new file, is one allowed to train LLMs on the content, or share the content?

Of course not.

One can make use of the "reasonable person" test here: any reasonable person would recognize that Stack Overflow isn't interested in prohibiting only the use of the file itself but rather of the data it contains, so reformatting the data and using the resulting different file for LLM training is something that a reasonable person should expect would lead to Stack Overflow declining future access. A related way of looking at this is to conclude that reformatting the data for LLM training is close enough to "redistributing the file for LLM training" to fall within the meaning of the access agreement.

phoog

You seem to be proposing that you download file A, and use that file to create file B, and then use file B to train an LLM, and that this was always your intention in downloading file A.

I am not a lawyer, but I doubt that a court would have much difficulty in following the chain of events here, and concluding that in fact you did "use file A to train the model". If it were ever relevant in a court case, that is.

For an analogy: if you used OCR to scan a book, and then used the resulting text to train a model, then I don't think anyone would have any difficulty concluding that you used the book to train the model, even though the model itself interacts only with digitised text, not with paper. It's just the English meaning of the word "use", which can include using something indirectly.

What the consequences would be is another matter: you have obtained access to the download by ticking a box stating that you understand the purposes for which Stack Overflow chooses to make the download available. You have also acknowledged your understanding that Stack Overflow might choose not to make it available to you in future, if you distribute it for the purposes of LLM training.

I am still not a lawyer, but I don't see that this should be construed as you unlawfully accessing a computer system, or any such crime. You merely acknowledge you are aware of SO's purpose in making it available (which is different from your purpose in downloading it), and that they might block you in future. The tickybox doesn't say that you agree not to use it to train an LLM, which it easily could have said, had SO wanted you to make that commitment. It also doesn't even say that they'll block you for using it to train an LLM, only that they "reserve the right" to block you for distributing it for the purpose of training an LLM. This is a right that they have, so "reserving" that right asks nothing at all of you: it merely makes absolutely clear that they are not relinquishing that right. That is to say, they are not promising that it will always be available to you regardless of you distributing it for the purpose of training LLMs.

So, if you are training a tiny LLM on your own PC, and using the SO download for the purpose, you haven't been threatened with anything. You haven't distributed the file. However, you are acting against SO's stated purpose in making the file available to you to download, and so you might choose to respect SO's wishes by not doing this.

As you rightly point out, the content is CC BY-SA. If you obtain the same content, or part of it, by some means other than the download, that doesn't have any conditions attached, then the copyright in the content would be all that's relevant. The tickybox is for access to a specific service (downloading that content from SO's server in a big file), nothing to do with copyright licensing. Of course, CC BY-SA is potentially quite an onerous license, if you intend to create a derived work of the entire archive, with hundreds of thousands of authors.

Consider also the text they were originally going to use, and decided not to:

I agree that I will use this file for non-commercial use. I will not use it for any other purpose, and I will not transfer it to others without permission from Stack Overflow. I certify that I am not downloading this file on behalf of my employer, for use in a for-profit enterprise. I have read and agree to the Terms of Service and the Acceptable Use Policy, and have read and understand Stack Overflow's privacy notice.

https://meta.stackexchange.com/questions/401324/announcing-a-change-to-the-data-dump-process

So, all of that is what the new text is not saying. You are not promising to make non-commercial use of it. (So SO can't sue you if they somehow suffer damages as a result of you not using it at all, because in the new text you aren't promising to use it; in the old text you made that solemn promise.) And so on. I guess the current text is probably less of a tire-fire than the first version was ;-)

Steve Jessop