Is it legal to publish hashes of words or phrases from copyrighted articles?

Question

Is it legal to publish a list of cryptographic hashes of words and two- and three-word phrases from an article, in random order? "Lorem ipsum dolor sit amet" would become:

Lorem
Ipsum
Dolor
Sit
Amet
Lorem ipsum
Ipsum dolor
Dolor sit
Sit amet
Lorem ipsum dolor
Ipsum dolor sit
Dolor sit amet

The words and phrases would then be hashed with an algorithm such as MD5, and the hash list would be published in random order. The hashes would be mixed with similar lists from thousands of other articles, with nothing linking a hash to its article, so it would be virtually impossible to recover an article's text, even after recovering the individual words and phrases with a dictionary or brute-force attack.
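For concreteness, here is a minimal Python sketch of the scheme as I understand it, plus the dictionary attack mentioned above. The function names (`ngrams`, `hashed_phrases`, `dictionary_attack`) and the whitespace tokenization are my own assumptions for illustration, not an existing tool.

```python
import hashlib
import random


def ngrams(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def hashed_phrases(text, max_n=3, seed=None):
    """Hash all 1- to max_n-word phrases of text with MD5 and shuffle them.

    Assumption: words are separated by whitespace and hashed exactly as
    written (no case folding, no punctuation stripping).
    """
    words = text.split()
    phrases = []
    for n in range(1, max_n + 1):
        phrases.extend(ngrams(words, n))
    digests = [hashlib.md5(p.encode("utf-8")).hexdigest() for p in phrases]
    # Shuffling discards the original order of the phrases.
    random.Random(seed).shuffle(digests)
    return digests


def dictionary_attack(digests, candidates):
    """Return the candidate phrases whose MD5 digest appears in digests."""
    lookup = {hashlib.md5(c.encode("utf-8")).hexdigest(): c for c in candidates}
    return {lookup[d] for d in digests if d in lookup}


if __name__ == "__main__":
    published = hashed_phrases("Lorem ipsum dolor sit amet")
    print(len(published))  # 5 words + 4 two-word + 3 three-word phrases
    print(dictionary_attack(published, ["Lorem", "ipsum", "zebra"]))
```

The attack recovers which words and phrases occur somewhere in the list, but the shuffled hashes by themselves carry no position information.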

score 1 · Answer 1 · answered Nov 23 '22 at 17:41

It is legal to publish clear-text words and short phrases from an article, because words and short phrases are not protected by copyright. It is therefore legal to publish a transformation of such words and phrases.

You can consider a given novel to be a sequence of words and short phrases. It is not legal to sequentially publish the words and phrases of a protected novel, for example as a web page with a million sequentially numbered chapters each containing a couple of words of original text. It would be legal to alphabetically list all of the words employed in a novel, along with a token count.
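To illustrate what such an alphabetical list with token counts might look like, here is a rough Python sketch. The function name `word_counts` and the tokenization rule (lowercase runs of letters and apostrophes) are assumptions made for the example.

```python
import re
from collections import Counter


def word_counts(text):
    """Return (word, count) pairs sorted alphabetically by word.

    Assumption: a "word" is a run of letters or apostrophes, compared
    case-insensitively. The order of words in the text is discarded.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(words).items())


if __name__ == "__main__":
    for word, count in word_counts("The cat and the hat"):
        print(word, count)
```

The output records which words occur and how often, but nothing about the sequence in which they appeared.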

In order to prove copyright infringement, the plaintiff in the lawsuit would have to prove that the allegedly infringing work has a certain degree of similarity to the protected work. Suppose that the fifth word of the original and supposed copy is "the" – that is not sufficient similarity to prove copying. The question of degree of similarity is a very fact-intensive inquiry.

If a novel is in fact copied and run through the blender to produce unrecognizable word slurry, the plaintiff has the burden of proving the necessary degree of similarity between the original and the copy. This involves exposing the reconstruction procedure. Given any sufficiently large database of words (the word-count list mentioned above), one can also select words from the database so that the selection procedure maximally matches the original text (then you devise a mapping post hoc that "reconstructs" the original text from the slurry). The defense can counter this argument by positing a different algorithm that converts the slurry into Mother Goose rhymes. Whether the plaintiff's argument that there was copying succeeds depends crucially on the uniqueness of the reconstruction.

So, the summary. First, words and phrases are not protected by copyright law. Second, entire works, which contain words and phrases, cannot be copied at all regardless of any transformations you apply. The law is stated in terms of copying, and not in terms of "creating a text that has a particular relation to a protected work". However, the law also has to deal with the known facts, and cannot require time-travel as a means of proving that there was copying. Therefore, the law takes "similarity" as a substitute for direct knowledge of an action in the past. It is therefore possible that a person can get away with actual copying, if the plaintiff cannot prove that the basis of the secondary text must have been the plaintiff's protected text.
