Authors sue Anthropic for allegedly using pirated copyrighted work to train Claude

midian182

What just happened? Anthropic has become the latest artificial intelligence startup to be sued by authors, who claim on this occasion that it used pirated copies of their work to train its AI model, Claude. The three authors in the class-action lawsuit say Anthropic "built a multibillion-dollar business by stealing hundreds of thousands of copyrighted books."

Writers and journalists Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson brought the lawsuit, which seeks class action status.

The suit alleges that Anthropic downloaded known pirated versions of the plaintiffs' work, made copies of them, and used these pirated copies to train its LLMs.

"Anthropic styles itself as a public benefit company, designed to improve humanity. For holders of copyrighted works, however, Anthropic already has wrought mass destruction," the complaint reads. "It is no exaggeration to say that Anthropic's model seeks to profit from strip-mining the human expression and ingenuity behind each one of those works."

The suit also argues that Claude's ability to generate text, particularly "cheap book content," has been made possible by training it on people's writing without their permission or compensation.

"For example, in May 2023, it was reported that a man named Tim Boucher had 'written' 97 books using Anthropic's Claude (as well as OpenAI's ChatGPT) in less than [a] year, and sold them at prices from $1.99 to $5.99. Each book took a mere 'six to eight hours' to 'write' from beginning to end," the complaint states.

"Claude could not generate this kind of long-form content if it were not trained on a large quantity of books, books for which Anthropic paid authors nothing."

Anthropic is alleged to have knowingly used The Pile and Books3 datasets for its training, which incorporate Bibliotik, an alleged "notorious pirated collection." The suit said this allowed Anthropic to avoid paying licensing costs.

AI companies being sued for stealing copyrighted work to train their LLMs has become a common sight. OpenAI and others have also been hit with lawsuits from authors, artists, publishers, music firms, and more.

OpenAI previously claimed it would be impossible to train AI models without using copyrighted content. The leader in the generative AI field has today signed a deal with Condé Nast, allowing ChatGPT to reference stories from The New Yorker, Bon Appetit, Vogue, Vanity Fair, and Wired.

Anthropic was sued by three music publishers last October for infringement of their copyrighted song lyrics. Earlier this week, the company asked a US federal court to dismiss much of the case to focus on whether training AI using copyrighted work falls under fair use, which is something many AI firms and executives have claimed.

AI startups Udio and Suno, which are also being sued by music groups, have admitted to scraping copyrighted tracks but argue that doing so is fair use and that the music they generate doesn't feature samples straight from the original songs.

In June, Anthropic launched its Claude 3.5 Sonnet AI model, claiming it beats GPT-4 Omni in several metrics. The company was founded by former members of OpenAI, and last year received a $4 billion investment from Amazon and $2 billion from Google.

 
As legal scholars have said, though it will ultimately be up to the courts (or legislators) to decide: training on copyrighted works is probably fair use. Inference that generates near-verbatim copies, on the other hand, probably isn't. To draw an analogy from the old-school tech world, it's the difference between making a recording of a copyrighted broadcast TV show and redistributing that recording. The Supreme Court ruled that making the recording is definitely covered under fair use. (https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Universal_City_Studios,_Inc.)
Also, from the Register:
https://www.theregister.com/2024/08/20/anthropic_claude_copyright/
 
Good points - we also have the case of Authors Guild v. Google, where Google was doing mass scanning of copyrighted books (sometimes even pulling them from libraries) to create its book search database, and the courts also ruled that the project fell under fair use.
 
Once an original text has been written by a human author and embodied on a medium by a human being or a machine, then it's automatically copyrighted.

This simply means that the human author has the exclusive but restricted right to make exact copies of that exact text. He doesn't have the (exclusive) right to determine which humans can read that text; nobody has that right, because it would be an insult to the person. Additionally, access to knowledge is a fundamental right (Article 19 of the Universal Declaration of Human Rights (1948) states that the fundamental right of freedom of expression encompasses the freedom “to seek, receive and impart information and ideas through any media and regardless of frontiers”).

So copyright is all about the "exact copies". And because copyright is applied automatically, it also means that everyone learns to write by reading copyrighted texts, it's something very common. People don't pray to God to give them the gift of writing. They simply read texts and identify the abstract patterns in them. With these abstract patterns (or slightly modified versions) in their heads, they produce new texts.

But machines are not legal entities. They do not interact with the law; the law applies only to humans. As machines, they cannot be blamed for making illegal copies and cannot produce copyrighted output on their own (by whom? The CPU? The GPU? The RAM? The neural network numbers?). Hardware and software are not legal entities. Thus, all outputs generated by an artificial intelligence (LLM) on a medium are by definition and by default in the public domain and common property. And presumably any derivative work based on these elements will also be common property, because public domain means that the copyright belongs to the human race as a whole (the "public domain"), which as a whole also has the right to create derivative works, and the author (the LLM) is immortal, so this right will be in perpetuity.
 
Once an original text has been written by a human author and embodied on a medium by a human being or a machine, then it's automatically copyrighted.
Nothing is automatically copyrighted, the work has to be registered.
Additionally, access to knowledge is a fundamental right (Article 19 of the Universal Declaration of Human Rights (1948) states that the fundamental right of freedom of expression encompasses the freedom “to seek, receive and impart information and ideas through any media and regardless of frontiers”).
Yes, unfortunately that Declaration has little to no force. But even if it did, there are no unlimited rights; they all have boundaries beyond which they do not apply.

But machines are not legal entities. They do not interact with the law; the law applies only to humans. As machines, they cannot be blamed for making illegal copies and cannot produce copyrighted output on their own
You forget that it is humans (and also corporations, which the law very much treats as persons) who own and interact with these machines, so the law does indeed interact with them as well. If I were to buy your argument, then copyright would have no meaning. If I want a copyrighted book without paying for it, I'll have a printer make it. If the book is copyrighted, it doesn't matter, per your logic, because the printer is a machine, and thus what it produces is not subject to the restrictions of copyright. The law very much covers how people use technology. Sure, the machine cannot be blamed, but the responsibility of the people who built, maintained, and/or used that machine does not vanish.
 
Copyright law says that when an author's original work is incorporated into a medium, it is automatically protected by copyright. No one checks whether it is a derivative work rather than an original, because there are always influences. It can also be registered in a public database for additional protection by taking timestamps, but it is automatically copyrighted when it is incorporated into a medium. This is 100% certain, check the law with a Google search if you don't believe me.
-
The Universal Declaration of Human Rights has been adopted by the governments of almost all the nations of the world and takes precedence over ordinary laws, much as a constitution does. A law that compromises a fundamental right is at the very least very problematic, if not invalid.
-
Okay, people own machines, but copyright requires authorship. If someone asks an AI to write a poem and he has no idea how to write poems, that doesn't make him the author of the poem the AI has written. If someone knows basic English and writes a prompt to the AI saying "write a novel about a prince and a princess", they cannot have authorship of that novel without even knowing the language and without having written any part of that novel. So when it comes to authorship, it doesn't matter who owns the computer (the numbers that make up a neural network's weights are not on the list of copyrightable formats, so no one has a copyright on them to license; they're in the public domain).

Without a human to hold the authorship, the copyright in an AI's output goes to the human race, because access to knowledge is its fundamental right.
 
Copyright law says that when an author's original work is incorporated into a medium, it is automatically protected by copyright. No one checks whether it is a derivative work rather than an original, because there are always influences. It can also be registered in a public database for additional protection by taking timestamps, but it is automatically copyrighted when it is incorporated into a medium. This is 100% certain, check the law with a Google search if you don't believe me.
I stand corrected - what you say is generally true. There are caveats (not all nations adhere to this), and, at least in the U.S., where automatic copyright was not a thing until 1989 (the year the U.S. joined the Berne Convention), one cannot pursue damages in court without registration. It's actually a shame that copyright is automatic, it would do wonders in clearing the backlog of court cases were copyright registration to be required again, with those who produce series of content able to copyright future (clearly indicated) works in those series without separate registration, of course, and other similar exceptions.

The Universal Declaration of Human Rights has been adopted by the governments of almost all the nations of the world and takes precedence over ordinary laws, much as a constitution does. A law that compromises a fundamental right is at the very least very problematic, if not invalid.
Approved by the General Assembly, yes, but not ratified, except in parts indirectly by treaty. Occasionally it serves as customary law, but it is generally nonbinding.

"The Declaration is generally considered to be a milestone document for its universalist language, which makes no reference to a particular culture, political system, or religion.[4][5] It directly inspired the development of international human rights law, and was the first step in the formulation of the International Bill of Human Rights, which was completed in 1966 and came into force in 1976. Although not legally binding, the contents of the UDHR have been elaborated and incorporated into subsequent international treaties, regional human rights instruments, and national constitutions and legal codes.[6][7][8]

All 193 member states of the United Nations have ratified at least one of the nine binding treaties influenced by the Declaration, with the vast majority ratifying four or more.[1] While there is a wide consensus that the declaration itself is non-binding and not part of customary international law, there is also a consensus in most countries that many of its provisions are part of customary law,[9][10] although courts in some nations have been more restrictive on its legal effect.[11][12] Nevertheless, the UDHR has influenced legal, political, and social developments on both the global and national levels, with its significance partly evidenced by its 530 translations.[13]"

Okay, people own machines, but copyright requires authorship. If someone asks an AI to write a poem and he has no idea how to write poems, that doesn't make him the author of the poem the AI has written. If someone knows basic English and writes a prompt to the AI saying "write a novel about a prince and a princess", they cannot have authorship of that novel without even knowing the language and without having written any part of that novel. So when it comes to authorship, it doesn't matter who owns the computer (the numbers that make up a neural network's weights are not on the list of copyrightable formats, so no one has a copyright on them to license; they're in the public domain).

Without a human to hold the authorship, the copyright in an AI's output goes to the human race, because access to knowledge is its fundamental right.
I agree with some of your points here, but not all. If that "novel about a prince and a princess" is a near verbatim copy of a copyrighted work, that could be infringement, and the courts would likely classify it as such, especially if that were produced as part of a paid-for service to a model (and thus part of a commercial offering, one of the evaluation criteria for fair use). Additionally, if the person takes presumably non-infringing output from a model, edits it, and publishes it, do they not enjoy the automatic copyright that any other work would receive? How much editing has to be done remains an open question, but copyright offices have granted copyright registrations to such works.

You say that there is no human to own the authorship, but these models do not act on their own: they act on a prompt, which can be driven by machines, but can also be driven by humans. Now, when automation gets to the point of machines prompting each other to create works, that's an interesting question. If those machines can be said to be employees, then in the UK their employer would be granted the copyright. Lots of ifs there. But as long as humans are the ones doing the prompting and interactions, I don't expect the law to throw copyright to the wind just because the tool is more advanced than we have been accustomed to in the past.
 
It could be said that the concept of copyright has been a fundamental aspect of our world since we were born. Given the evolution of intellectual property rights, we have become accustomed to the norms established by copyright, which has shaped our understanding of this concept. It could be argued, however, that it is not a particularly useful concept. It is only of benefit to some individuals on an occasional basis for a relatively short period of time (around 2-5 years, after which most works no longer sell well).

Copyright was created at a time when printing presses were first introduced, with the intention of affording writers a degree of control over their work. However, these AIs have already reached a level of intelligence exceeding 150 IQ and are just in their first generation. It seems likely that they will become the dominant force in creative and scientific work in the near future, which could render copyright effectively irrelevant.
 
I stand corrected - what you say is generally true. There are caveats (not all nations adhere to this), and, at least in the U.S., where automatic copyright was not a thing until 1989 (the year the U.S. joined the Berne Convention)
This isn't quite correct. Even prior to Berne, the 1976 Copyright Act gave automatic copyright to works once expressed in tangible form.


It's actually a shame that copyright is automatic, it would do wonders in clearing the backlog of court cases were copyright registration to be required again....
True. It would also solve the growing "orphan works" problem. It'd also be extraordinarily helpful to shorten copyright duration back to its original length or even less -- it's absolutely senseless to allow copyrights to extend for a full century.