Court says AI training on copyrighted material is legal
A ruling in a U.S. District Court has effectively given permission to train artificial intelligence models using copyrighted works, in a decision that's extremely problematic for creative industries.

[Image: Anthropic logo on top of coding and court imagery]
Content creators and artists have been suffering for years, with AI companies scraping their sites and scanning books to train large language models (LLMs) without permission. That data is then used for generative AI and other machine learning tasks, and monetized by the scraping company with no compensation for the original host or author.
Following a ruling issued on Tuesday by the U.S. District Court for the Northern District of California, companies have effectively been given free rein to train on just about any published media they want to harvest.
The ruling is based on a lawsuit from Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson against Anthropic dating back to 2024. The suit accused the company of using pirated material to train its Claude AI models.
This included Anthropic creating digital copies of printed books for AI model training.
The ruling from Judge William Alsup -- a judge very familiar to readers of AppleInsider -- finds in favor of each side in various ways. However, the weight of the decision certainly sides with Anthropic and AI scrapers in this instance.
Under the ruling, Judge Alsup says that the use of copies to train specific LLMs was justifiable as fair use.
"The technology at issue was among the most transformative many of us will see in our lifetimes," Alsup commented.
Converting physical copies from a print library into a digital library was also deemed fair use. Furthermore, using that content to train LLMs was fair use as well.
Alsup compared the authors' complaint to making the same argument against an effort to train schoolchildren how to write well. It's not clear how that applies, given that artificial intelligence models are not considered "schoolchildren" in any legal sense.
In that argument, Alsup ruled that the Copyright Act is intended to advance original works of authorship, not to "protect authors against competition."
Where the authors saw a small amount of success was in the use of pirated works. Creating a library of pirated digital books, even if they are not used for the training of a model, does not constitute fair use.
That remains the case even if Anthropic later bought a copy of a book after pirating it in the first place.
On the matter of the piracy argument, the court will be holding a trial to determine damages against Anthropic.
In May, it was reported that Apple was working with Anthropic to integrate the Claude Sonnet model into a new AI-powered version of Xcode, to help reshape developer workflows.
Bad news for content makers
The ruling is terrible for artists, musicians, and writers. Other professions where machine learning models could be a danger to livelihoods will have issues too -- like judges who say they once took a coding class, and therefore know what they're talking about with tech.
AI models take advantage of the hard work and life experiences of media creators, and pass the results off as their own. At the same time, the situation leaves content producers with few options to combat the phenomenon.
As it stands, the ruling will clearly be cited as precedent in other lawsuits in the AI space, especially those dealing with producers of original works that are pillaged for training purposes.
Over the years, AI companies have been attacked for grabbing any data they could to feed their LLMs, even content scraped from the Internet without permission.
This is a problem that manifests in quite a few ways. The most obvious is in generative AI, as the models could be trained to create images in specific styles, which devalues the work of actual artists.
An example of a fightback is the lawsuit from Disney and Universal against Midjourney, which surfaced in early June. The company behind the AI image generator is accused of mass copyright infringement for training its models on images of the most recognizable characters from the studios.
The studios unite in calling Midjourney "a bottomless pit of plagiarism," built on the unauthorized use of protected material.
When you have two major media companies that are usually bitter rivals uniting for a single cause, you know it's a serious issue.
It's also a growing issue for websites and publishers, like AppleInsider. Instead of using a search tool and viewing websites for information, a user can simply ask an AI model for a customized summary, without needing to visit the site the information was sourced from in the first place.
And, that information is often wrong, combined with data from other sources in ways that pollute the original meaning of the content. For instance, we've seen our tips on how to do something plagiarized, with sections reproduced verbatim and mashed up out of order with content from other sites, producing a procedure that doesn't work.
The question of how to compensate publishers for lost revenue has not yet been answered in a meaningful way. Some companies have been trying to stay on the more ethical side of things, with Apple among them.
Apple has offered news publishers millions to license content, for training its generative AI. It has also paid for licenses from Shutterstock, which helped develop its visual engines used for Apple Intelligence features.
Major publishers have also taken to blocking AI services from accessing their archives, doing so via robots.txt. However, this only stops ethical scrapers, not everyone. And, scraping an entire site takes server power and bandwidth -- which is not free for the hosting site that's getting scraped.
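As a rough illustration, a minimal robots.txt that blocks a few of the commonly published AI crawler user-agents might look like the following. GPTBot is OpenAI's documented crawler, ClaudeBot is Anthropic's, and CCBot is Common Crawl's, though the list of bots changes over time and honoring these rules is entirely voluntary:

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

A crawler that chooses to ignore robots.txt sails straight past these directives, which is exactly the problem described above.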
The ruling also follows an increase in efforts from major tech companies to lobby for a decade-long block on U.S. states introducing AI regulation.
Meanwhile in the EU, there have been attempts to sign tech companies up to an AI Pact, to develop AI in safe ways. Apple is apparently not involved in either effort.
Comments
Somehow you’ve convinced yourself that AI is sentient and learning from influences. It’s not. It has a database of pirated data that it uses to essentially copy/paste responses from.
If I wrote a book called “Blue Eggs and Spam” and charged people for it, you better believe I’d be sued. It shouldn’t be any different when AI companies do it.
But you’re on the right track on one thing: we should definitely start beaming Dr. Seuss books to the stars.
These programs consist of all this accumulated information scraped from wherever it can be scraped, combined with sufficient computational power to brute force a most-probable sequence of words in response to a submitted query. There is no reasoning or thinking or even learning in the human sense involved.
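To make that concrete, here's a toy sketch of that "most-probable sequence of words" loop, with a simple bigram lookup table standing in for the neural network. Real LLMs predict subword tokens using vastly more context and compute, but the basic loop -- repeatedly emitting the likeliest continuation of the text so far -- is the same idea:

    # Toy illustration only: a bigram table stands in for the model.
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat and the cat slept".split()

    # Count which word most often follows each word in the "training" text.
    following = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        following[prev][nxt] += 1

    def generate(start, length=5):
        words = [start]
        for _ in range(length):
            options = following.get(words[-1])
            if not options:
                break  # no observed continuation, so stop
            # Greedily pick the most frequent next word.
            words.append(options.most_common(1)[0][0])
        return " ".join(words)

    print(generate("the"))  # prints: the cat sat on the cat

Note that the output is nothing but recombined fragments of the input -- there's no understanding anywhere in the loop.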
Had I submitted that green eggs and ham query to human writers, many would simply tell me Dr. Seuss had already written that. Some more creative people might think about it and do a mash-up, rewriting, say, Horton Hears a Who, but changing the story to be about green eggs and ham. Someone else might actually write an entirely original story about green eggs and ham, using a fresh helping of nonsense along with Seuss's characteristic rhyme and meter conventions.
The LLM AI, however, doesn't think at all, but rather spits out collages made from other people's work. A middle school or high school student has absorbed a tiny fraction of the information indexed by an LLM, has received a tiny fraction of its programming (e.g. classroom instruction), and will apply a tiny fraction of the computational power used by AI when producing a written paper in response to an assignment. And yet an average or better student will, without committing plagiarism, produce a better written, more accurate, less hallucinatory paper than AI will.
AI does not learn; it scrapes and indexes. AI does not think or create; it regurgitates.
There are good points on both sides of the training question. On one hand, AI programs are being trained based on the hard work of previous human artists. The AI companies are profiting, but the original artists get nothing.
On the other hand, the AI is not doing anything new. It's common for individuals to study the work of others, and to use that study to inform their own work. When interviewed, great directors often discuss how they have studied the works of other great directors to learn their techniques and style. The AI programs are simply really good at this.
My understanding is that an art student can study the works of a current artist and produce new works in that style. I don't believe an artist's style is protectable by copyright. What an artist can't do is produce work that is essentially a copy of an existing copyrighted work, or that contains copyrighted elements (including copyrighted characters). An artist also has to be careful that work done in someone else's style is not represented as being that artist's work. If I were to write a book in the style of Dr. Seuss, I would need to make it very clear that the book was *not* a work by Dr. Seuss.
An issue with current AI is that it doesn't understand the limitations of copyright law, and it can sometimes produce results that would typically be considered copyright infringement.
Disclaimer: I am not an attorney, and this is not legal advice. It is merely my imperfect understanding of some of the issues.
This is a common challenge with new technology. In the past, certain activities were limited by the technology of the time, so they never rose to the level of being a common issue. As technology improves, so do various abilities.
For instance, 50 years ago we didn't really need laws governing the ability of private companies to track people. If they wanted to track someone, they hired a private investigator, who would follow the person of interest. If you wanted to track 50 people, you would need 50 private investigators. The available technology limited the collection of tracking data. If a company wanted to track someone and sell that information, they could. It just wasn't a common thing.
Up until now, the courts have been the entity that decides whether a "work" that has been "used for profit" has "infringed" on someone else's work. That's a perfectly valid system going forward. AI doesn't change anything here. If anyone uses AI to write a plagiarized work, then the people who benefit from that plagiarism should be suable. But we shouldn't stop AI from creating fair use derivatives of other people's work, just as you shouldn't be sued for writing a song that sounds vaguely similar to an ABBA song. If you can take advantage of "fair use", then so can other people who use AI for the same thing. After all, half the videos on YouTube take advantage of fair use laws, by using someone else's video or audio.