Apple insists its AI training is ethical and respects publishers
In a new research paper, Apple doubles down on its claim of not training its Apple Intelligence models on anything scraped illegally from the web.

Apple Intelligence -- image credit: Apple
It's a fair bet that artificial intelligence systems have been scraping every part of the web they can access, whether or not they should. In 2023, both OpenAI and Microsoft were sued by The New York Times for copyright infringement, and that was far from the only such suit.
In contrast, also in 2023, Apple was reported to have attempted to buy the rights to train its large language models (LLMs) on work from publishers including Condé Nast and NBC News. Apple was said to have offered publishers millions of dollars, although it was not clear at the time which, if any, had agreed.
Now in a newly published research paper, Apple says that if a publisher does not agree to its data being scraped for training, Apple won't scrape it.
Apple details its ethics
"We believe in training our models using diverse and high-quality data," says Apple. "This includes data that we've licensed from publishers, curated from publicly available or open-sourced datasets, and publicly available information crawled by our web-crawler, Applebot."
"We do not use our users' private personal data or user interactions when training our foundation models, it continues. "Additionally, we take steps to apply filters to remove certain categories of personally identifiable information and to exclude profanity and unsafe material. "
Most of the research paper is concerned with how Apple goes about doing this scraping, and specifically how its internal Applebot system ensures getting useful information despite "the noisy nature of the web." But it does return to the overall issues regarding copyright, and each time insists that Apple is respecting rights holders.
"[We] continue to follow best practices for ethical web crawling, including following widely-adopted robots. txt protocols to allow web publishers to opt out of their content being used to train Apple's generative foundation models," says Apple. "Web publishers have fine-grained controls over which pages Applebot can see and how they are used while still appearing in search results within Siri and Spotlight."
The "fine-grained controls" appear to be based around the long-standing robots.txt system. That is not any kind of standard privacy system, but it is widely adopted and involves publishers including a text file called robots.txt on their sites.

ChatGPT logo - image credit: OpenAI
If an AI system sees that file, it is supposed not to scrape the site, or the specific pages the file lists. It's as simple as that.
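To make that concrete, here is a sketch of what such a file might contain. It is an illustrative example for a hypothetical site, not any real publisher's file; Apple's documentation describes a separate Applebot-Extended agent that controls whether crawled data may be used for model training, and the sketch assumes a publisher using it:

    # Let Applebot index the site for Siri and Spotlight search results
    User-agent: Applebot
    Disallow:

    # But opt the whole site out of Apple's generative AI training
    User-agent: Applebot-Extended
    Disallow: /

    # And keep all crawlers out of the paid archive
    User-agent: *
    Disallow: /archive/

In this scheme, the plain Applebot rules decide which pages get fetched at all, while the Applebot-Extended rules only govern how the fetched data may be used -- which is what gives publishers the search-without-training option Apple describes.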
What companies say and what they do
It's easy to say that a company's AI systems will respect robots.txt, and OpenAI implies -- but only implies -- that it does too.
"Decades ago, the robots.txt standard was introduced and voluntarily adopted by the Internet ecosystem for web publishers to indicate what portions of websites web crawlers could access," said OpenAI in a May 2024 blog post called "Our approach to data and AI."
"Last summer," it continued, "OpenAI pioneered the use of web crawler permissions for AI, enabling web publishers to express their preferences about the use of their content in AI. We take these signals into account each time we train a new model."
Even that last part about taking signals into account is not the same as saying OpenAI respects them. And while that key paragraph about signals directly follows the one about robots.txt, it never explicitly says that OpenAI pays the file any attention.
And seemingly a great many AI companies do not adhere to any robots.txt instructions. Market analysis firm TollBit said that in March 2025, there were over 26 million disallowed scrapes where AI firms ignored robots.txt entirely.
The same firm also reports that the proportion is rising: in Q4 2024, 3.3% of AI scrapes ignored robots.txt, while by Q1 2025 it was around 13%.
While TollBit does not speculate on the reasons for this, it's likely that the entire available internet has already been scraped. So the companies are pressing on, and in June 2025, a US District Court said they could.
Robots.txt is more than a simple no
When a well-behaved crawler visits a website, it identifies itself with a user-agent string. So when Google scrapes a site, the site registers that Googlebot is accessing it, and the crawler consults the site's robots.txt file for a comprehensive list of permissions.
That list spells out which sections of the site the bot is not allowed to access. When Apple's crawler, Applebot, was revealed in 2015, Apple said that if a site's robots.txt did not mention Applebot, the crawler would follow any rules the site had set out for Googlebot.
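For a sense of how that permission check looks in practice, here is a minimal Python sketch using the standard library's robots.txt parser -- illustrative only, against a hypothetical site, and not Apple's actual implementation:

    # Minimal sketch: check robots.txt before fetching a page (illustrative only)
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.set_url("https://example.com/robots.txt")  # hypothetical site
    parser.read()  # download and parse the site's robots.txt

    page = "https://example.com/articles/latest.html"
    for agent in ("Applebot", "Googlebot"):
        # can_fetch() applies the rules for the named agent, falling back
        # to the wildcard "*" group when that agent has no rules of its own
        print(agent, "may fetch:", parser.can_fetch(agent, page))

One caveat: Python's parser falls back to the wildcard group for an unlisted agent, not to Googlebot's rules, so Apple's stated Applebot-to-Googlebot fallback is its own behavior that a crawler would have to implement itself.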
The BBC said in 2023 that "we have taken steps to prevent web crawlers like those from OpenAI and Common Crawl from accessing BBC websites." Around the same time, a study of 1,156 news publishers found that 626 had blocked AI scraping, including scraping by OpenAI and Google AI.

A court case against Anthropic has concluded that AI can train on any material
But a company can change the name of its scraping tool, or it can simply ignore blocks -- or at least be accused of doing so.
Perplexity.ai -- which Apple is repeatedly rumored to be buying -- marketed itself as an ethical AI as well, with a detailed blog post about why ethics are so necessary.
But that post was published in November 2024, and in June of that year, Forbes had threatened Perplexity over it having scraped its content anyway. Perplexity CEO Aravind Srinivas later admitted that its search and scraping had some "rough edges."
Apple stands out in AI
Unless Apple's claims about ethical AI training are challenged legally, as Forbes at least started to do with Perplexity.ai, we may never know whether they are true.
But OpenAI has been sued over this, Microsoft has too, and Perplexity has been publicly called out. So far, no one has claimed that Apple has done anything unethical.
That's not the same thing as publishers being happy about any firm training LLMs on their data, but so far, Apple may be the only one doing it all legally.
Comments
It's quite an innovative launch when you're not viewing it from the "AI will take over the world and replace all jobs" nonsense point of view. Grounded here in reality, Apple Intelligence is quite useful for the everyday consumer. Does Apple need a lying chatbot?
With ChatGPT I can take a photo of something and ask for assistance or directions, and that works great, too. That isn't something Apple Intelligence can do yet, though maybe it will in the future. But ChatGPT couldn't have done what I did with that phone call and kept it all private.
I was able to utilize this recently by taking a photo of a gift my friend received during a baby shower. Visual Intelligence let me do a quick reverse image search via Google and brought up the product in question. All private, with none of the data being used by Google or others. That's the innovation Apple is providing via its approach to accessible, on-device features and call outs to third-party AI.
No one else is doing this. And this feels way more useful than a Hatsune Miku I can have an affair with.
I'm building custom scripts for trading, looking forward to building apps for personal use, going to be using an AI agent to do my online shopping for me, completing tasks like building a Google sheet that will constantly scan the market and build me a watchlist based on my criteria, etc. I mean I'm really getting into learning all I can and I hope to one day get rid of Google Home and replace it with a custom built ChatGPT agent.
Apple is nowhere on this level and I honestly don't see them ever getting close! Apple piggybacking off ChatGPT is not them having a product in the field and I'm sure nobody in this field considers Apple competition. Apple Intelligence by itself is a nothing burger, and Apple writing some research paper about how ethical their non-existent product is means nothing, they need to get in the game.
This is so frustrating because instead of letting you replace Siri and Apple "Intelligence" with something more useful, we are stuck waiting on Apple to actually do something and not just talk about it
If you've been paying attention, Apple is building the framework for any and every task to run via Apple Intelligence. The on-device model being private and secure is an amazing boon, and soon developers will be able to use it for any general task that they can target the AI at. We'll see what people do and what the limitations are after the public release, but it sounds promising.
Again, "I don't want to use these tools" doesn't translate to "non-existent product." I don't want to let ChatGPT spend my money, but that doesn't mean that the feature doesn't exist. Apple Intelligence provides useful features and will only improve going forward. App Intents and contextual AI with Siri will be a game changer too, once it launches.
I use Apple Intelligence every day, but I don't use ChatGPT except in the rarest of occasions, and even then, I use it through Siri. Should I be declaring ChatGPT a useless AI tool? No, because that's silly.
AI chatbots, ChatGPT, etc are glorified features, not standalone products. They're a technology without a home. I wouldn't bet against Apple here.
Apple has been employing various levels of machine-learning tech for decades. The keyboard on the very first iPhone employed it ... 18 years ago.
It's only been this latest "GenAI" craze where they were caught with their pants down, but they still make money selling devices to consumers that access competing GenAI. So they win either way.
William, I appreciate your reporting in this area, but you have misinterpreted the earlier article about the Anthropic case. The Court was very clear that there is a difference between using materials acquired legally and those that were not. Also that case had nothing to do with web scraping; it was entirely about "pirated" books and books purchased legally.
The Court's conclusion specifically addressed Anthropic's purported "fair use" of the pirated books.
Now, ignoring a robots.txt file and scraping publicly available web pages is not the same as downloading pirated books. That would be a legal question about terms of use, presumably, and that was not at issue in that case. So, maybe a court will rule that ignoring robots.txt files is fine, but the case cited does not say that.