Big-name publishers are refusing to let Apple Intelligence train on data

AppleInsider · August 29, 2024 12:52PM

Website owners have a simple mechanism to tell Apple Intelligence not to scrape the site for training purposes, and reportedly major platforms like Facebook and the New York Times are using it.

Apple has been offering publishers millions of dollars for the right to scrape their sites, as opposed to Google, which believes all data should be freely available to train AI large language models. As part of this, Apple honors a system where a site can just say in a particular file that it does not want to be scraped.

That file is a simple text one called robots.txt, and according to Wired, very many major publishers are choosing to use this to block Apple's AI training.
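
As a rough illustration of how such an opt-out reads in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URL, and the `is_blocked` helper are made up for the example; only the `Applebot-Extended` name comes from the reporting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Applebot-Extended, allow everyone else.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt asks user_agent not to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Applebot-Extended", "Applebot"):
        print(agent, "blocked:", is_blocked(ROBOTS_TXT, agent))
```

Note that this only reports what the file requests. Whether a crawler honors the request is up to the crawler.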

This robots.txt file is no technical barrier to scraping, nor even really a legal one, and there are firms that are known to ignore being blocked.

Reportedly, many news sites are blocking Apple Intelligence. Significant ones include:

The New York Times

Facebook

Instagram

Craigslist

Tumblr

Financial Times

The Atlantic

USA Today

Conde Nast

In Apple's case, Wired says that two main studies in the last week have shown that around 6% to 7% of high-traffic websites are blocking Apple's AI-training bot, called Applebot-Extended. Then a further study by Ben Welsh, also undertaken in the last week, says that just over 25% of sites checked are blocking it.

The discrepancy is down to which sets of high-traffic websites were researched. The Welsh study, for comparison, found that OpenAI's bot is blocked by 53% of news sites checked, and Google's equivalent Google-Extended is blocked by almost 43%.
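
To see why the sample matters, here is a hedged sketch of how a survey like this could compute a blocking share from a set of robots.txt files. The four sites and their robots.txt contents are invented purely for illustration and do not come from any of the studies. `GPTBot` is the user-agent name OpenAI publishes for its crawler; the helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_txt: str, user_agent: str) -> bool:
    """True if this robots.txt disallows the site root for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


def block_share(robots_by_site: dict[str, str], user_agent: str) -> float:
    """Percentage of sites in the sample that block user_agent."""
    if not robots_by_site:
        return 0.0
    blocked = sum(blocks_agent(text, user_agent)
                  for text in robots_by_site.values())
    return 100.0 * blocked / len(robots_by_site)


# Invented sample, NOT real survey data.
SAMPLE = {
    "news-a.example": ("User-agent: GPTBot\nDisallow: /\n\n"
                       "User-agent: Applebot-Extended\nDisallow: /\n"),
    "news-b.example": "User-agent: GPTBot\nDisallow: /\n",
    "news-c.example": "User-agent: Google-Extended\nDisallow: /\n",
    "news-d.example": "",  # an empty robots.txt blocks nobody
}

if __name__ == "__main__":
    for agent in ("GPTBot", "Google-Extended", "Applebot-Extended"):
        print(f"{agent}: {block_share(SAMPLE, agent):.0f}% of sample blocks it")
```

Swapping in a different set of sites changes every percentage, which is the effect behind the gap between the 6% to 7% figures and the Welsh figure.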

Wired concludes that while sites might not care whether Apple Intelligence is scraping them, the major reason for low blocking figures is that Apple's AI bot is too little known for firms to notice it.

Yet Apple Intelligence is not exactly hiding in the dark, and Applebot-Extended is a superset of Applebot. Applebot itself was first spotted by sites in November 2014, and officially revealed by Apple in May 2015.

So for ten years, Applebot has been searching and scraping websites, and doing so in order to power Siri and Spotlight searches.

Consequently, it's less likely that website owners haven't heard of Apple Intelligence, and more likely that they have heard of Apple making deals worth millions. While negotiations are continuing, or just conceivably might start, some sites are consciously blocking Apple Intelligence.

That includes The New York Times, which is also suing OpenAI over copyright infringement because of its AI scraping.

"As the law and The Times' own terms of service make clear, scraping or using our content for commercial purposes is prohibited without our prior written permission," says the newspaper's Charlie Stadtlander. "Importantly, copyright law still applies whether or not technical blocking measures are in place."

gatorguy · August 29, 2024 1:03PM

AppleInsider said:

Apple has been offering publishers millions of dollars for the right to scrape their sites, as opposed to Google, which believes all data should be freely available to train AI large language models. As part of this, Apple honors a system where a site can just say in a particular file that it does not want to be scraped.

That file is a simple text one called robots.txt, and according to Wired, very many major publishers are choosing to use this to block Apple's AI training.

From this article's link to a previous AppleInsider article:

"According to The Guardian, Google has presented a case to Australian regulators that it be allowed to do what it wants and, okay, maybe publishers should be able to say no. But that's on the publishers, not Google."

Now substitute Apple for Google in that quote. Isn't that what Apple is doing too, scraping unless the publisher says no?

As for paying, both companies have shown a willingness to do so if the data is important enough. For instance, Google this year alone has signed multimillion-dollar deals for access to training data with both Reddit and Stack Overflow.

Apple, on the other hand, appears to be low-balling potential training data partners, offering less in total than Google is paying Reddit alone, with no evidence yet that any sites are biting.

daalseth · August 29, 2024 1:06PM

Good, all sites should. Not just Apple’s AI, block all of them, and sue them out of existence if they break in.

cesar battistini maziero · August 29, 2024 2:40PM

New York Times is Amazon, and Meta has a competing service.

I bet Meta, OpenAI and Google haven't asked permission to train on everyone's data.

stabitha_christie · August 29, 2024 2:57PM

Cesar Battistini Maziero said:

New York Times is Amazon, and Meta has a competing service.

I bet Meta, OpenAI and Google haven't asked permission to train on everyone’s data.

What are you talking about? Amazon doesn’t own the New York Times. Jeff Bezos, founder and former CEO, owns The Washington Post. The New York Times is owned by the New York Times Company which is a publicly owned company. You managed to get that 100% wrong.

gatorguy · August 29, 2024 2:58PM

Cesar Battistini Maziero said:

New York Times is Amazon, and Meta has a competing service.

I bet Meta, OpenAI and Google haven't asked permission to train on everyone's data.

I'm not as familiar with Meta and OpenAI, but as for Google you would likely lose that bet.

Besides actually paying for particularly valuable training data, Google offers exactly the same mechanisms as Apple does for publishers to opt out. If you believe Google is doing the wrong thing, then so is Apple. The opposite is also true of course. If you're proud of the way Apple is approaching it, then you're OK with Google too.

That said, unlike the multimillion-dollar deals Google has made to license data from private sites, I'm not aware of Apple paying for any private site data for AI training. But they are free-scraping them for it if not blocked from doing so, just like OpenAI and Google will in the absence of a licensing agreement.

EDIT: After Google and Meta signed deals with Shutterstock for training data, I see Apple followed in their wake and came to an agreement with them as well. So that's one.

nubus · August 29, 2024 6:03PM

DAalseth said:

Good, all sites should. Not just Apple’s AI, block all of them, and sue them out of existence if they break in.

Indeed. Tired of this "your robots.txt didn't block our stealing". It doesn't make it legal for Apple or anyone else to take original content and make it into a product.

iadlib · August 29, 2024 6:24PM

This is laughable, given that Facebook and Instagram trained their AI models without explicit permission from content creators.

hexclock · August 29, 2024 6:28PM

I’m glad Apple won’t be training their AI on the New York Times.

forumpost · August 29, 2024 10:22PM

hexclock said:

I’m glad Apple won’t be training their AI on the New York Times.

It should not be training on BBC News content either, or the info will be lopsided.

williamlondon · August 30, 2024 1:04AM

ForumPost said:

hexclock said:

I’m glad Apple won’t be training their AI on the New York Times.

It should not be training on BBC News content either, or the info will be lopsided.

That's the most ridiculous of the trolls commenting in this thread. The BBC sets the bar for not pandering to big monied interests, that alone makes it more honest and authentic than any of its competitors. They set the bar for journalism in the UK, so even the for-profit ones are better than ALL US news outlets.

igorsky · August 30, 2024 2:39AM

Funny how all the sites on the list are all too happy making money off of Apple’s platform.

jimh2 · August 30, 2024 1:47PM

Apple is following the rules by not scraping, though I doubt any company can stop another from scraping the publicly available data posted on its website. If you do not have an account, there are no signed terms of service to agree to, so you can do as you please.

If it becomes an issue you just pay another company to do the scraping for you.

danox · August 30, 2024 5:04PM

I would not be surprised if the true winners in the AI race, performance-wise, turn out to be the companies that highly curate the info they put into their models up front. In short: garbage in, garbage out. I just don’t see the value of scraping a site like Reddit, but we shall see. One thing’s for sure: it’s going to be fun going forward to see who got it right and who got it wrong.

canucklehead · August 31, 2024 9:35PM

williamlondon said:

ForumPost said:

hexclock said:

I’m glad Apple won’t be training their AI on the New York Times.

It should not be training on BBC News content either, or the info will be lopsided.

That's the most ridiculous of the trolls commenting in this thread. The BBC sets the bar for not pandering to big monied interests, that alone makes it more honest and authentic than any of its competitors. They set the bar for journalism in the UK, so even the for-profit ones are better than ALL US news outlets.

I know right? It's like if the flat-earthers wanted equal exposure to tell their story. "All the scientific fact is biased against us! In order to be fair, they should end science education!"

No, the reason your side isn't getting equal consideration is because your side is f'ing NUTS!
