Big-name publishers are refusing to let Apple Intelligence train on data
Website owners have a simple mechanism to tell Apple Intelligence not to scrape the site for training purposes, and reportedly major platforms like Facebook and the New York Times are using it.
Apple has been offering publishers millions of dollars for the right to scrape their sites, as opposed to Google, which believes all data should be freely available to train AI large language models. As part of this, Apple honors a system where a site can just say in a particular file that it does not want to be scraped.
That file is a simple text one called robots.txt, and according to Wired, very many major publishers are choosing to use this to block Apple's AI training.
This robots.txt file is no technical barrier to scraping, nor even really a legal one, and there are firms that are known to ignore being blocked.
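As a minimal sketch of how such an opt-out reads in practice, the following uses Python's standard `urllib.robotparser` to check whether a robots.txt file disallows a given bot. The robots.txt content, the `is_blocked` helper, and the `example.com` URL are illustrative assumptions, not taken from any real site. Note that this only reports what the file asks for; as noted above, nothing technically forces a crawler to comply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; not copied from any real site.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt asks user_agent not to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# The AI-training agent is disallowed, while an unrelated bot falls
# through to the wildcard rule and is allowed.
blocked_extended = is_blocked(ROBOTS_TXT, "Applebot-Extended")
blocked_other = is_blocked(ROBOTS_TXT, "SomeOtherBot")
print(blocked_extended, blocked_other)
```

Python's parser matches user-agent names loosely (by substring), so results for similarly named bots may differ from how a given crawler interprets the same file.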
Reportedly, many news sites are blocking Apple Intelligence. Significant ones include:
- The New York Times
- Craigslist
- Tumblr
- Financial Times
- The Atlantic
- USA Today
- Condé Nast
In Apple's case, Wired says that two main studies in the last week have shown that around 6% to 7% of high-traffic websites are blocking Apple's search tool, called Applebot-Extended. Then a further study by Ben Welsh, also undertaken in the last week, says that just over 25% of sites checked are blocking it.
The discrepancy is down to which sets of high-traffic websites were researched. The Welsh study, for comparison, found that OpenAI's bot is blocked by 53% of news sites checked, and Google's equivalent Google-Extended is blocked by almost 43%.
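To illustrate why the choice of sample matters, here is a rough sketch of how a block rate could be tallied over two different sets of sites. The site names, robots.txt contents, and resulting percentages are all made up for the example and do not reproduce any of the studies mentioned; `blocks` and `block_rate` are hypothetical helpers.

```python
from urllib.robotparser import RobotFileParser

AGENT = "Applebot-Extended"

# Made-up robots.txt bodies: one that blocks the agent, one that does not.
BLOCKING = "User-agent: Applebot-Extended\nDisallow: /\n"
OPEN = "User-agent: *\nAllow: /\n"


def blocks(robots_txt: str, agent: str) -> bool:
    """Return True if robots_txt disallows agent from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "https://example.com/")


def block_rate(sample: dict, agent: str) -> float:
    """Percentage of sites in sample whose robots.txt blocks agent."""
    blocked = sum(blocks(text, agent) for text in sample.values())
    return 100.0 * blocked / len(sample)


# Two hypothetical samples: a broad high-traffic set and a news-only set.
general_sample = {
    "site-a.example": BLOCKING, "site-b.example": OPEN,
    "site-c.example": OPEN, "site-d.example": OPEN, "site-e.example": OPEN,
}
news_sample = {
    "news-a.example": BLOCKING, "news-b.example": BLOCKING,
    "news-c.example": BLOCKING, "news-d.example": OPEN, "news-e.example": OPEN,
}

general_rate = block_rate(general_sample, AGENT)
news_rate = block_rate(news_sample, AGENT)
print(f"general: {general_rate:.0f}%  news: {news_rate:.0f}%")
```

The same counting method gives very different figures depending on which sites are in the sample, which is the point the studies' differing numbers reflect.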
Wired concludes that while sites might not care whether Apple Intelligence is scraping them, the major reason for low blocking figures is that Apple's AI bot is too little known for firms to notice it.
Yet Apple Intelligence is not exactly hiding in the dark, and AppleBot-Extended is a superset of AppleBot. AppleBot itself was first spotted by sites in November 2014, and officially revealed by Apple in May 2015.
So for ten years, AppleBot has been searching and scraping websites, and doing so in order to power Siri and Spotlight searches.
Consequently, it's less likely that website owners haven't heard of Apple Intelligence, and more likely that they have heard of Apple making deals worth millions. While negotiations are continuing, or just conceivably might start, some sites are consciously blocking Apple Intelligence.
That includes The New York Times, which is also suing OpenAI over copyright infringement because of its AI scraping.
"As the law and The Times' own terms of service make clear, scraping or using our content for commercial purposes is prohibited without our prior written permission," says the newspaper's Charlie Stadtlander. "Importantly, copyright law still applies whether or not technical blocking measures are in place."
Comments
"According to The Guardian, Google has presented a case to Australian regulators that it be allowed to do what it wants and, okay, maybe publishers should be able to say no. But that's on the publishers, not Google."
Now substitute Apple for Google in that quote. Isn't that what Apple is doing too, scraping unless the publisher says no?
As for paying, both companies have shown a willingness to do so if the data is important enough. For instance, Google this year alone has signed multi-million-dollar deals for access to training data with both Reddit and Stack Overflow.
Apple on the other hand appears to be low-balling potential training data partners, offering less in total than Google is paying Reddit alone, with no evidence yet that any sites are biting.
I bet Meta, OpenAI and Google haven't asked permission to train on everyone's data.
Besides actually paying for particularly valuable training data, Google offers exactly the same mechanisms as Apple does for publishers to opt out. If you believe Google is doing the wrong thing, then so is Apple. The opposite is also true of course. If you're proud of the way Apple is approaching it, then you're OK with Google too.
That said, unlike the $multi-million deals Google has made to license data from private sites, I'm not aware of Apple paying for any private site data for AI training. But they are free-scraping them for it if not blocked from doing so, just like OpenAI and Google will in the absence of a licensing agreement.
EDIT: After Google and Meta signed deals with Shutterstock for training data, I see Apple followed in their wake and came to an agreement with them as well. So that's one.
If it becomes an issue you just pay another company to do the scraping for you.
No, the reason your side isn't getting equal consideration is because your side is f'ing NUTS!