On-device Apple Intelligence training seems to be based on controversial technology
On Monday, Apple shared its plans to let users opt into on-device Apple Intelligence training that relies on Differential Privacy techniques bearing a strong resemblance to its abandoned CSAM detection system.

Apple Intelligence to be trained on anonymized user data on an opt-in basis
Differential Privacy is a concept Apple embraced openly in 2016 with iOS 10. It is a privacy-preserving method of data collection that introduces noise to sample data to prevent the data collectors from figuring out where the data came from.
According to a post on Apple's machine learning blog, Apple is working to implement Differential Privacy as a method to gather user data to train Apple Intelligence. The data is provided on an opt-in basis, anonymously, and in a way that can't be traced back to an individual user.
The story was first covered by Bloomberg, which detailed Apple's report on generating synthetic data that is refined against real-world user information. However, it isn't as simple as grabbing user data off of an iPhone to analyze in a server farm.
Instead, Apple will utilize a technique called Differential Privacy, which, if you've forgotten, is a system designed to introduce noise to data collection so individual data points cannot be traced back to the source. Apple takes it a step further by leaving user data on the device -- polling only for matches and collecting nothing but the poll results.
These methods ensure that Apple's principles behind privacy and security are preserved. Users who opt into sharing device analytics will participate in this system, but none of their data will ever leave their iPhone.
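For readers curious how noisy polling can hide individuals while still producing useful aggregates, here is a minimal sketch of the randomized-response idea behind Differential Privacy. The type names, the flip probability, and the sample numbers are illustrative assumptions, not Apple's implementation.

```swift
// Minimal randomized-response sketch: each device flips its true yes/no
// answer with some probability, so any single report is deniable, yet
// aggregate counts can still be estimated accurately at scale.
// Names and the flip probability are illustrative, not Apple's code.
struct NoisyPoll {
    let flipProbability: Double   // higher = more noise, more privacy

    // Report a randomized answer instead of the true one.
    func report(trueAnswer: Bool) -> Bool {
        Double.random(in: 0..<1) < flipProbability ? !trueAnswer : trueAnswer
    }

    // Estimate the true "yes" rate from many noisy reports by
    // inverting the bias the coin flips are expected to introduce.
    func estimateYesRate(from reports: [Bool]) -> Double {
        let observed = Double(reports.filter { $0 }.count) / Double(reports.count)
        return (observed - flipProbability) / (1 - 2 * flipProbability)
    }
}

let poll = NoisyPoll(flipProbability: 0.25)
// Simulate 10,000 devices where roughly 30% would truly answer "yes."
let noisyReports = (0..<10_000).map { _ in poll.report(trueAnswer: Double.random(in: 0..<1) < 0.3) }
print("Estimated yes-rate:", poll.estimateYesRate(from: noisyReports))
```

No single report reveals anything reliable about its sender, but the estimate across thousands of devices lands close to the true rate.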
Analyzing data without identifiers
Differential Privacy is a concept Apple has leaned on and developed since at least 2006, though it didn't become part of the company's public identity until 2016. It started as a way to learn how people used emoji, to find new words for local dictionaries, to power deep links within apps, and to improve Notes search.

Analyzing data with Differential Privacy. Image source: Apple
Apple says that beginning with iOS 18.5, Differential Privacy will be used to analyze user data and train specific Apple Intelligence features, starting with Genmoji. The technique will identify patterns in common prompts so Apple can better train the AI and produce better results for those prompts.
Basically, Apple provides artificial prompts it believes are popular, like "dinosaur in a cowboy hat," and looks for pattern matches in user data analytics. Because of the artificially injected noise and a threshold requiring hundreds of fragment matches, there isn't any way to surface unique or individually identifying prompts.
Plus, these searches for fragments of prompts only result in a positive or negative poll, so no user data is derived from the analysis. Again, no data can be isolated and traced back to a single person or identifier.
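To illustrate how threshold-gated polling keeps one-off prompts invisible, the sketch below aggregates noisy yes-votes per candidate fragment and only surfaces fragments that clear a minimum count. The fragments, counts, and threshold value are made up for the example; Apple has not published these parameters.

```swift
// Illustrative server-side aggregation: count positive polls per candidate
// prompt fragment and only surface fragments whose counts clear a minimum
// threshold across many devices. All data here is invented for the example.
struct FragmentPoll {
    let fragment: String      // e.g. "dinosaur in a cowboy hat"
    let positiveReports: Int  // noisy yes-votes received from devices
}

func popularFragments(from polls: [FragmentPoll], threshold: Int) -> [String] {
    polls.filter { $0.positiveReports >= threshold }
         .sorted { $0.positiveReports > $1.positiveReports }
         .map { $0.fragment }
}

let polls = [
    FragmentPoll(fragment: "dinosaur in a cowboy hat", positiveReports: 742),
    FragmentPoll(fragment: "cat astronaut", positiveReports: 1_205),
    FragmentPoll(fragment: "a one-off personal prompt", positiveReports: 3),  // never surfaces
]
print(popularFragments(from: polls, threshold: 500))
// ["cat astronaut", "dinosaur in a cowboy hat"]
```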
The same technique will be used for analyzing Image Playground, Image Wand, Memories Creation, and Writing Tools. These systems rely on short prompts, so the analysis can be limited to simple prompt pattern matching.
Apple wants to take these methods further by implementing them for text generation. Since text generation for email and other systems involves much longer prompts and, likely, more private user data, Apple took extra steps.
Apple is using recent research into developing synthetic data that can be used to represent aggregate trends in real user data. Of course, this is done without removing a single bit of text from the user's device.
After generating synthetic emails meant to resemble real ones, Apple compares their embeddings against limited samples of recent user emails that have been converted into embeddings on the device. The synthetic embeddings that land closest to the samples across many devices indicate which of Apple's synthetic data is most representative of real human communication.
Once a pattern is found across devices, that synthetic data and pattern matching can be refined to work across different topics. The process enables Apple to train Apple Intelligence to produce better summaries and suggestions.
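Conceptually, the on-device step boils down to comparing a local email's embedding against Apple's synthetic embeddings and reporting nothing more than which synthetic variant is closest. The sketch below uses a plain cosine-similarity comparison with placeholder vectors and hypothetical function names; the real pipeline, embedding model, and any added noise are not detailed at this level.

```swift
// Conceptual on-device step: given embeddings of synthetic emails supplied by
// the server and an embedding of one local email, report only the index of
// the closest synthetic variant -- never the email text or its embedding.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).map(*).reduce(0, +)
    let magA = (a.map { $0 * $0 }.reduce(0, +)).squareRoot()
    let magB = (b.map { $0 * $0 }.reduce(0, +)).squareRoot()
    return dot / (magA * magB)
}

func closestSyntheticIndex(localEmbedding: [Double], syntheticEmbeddings: [[Double]]) -> Int {
    syntheticEmbeddings.indices.max { lhs, rhs in
        cosineSimilarity(localEmbedding, syntheticEmbeddings[lhs]) <
        cosineSimilarity(localEmbedding, syntheticEmbeddings[rhs])
    }!
}

let synthetic = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]  // placeholder vectors
let localEmail = [0.12, 0.75, 0.25]                                   // stays on the device
print("Device reports index:", closestSyntheticIndex(localEmbedding: localEmail, syntheticEmbeddings: synthetic))
// Device reports index: 1
```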
Again, the Differential Privacy method of Apple Intelligence training is opt-in and takes place on-device. User data never leaves the device, and the polling results Apple gathers have noise introduced, so even though no user data is present, individual results can't be tied back to a single identifier.
These Apple Intelligence training methods sound very familiar
If Apple's methods here ring any bells, it's because they appear similar to the methods the company planned to implement, but abandoned, for CSAM detection. The system would have converted user photos into hashes that were compared to a database of hashes of known CSAM.

Apple's CSAM detection feature relied on hashing photos without violating privacy or breaking encryption
However, these are two very different systems with different goals. The new on-device Apple Intelligence training system is built to prevent Apple from learning anything about the user, while CSAM detection could lead to Apple discovering something about a user's photos.
That analysis would occur in iCloud photo storage. Apple would have been able to perform the photo hash matching using a method called Private Set Intersection, which is performed without ever looking at a user photo or removing a photo from iCloud.
When enough instances of potential positive results for CSAM hash matches occurred on a single device, it would trigger a system that sent affected images to be analyzed by humans. If the discovered images were CSAM, the authorities would be notified.
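Stripped of its cryptography, the threshold idea can be sketched as a simple counter that only escalates after enough hash matches accumulate. Apple's actual design used perceptual photo hashes matched via Private Set Intersection under encryption; none of that is modeled here, and every name and value below is hypothetical.

```swift
// Bare-bones sketch of the threshold idea only: count matches against a known
// hash set and flag for human review only after enough matches accumulate.
struct MatchCounter {
    let knownHashes: Set<String>
    let reviewThreshold: Int
    private(set) var matchCount = 0

    init(knownHashes: Set<String>, reviewThreshold: Int) {
        self.knownHashes = knownHashes
        self.reviewThreshold = reviewThreshold
    }

    // Returns true only once the match count reaches the review threshold.
    mutating func record(photoHash: String) -> Bool {
        if knownHashes.contains(photoHash) {
            matchCount += 1
        }
        return matchCount >= reviewThreshold
    }
}

var counter = MatchCounter(knownHashes: ["abc123", "def456"], reviewThreshold: 30)
let escalate = counter.record(photoHash: "abc123")
print(escalate)  // false -- a single match never triggers review
```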
The CSAM detection system preserved user privacy, data encryption, and more, but it also introduced potential new attack vectors that could be abused by authoritarian governments. For example, people worried that if such a system could be used to find CSAM, governments could compel Apple to use it to find certain kinds of speech or imagery.
Apple ultimately abandoned the CSAM detection system. Advocates have spoken out against Apple's decision, suggesting the company is doing nothing to prevent the spread of such content.
Note that while the CSAM detection feature has several parallels to the new Apple Intelligence training system, the two are built on different technologies. For example, the noise introduced into the data sets to obfuscate user data, which is what makes the new system Differential Privacy, was not part of the CSAM detection feature.
Since both systems involve converting user data into a comparable block of data, it can be easy to see similarities between the two. However, the technologies have very different foundations and goals.
Opting out of Apple Intelligence training
While parts of the implementation appear similar, it seems Apple has landed on a much less controversial use. Even so, there are those who would prefer not to offer data, privacy protected or not, to train Apple Intelligence.

Opt in or out using the data analytics settings
Nothing has been implemented yet, so don't worry, there's still time to ensure you are opted out. Apple says it will introduce the feature in iOS 18.5 and testing will begin in a future beta.
To check whether you're opted in, open Settings, scroll down and select Privacy & Security, then select Analytics & Improvements. Toggle off the "Share iPhone & Watch Analytics" setting to opt out of AI training if you haven't already.
Comments
It's the same with really any analytics sharing, whether it be opt-in or opt-out. How much data is really scrubbed? How does Joe Consumer truly know whether or not his personal data has been effectively removed? And if it hasn't, what sort of recourse does he have?
A more sensible approach would be to just turn it off (opt out) and wait 5-10 years until the technology has been deployed to see how truly safe it is. Then the individual can decide whether or not sending in their "anonymized data" [sic] is worth the risk. No one sane would want to sign up for this.
Apple needs to turn this off by default if they really care about user privacy.
As far as whether or not it is true or infallible, that's a whole different matter. First, Apple can't afford to lie, so I trust that what it says about the technology and the data available to the company is true. Otherwise, all it would take is one person proving it isn't true to cause a huge scandal. Apple has been using these techniques publicly for nearly a decade; I think someone would have proven them inadequate by now.
Based on how the technology is described, I can't imagine a single way to trace data back to a user. How would you even start if all you have is polling results, random noise, large aggregate data samples, and no identifiers? Even if you had the anonymous identifiers, how would you resolve those into individual users?
It's not just the how that seems mind-boggling, but the why. Unless Apple has been hacked, how would the available data be abused in its current form? A single employee trying to resolve the results of a query on Genmoji use somehow identifies Sam Smith in Austin, Texas as the guy who asked for a burrito wearing a sombrero. OK, now what?
The email portion of this is even more obfuscated. None of the email is included in data sent to Apple, so how would any information be obtained? Say a bad actor got the data, found an identifier attached to a poll result, and traced it back to a user. All they've discovered is that someone somewhere has an iPhone with an email containing a fragment of words found in an artificially generated email.
This seems to be a well-thought-out system without much wiggle room for error. Though I would be interested if anyone can actually figure out any possible attack vectors or vulnerabilities. I'd be happy to be proven wrong in this case.
even if it is used in a dozen other places in the system that aren’t considered controversial and adds nothing specific to the controversy?
Same thing with Background Activity, Location Services (including precise location), Live Activities, Notifications and more. Lots of things to shut down.
Apple's CSAM detection is not end-to-end encrypted, which requires asymmetric keys to ensure that the sender and then the receiver are the only ones privy to the contents. Introducing any other mechanism to enable review by a man in the middle is essentially a backdoor into the algorithm.
But as some may say, the scanning was on device, so what's the issue with that? On-device scanning is a very useful tool; it makes finding things easier on your device. What I do have an issue with is the reporting part. It's a form of surveillance of what's on your device, something that should be private.
Yes, it only scanned when a photo was sent via iCloud and reported when a threshold was met, but that's written in software, and software can change. So when that reporting gets triggered can change. What if the government liked the results and required Apple to be stricter to help find more positive matches?
The only correct way to think of iCloud with on-device CSAM scanning was to view your photos as being in a semi-public (effectively public) space.
On device should be private; communication through Apple should be considered semi-public (the data would still be encrypted in transit to Apple, but Apple would technically have full access), unless otherwise specified by Apple as being end-to-end encrypted and verified by a third party (true E2E, not wish-it-were E2E).
it’s a terrible click-bait headline not supported by the article, and you seem to be defensive for being called out on it.
The risks of CSAM detection were blown way out of proportion by people who didn't understand it. The same thing is happening here.
Not true. There isn't just one approach. Apple is already exploring alternative approaches to machine learning that are not being used by the competition. Their research around this has already been published, although I don't have a link to share.