On-device Apple Intelligence training seems to be based on controversial technology

Posted in iPhone, edited April 14

On Monday, Apple shared its plans to let users opt in to on-device Apple Intelligence training that uses Differential Privacy techniques bearing a strong resemblance to its abandoned CSAM detection system.

Apple Intelligence to be trained on anonymized user data on an opt-in basis



Differential Privacy is a concept Apple embraced openly in 2016 with iOS 10. It is a privacy-preserving method of data collection that introduces noise into sampled data so data collectors can't trace any individual data point back to its source.

According to a post on Apple's machine learning blog, Apple is working to implement Differential Privacy as a method to gather user data to train Apple Intelligence. The data is provided on an opt-in basis, anonymously, and in a way that can't be traced back to an individual user.

The story was first covered by Bloomberg, which highlighted Apple's report on using synthetic data refined against real-world user information. However, it isn't as simple as grabbing user data off an iPhone to analyze in a server farm.

Instead, Apple will utilize a technique called Differential Privacy, which, if you've forgotten, is a system designed to introduce noise to data collection so individual data points cannot be traced back to the source. Apple takes it a step further by leaving user data on the device -- only polling for matches and taking the poll results off the user's device.
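To make that idea concrete, here is a minimal sketch of one classic local Differential Privacy technique, randomized response. It illustrates the general concept only, not Apple's actual implementation; the truth probability, poll question, and simulated device counts are made up for the example.

```swift
import Foundation

/// Minimal randomized-response sketch: each device answers a yes/no poll
/// ("does this prompt fragment appear on this device?") but sometimes
/// replaces its answer with a coin flip, so any single report is deniable
/// while large aggregates remain statistically useful.
struct RandomizedResponse {
    /// Probability of reporting the true answer; lower values add more noise.
    let truthProbability: Double

    /// Add noise to a device's true yes/no answer before it leaves the device.
    func privatize(_ trueAnswer: Bool) -> Bool {
        Double.random(in: 0..<1) < truthProbability ? trueAnswer : Bool.random()
    }

    /// Server side: estimate the true "yes" rate from many noisy reports.
    /// E[noisyYesRate] = p * trueRate + (1 - p) * 0.5, solved for trueRate.
    func estimateTrueRate(noisyYesRate: Double) -> Double {
        (noisyYesRate - (1 - truthProbability) * 0.5) / truthProbability
    }
}

// Simulate 10,000 devices where roughly 10% truly saw the fragment.
let poll = RandomizedResponse(truthProbability: 0.75)
let reports = (0..<10_000).map { poll.privatize($0 % 10 == 0) }
let noisyRate = Double(reports.filter { $0 }.count) / Double(reports.count)
print(poll.estimateTrueRate(noisyYesRate: noisyRate)) // roughly 0.1
```

With enough reports the aggregate estimate is accurate, but any single device's answer could plausibly have been the result of the coin flip.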

These methods ensure that Apple's principles behind privacy and security are preserved. Users who opt in to sharing device analytics will participate in this system, but none of their data will ever leave their iPhone.

Analyzing data without identifiers



Differential Privacy is a concept Apple has leaned on and developed since at least 2006, but it didn't become part of the company's public identity until 2016. It started as a way to learn how people used emoji, to find new words for local dictionaries, to power deep links within apps, and to improve Notes search.

Analyzing data with Differential Privacy. Image source: Apple



Apple says that starting with iOS 18.5, Differential Privacy will be used to analyze user data and train specific Apple Intelligence features, beginning with Genmoji. It will identify patterns in the prompts people commonly use so Apple can better train the AI and return better results for those prompts.

Basically, Apple provides artificial prompts it believes are popular, like "dinosaur in a cowboy hat," and looks for pattern matches in user data analytics. Because of the artificially injected noise and a threshold requiring hundreds of fragment matches, there is no way to surface unique or individually identifying prompts.

Plus, these searches for fragments of prompts only result in a positive or negative poll, so no user data is derived from the analysis. Again, no data can be isolated and traced back to a single person or identifier.
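For illustration only, the aggregation gate could look something like the sketch below; the candidate prompts, vote counts, and threshold value here are invented for the example and are not Apple's actual parameters.

```swift
/// Hypothetical aggregation step: each candidate prompt fragment has a count
/// of noisy "seen on this device" votes from opted-in devices. Only fragments
/// whose counts clear a high threshold are surfaced, so rare prompts that
/// might identify a person never appear in the results.
func popularFragments(noisyVoteCounts: [String: Int],
                      threshold: Int = 500) -> [String] {
    noisyVoteCounts
        .filter { $0.value >= threshold }
        .sorted { $0.value > $1.value }
        .map { $0.key }
}

// Made-up counts: only the widely shared fragment survives the threshold.
let counts = [
    "dinosaur in a cowboy hat": 8_214,
    "my dog rex in my red scarf": 3  // looks unique, so it is filtered out
]
print(popularFragments(noisyVoteCounts: counts)) // ["dinosaur in a cowboy hat"]
```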

The same technique will be used for analyzing Image Playground, Image Wand, Memories Creation, and Writing Tools. These systems rely on short prompts, so the analysis can be limited to simple prompt pattern matching.

Apple wants to take these methods further by implementing them for text generation. Since text generation for email and other systems involves much longer prompts, and likely more private user data, Apple has taken extra steps.

Apple is drawing on recent research into generating synthetic data that can represent aggregate trends in real user data. Of course, this is done without removing a single bit of text from the user's device.

Apple generates synthetic emails meant to resemble real ones and computes embeddings for them. Participating devices compare those synthetic embeddings against embeddings of a limited sample of the user's recent emails. The synthetic embeddings that most closely match samples across many devices show which of Apple's synthetic data is most representative of real human communication.
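As a rough sketch of that nearest-embedding comparison (not Apple's pipeline; the function names and embedding vectors here are placeholders), a device would report only which synthetic candidate is closest, never the email or its embedding:

```swift
import Foundation

/// Cosine similarity between two embedding vectors of equal length.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let normA = sqrt(a.map { $0 * $0 }.reduce(0, +))
    let normB = sqrt(b.map { $0 * $0 }.reduce(0, +))
    return dot / (normA * normB)
}

/// On-device step: given embeddings of Apple-generated synthetic emails and
/// an embedding of a locally sampled user email, return only the index of
/// the closest synthetic candidate. The email text and its embedding stay on
/// the device; only the winning index (plus noise) would be reported.
func closestSyntheticIndex(syntheticEmbeddings: [[Double]],
                           localEmbedding: [Double]) -> Int? {
    syntheticEmbeddings.indices.max { lhs, rhs in
        cosineSimilarity(syntheticEmbeddings[lhs], localEmbedding) <
        cosineSimilarity(syntheticEmbeddings[rhs], localEmbedding)
    }
}
```

Aggregated across many opted-in devices, those winning indices indicate which synthetic emails best match real usage, which is all the training pipeline needs.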

Once a pattern is found across devices, that synthetic data and pattern matching can be refined to work across different topics. The process enables Apple to train Apple Intelligence to produce better summaries and suggestions.

Again, the Differential Privacy method of Apple Intelligence training is opt-in and takes place on-device. User data never leaves the device, and the gathered polling results have noise introduced, so even though no user data is present, individual results can't be tied back to a single identifier.

These Apple Intelligence training methods sound very familiar



If Apple's methods here ring any bells, it's because they appear similar to the methods the company planned to implement, but ultimately abandoned, for CSAM detection. That system would have converted user photos into hashes and compared them to a database of hashes of known CSAM.

Apple's CSAM detection feature relied on hashing photos without violating privacy or breaking encryption



However, these are two very different systems with different goals. The new on-device Apple Intelligence training system is built to prevent Apple from learning anything about the user, while CSAM detection could lead to Apple discovering something about a user's photos.

That analysis would have applied to photos being uploaded to iCloud Photos. Apple would have performed the hash matching using a method called Private Set Intersection, which works without Apple ever viewing a user photo or pulling one off the device.

When enough potential positive CSAM hash matches accumulated for a single device, it would trigger a system that sent the affected images to be reviewed by humans. If the reviewed images were confirmed as CSAM, the authorities would be notified.
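As a heavily simplified illustration of that threshold gate only (the actual proposal used NeuralHash and cryptographic Private Set Intersection, neither of which is shown here, and the threshold value is illustrative), the gating logic amounted to counting matches and acting only once a limit was crossed:

```swift
/// Naive stand-in for the threshold gate: count how many locally computed
/// image hashes appear in a set of known hashes, and only flag for human
/// review once the count crosses a threshold. The real proposal wrapped this
/// comparison in Private Set Intersection so neither side learned anything
/// about non-matching images.
func shouldTriggerReview(deviceImageHashes: [String],
                         knownHashes: Set<String>,
                         threshold: Int = 30) -> Bool {
    let matches = deviceImageHashes.filter { knownHashes.contains($0) }.count
    return matches >= threshold
}
```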

The CSAM detection system preserved user privacy, data encryption, and more, but it also introduced potential new attack vectors that could have been abused by authoritarian governments. For example, critics worried that if such a system could find CSAM, governments could compel Apple to use it to find other kinds of speech or imagery.

Apple ultimately abandoned the CSAM detection system. Child safety advocates have spoken out against the decision, arguing the company is doing nothing to prevent the spread of such content.

Note that while the CSAM detection feature has several parallels to the new Apple Intelligence training system, the two are built on different technologies. For example, the noise introduced into data sets to obfuscate user data, which is what makes it Differential Privacy, was not part of the CSAM detection feature.

Since both systems involve converting user data into a comparable block of data, it can be easy to see similarities between the two. However, the technologies have very different foundations and goals.

Opting out of Apple Intelligence training



While parts of the implementation appear similar, it seems Apple has landed on a much less controversial use. Even so, there are those who would prefer not to offer any data, privacy-protected or not, to train Apple Intelligence.

Opt in or out using the data analytics settings



Nothing has been implemented yet, so don't worry: there's still time to ensure you're opted out. Apple says it will introduce the feature in iOS 18.5, with testing beginning in a future beta.

To check whether you're opted in, open Settings, scroll down and select Privacy & Security, then select Analytics & Improvements. Toggle off the "Share iPhone & Watch Analytics" setting to opt out of AI training if you haven't already.




Comments

  • Reply 1 of 16
mpantone Posts: 2,377 member
This whole article is based on the premise that Differential Privacy is 100% infallible all the time, for every situation, forever, which is rather difficult to believe.

It's the same with really any analytics sharing, whether it be opt-in or opt-out. How much data is really scrubbed? How does Joe Consumer truly know whether or not his personal data has been effectively removed? And if it hasn't been, what sort of recourse does he have?

    A more sensible approach would be to just turn it off (opt out) and wait 5-10 years until the technology has been deployed to see how truly safe it is. Then the individual can decide whether or not sending in their "anonymized data" [sic] is worth the risk. No one sane would want to sign up for this.

    Apple needs to turn this off by default if they really care about user privacy.
    edited April 14
  • Reply 2 of 16
Wesley Hilliard Posts: 442 member, administrator, moderator, editor
    mpantone said:
This whole article is based on the premise that Differential Privacy is 100% infallible all the time, for every situation, forever, which is rather difficult to believe.

It's the same with really any analytics sharing, whether it be opt-in or opt-out. How much data is really scrubbed? How does Joe Consumer truly know whether or not his personal data has been effectively removed? And if it hasn't been, what sort of recourse does he have?
    As consumers, we don't have a lot of power other than the ability to opt out. Don't believe it is true? Don't like the concept? Opt out. At least Apple gave us that much.

As far as whether or not it is true or infallible, that's a whole different matter. First, Apple can't afford to lie, so I trust that what it says about the technology and the data available to the company is true. Otherwise, all it would take is one person proving it isn't true to cause a huge scandal. Apple has been using these techniques publicly for nearly a decade; I think someone would have proven them inadequate by now.

Based on how the technology is described, I can't imagine a single way to trace data back to a user. How would you even start if all you have is polling data results, random noise, large aggregate data samples, and no identifiers? Even if you had the anonymous identifiers, how would you resolve those into individual users?

    It's not just the how that seems mind boggling, but the why. Unless Apple has been hacked, how would the available data be abused in its current form? A single employee trying to resolve the results of a query on Genmoji use somehow identifies Sam Smith in Austin, Texas as the guy that asked for a burrito wearing a sombrero. Ok, now what?

    The email portion of this is even more obfuscated. None of the email is included in data sent to Apple, so how would any information be obtained? Say a bad actor got the data, found an identifier attached to a poll result, and returned it to a user. All they've discovered is that someone somewhere has an iPhone with an email containing a fragment of words found in an artificially generated email.

    This seems to be a well thought out system without any wiggle room for error. Though I would be interested if anyone actually can figure out any possible attack vectors or vulnerabilities. I'd be happy to be proven wrong in this case.
  • Reply 3 of 16
mattinoz Posts: 2,604 member
    Wait so every subsystem used to get CSAM working is controversial now?
    even if it is used in a dozen other places in the system that aren’t considered controversial and adds nothing specific to the controversy?
  • Reply 4 of 16
Wesley Hilliard Posts: 442 member, administrator, moderator, editor
    mattinoz said:
    Wait so every subsystem used to get CSAM working is controversial now?
    even if it is used in a dozen other places in the system that aren’t considered controversial and adds nothing specific to the controversy?
    idk if you missed it, but on-device and iCloud CSAM detection using these tools were deemed highly controversial.
  • Reply 5 of 16
DAalseth Posts: 3,258 member
    Thanks for the reminder. I just checked to verify that the Share switch was off on all my devices. 
  • Reply 6 of 16
mpantone Posts: 2,377 member
    DAalseth said:
    Thanks for the reminder. I just checked to verify that the Share switch was off on all my devices. 
    I review these settings every time I upgrade iOS, iPadOS, macOS or restore from a previous backup. It takes me at least an hour to review all of these when setting up a new device from scratch.

    Same thing with Background Activity, Location Services (including precise location), Live Activities, Notifications and more. Lots of things to shut down.
    edited April 14
  • Reply 7 of 16
SiTime Posts: 27 member
Just went to settings to confirm that I had already opted out of everything (I had). Thank you very much for the article to inform/remind us of this. It's always helpful to check the privacy settings every now and then just to make sure everything is still off (and to check if any new settings have been added).
  • Reply 8 of 16
    The CSAM detection system preserved user privacy, data encryption, and more

I suppose this isn't technically wrong. The data would be encrypted, but when a threshold was met, it would allow the data to be decrypted, reviewed, and potentially then sent to authorities. I had posted several comments on this topic.

Apple's CSAM detection is not end-to-end encrypted, which requires asymmetric keys to ensure that the sender and the receiver are the only ones privy to the contents. Introducing any other mechanism to enable review by a man in the middle is essentially a backdoor into the algorithm.

But as some may say, the scanning was on device, so what's the issue with that? On-device scanning is a very useful tool; it makes finding things easier on your device. What I do have an issue with is the reporting part. It's a form of surveillance of what's on your device, something that should be private.

Yes, it only scanned when a photo was sent via iCloud and reported when a threshold was met, but that's written in software, and software can change. So when that reporting gets triggered can change. Let's say the government didn't like the results and required Apple to be stricter to help find more positive matches?

The only correct way to think of iCloud with CSAM on-device scanning was to view your photos as being in a semi-public (effectively public) space.

On-device data should be private; communication through Apple should be considered semi-public (the data would still be encrypted in transit to Apple, but Apple would technically have full access), unless otherwise specified by Apple as being end-to-end encrypted and verified by a third party (true E2E, not wish-it-were E2E).

  • Reply 9 of 16
mattinoz Posts: 2,604 member
    mattinoz said:
    Wait so every subsystem used to get CSAM working is controversial now?
    even if it is used in a dozen other places in the system that aren’t considered controversial and adds nothing specific to the controversy?
    idk if you missed it, but on-device and iCloud CSAM detection using these tools were deemed highly controversial.
I didn't miss it, but differential privacy is used extensively in the system and isn't the thing that made CSAM detection controversial.
  • Reply 10 of 16
swat671 Posts: 167 member
I'm actually surprised Apple wasn't doing something like this already. If they don't have a way to see what people are doing with these systems (AI, photo editing/generation, etc.), how do they have any way to improve? They can't. That's why Siri is so far behind Google and other platforms' AI tech. Apple refuses to use the data it has to improve its tech. So until and unless they do start using the data they already have, Siri/AI will never improve.
  • Reply 11 of 16
swat671 Posts: 167 member
    mpantone said:
    DAalseth said:
    Thanks for the reminder. I just checked to verify that the Share switch was off on all my devices. 
    I review these settings every time I upgrade iOS, iPadOS, macOS or restore from a previous backup. It takes me at least an hour to review all of these when setting up a new device from scratch.

    Same thing with Background Activity, Location Services (including precise location), Live Activities, Notifications and more. Lots of things to shut down.
    Why do that? Most of the stuff your phone can do is then worthless. Why get an iPhone if you don’t let it do anything in the background?
  • Reply 12 of 16
blastdoor Posts: 3,757 member
Note that while the CSAM detection feature has several parallels to the new Apple Intelligence training system, the two are built on different technologies. For example, the noise introduced into data sets to obfuscate user data, which is what makes it Differential Privacy, was not part of the CSAM detection feature.
    On the one hand I’m glad this note was included in the article. On the other hand, it suggests the title of the article is misleading and alarmist.


  • Reply 13 of 16
s.metcalf Posts: 1,011 member
    mattinoz said:
    Wait so every subsystem used to get CSAM working is controversial now?
    even if it is used in a dozen other places in the system that aren’t considered controversial and adds nothing specific to the controversy?
    idk if you missed it, but on-device and iCloud CSAM detection using these tools were deemed highly controversial.
Mattinoz didn't miss anything.  You seem to have missed his point.  The controversial aspects of Apple's formerly planned CSAM scanning, and the (in my view) legitimate backlash, had nothing to do with the technology itself but with how Apple had planned to implement it.  It would've resulted in false positives being sent for review, creating significant privacy concerns, like the potential that personal photos or videos of your own kids doing normal things could get flagged and sent to Apple or others.  The things described here are nothing like that and are optional.

It's a terrible clickbait headline not supported by the article, and you seem to be defensive about being called out on it.
  • Reply 14 of 16
    Can we talk about the real issue at hand? At what point did people start calling tennis soccer?!
  • Reply 15 of 16
coolfactor Posts: 2,368 member

The risks of CSAM detection were blown way out of proportion by people who didn't understand it. The same thing is happening here.
  • Reply 16 of 16
coolfactor Posts: 2,368 member

    swat671 said:
    I’m actually surprised Apple wasn’t doing something like this already. If they don’t have a way to see what people are doing with these systems (AI, photo editing/generation, etc), how do they have any way to improve? They can’t. That’s why Siri is so far behind Google and other platforms AI tech. Apple refuses to use the data they have to improve their tech. So until and unless they do start using the data they already have, Siri/AI will never improve. 

    Not true. There isn't just one approach. Apple is already exploring alternative approaches to machine learning that are not being used by the competition. Their research around this has already been published, although I don't have a link to share.
