Apple reveals it keeps anonymized Siri data for up to 2 years

dominoxml · April 20, 2013 7:40AM

Quote:

Originally Posted by btracy713

No matter what you do, there is no such thing as privacy anymore. Unless you are truly off the grid

Privacy and freedom of speech are part of the constitutional rights in most democratic states.

Maybe I'm off the grid, but I judge keeping it relevant to be worth the effort. It's possible to protect it without significantly decreasing functionality and user experience on computing devices.

lightknight · April 20, 2013 2:17PM

Quote:

Originally Posted by DominoXML

But we are still at a point where your messages are only strongly, but not fully anonymized.

The reason is that your messages might contain contentual references to you like spoken words.

Your explanation is technically sound and very clear, but it clearly says what we also say: data is NOT fully anonymized, and it's unclear how it could ever be, since it's not random words, but series of words forming sentences useful to you, in particular.

I like Dragon, and I'd rather have such an offline Siri...

dominoxml · April 20, 2013 4:33PM

Quote:

Originally Posted by lightknight

Your explanation is technically sound and very clear, but it clearly says what we also say: data is NOT fully anonymized, and it's unclear how it could ever be, since it's not random words, but series of words forming sentences useful to you, in particular.

I like Dragon, and I'd rather have such an offline Siri...

You are right that an offline Siri would improve privacy even more.

On the other hand I'm satisfied with it's implementation because the major privacy rules are satisfied:

1. Transperancy through privacy policy and user messages when turning the service on

2. Collecting of only the technical necessary data

3. Anonymization strategy is in place

4. Possibility to delete connected data by opt-out

5. No direct link to the user profile (=not merging data with other data sources or services in one profile)

If we want more we have to disconnect our devices from the internet, which means no mail, messaging, web browsing, internet search, e-commerce and social networking.

I'm pretty careful with private data, running my own servers, but stopping to communicate and share data is not what I'd even consider.

BTW: Are you sure that Dragon on iOS is fully functional in offline mode? As far as I know it also sends data to the servers in order to improve the software and at least Dragon Dictation needs a network connection.

Edit: I think I have to clarify some points.

First of all protecting privacy is taking the responsibility for data. A part of this responsibility has to be taken by the user, the other part shifts to a third party automatically when it saves or routes the data.

A simple rule is that when a store data from someone else you automatically accept your part of responsibility. Transparency simply means that you provide the information how you handle it.

If you have to take a lot of responsibility you should do a risk-analysis. The result of this is your data processing and anonymization strategy.

There are a couple of points to consider here:

1. Functional data

If your app and service simply needs private data in order to work you will have to save it.

2. Financial and legal binding data

If you process this kind of data you are forced to save it and in most cases you are bound to fixed retention periods.

In both cases you have to think about how to protect it, taking you functional and legal obligations into account. In some cases you will find that those requirements are diametrically opposed to the privacy expectations of the data owners.

The common way to address the problems raised by the risk-analysis is to develop a security and privacy strategy. Apart from covering legal issues there are a lot ways to differentiate here based on company culture, specific demand because topic and persons and last but not least your strategic decision to cover risks.

This strategic decision might lead to different implementations. The common approaches vary between shifting the responsibility to the user while providing tools for adjustment for taking the responsibility by defining comparable strong presets or rules.

It's hard to judge which approach is better, because the first gives the user a higher flexibility to adjust privacy / security for his specific needs, while the second in most cases leads to a higher global privacy / security level.

I normally use privacy and security combined in discussions, because privacy is data security equal to prevention of unwanted access or use of personal data.

What's also worth to mention is my opinion about risk analysis. It's a strong concept for avoiding damages, but when it crosses the line of paranoia it's positive effect is reversed.

You simply can't cover all risks. That's why we still have to bemoan traffic deaths.

The other problem is that too much restrictions can lead to a lack of self-responsibility.

I think everyone who has children knows about these conflicts trying to cover the responsibility without damping their ability to build their own identity.

Now back to my third point anonymization. Anonymization is not just a preset but a process.

In order to clarify whether the stored Siri data is fully anonymized or not I'd like to discuss it by a common example: employee satisfaction survey.

You want to improve employee satisfaction. So you develop a survey form which is filled by pencil.

If you only have five employees in your department and the department manager does the evaluation the data isn't strongly anonymized, because the manager might recognize the employees by handwriting or they way they respond to some critical questions.

The common approach is to delegate the evaluation to an independent third party, which also judges the impartiality of the questions in order to control what's stored and analyzed.

Another important point is the scale of the survey. If there are 1000 employees asked even an highly critical remark of an employee can't be necessarily tracked down to the specific person.

And that's how I look at the longer term saved Siri queries. Your queries are a couple of billions saved in a strongly anonymized way. I consider this as "realizable full-anonymization".

Does it cover all aspects of a risk analysis? No, because it's theoretical possible that someone who knows your voice steps through your recordings and finds something he might use in an unintended way. But because of all other measures in place this is highly unlikely.

Privacy can cover the common risks that someone not entitled can easily ask for "give me all queries from John Dow from the year 2012".

The only way to cover the residual risk is to provide opt-out and deletion functionality.

dominoxml · April 21, 2013 6:56AM

I'm not sure if these topics are from high interest and they aren't meant to ignite heated discussions.

I wrote these words in order to provide some insight in the work of someone who is often involved into the privacy implementation and decision process. I'm aware that I give up a bit of my own privacy by discussing these topics in public.

kdarling · April 21, 2013 7:35AM

Hmm. Other people have pointed out that there's something else going on.

When you switch Siri off you get a warning that says:

"The information Siri uses to respond to your requests will be removed from Apple servers. If you want to use Siri later, it will take time to resend this information."

I seem to recall early reviews of Siri talking about how the current request context (previous questions) and any contact nicknames (e.g. "Mom = Mrs Smith") , plus probably what you like Siri to call you, are all sent up to the servers to help it figure out a correct response.

So when Apple answered the ACLU's question about Siri, were they talking only about the voice data, or also about the associated data that goes with the request? Perhaps it's all kept together as one lump under the random id?

--

Another wag brought up an interesting lack of privacy involved with anyone keeping voice clips. The data is only anonymous if the voice clip contains no identifying details.

Imagine the voice command, "Take a note. My social security number is 123-45-6789." or "Remind me to call Jim Jones in Teaneck about the weed."

dominoxml · April 21, 2013 8:19AM

Quote:

Originally Posted by KDarling

So when Apple answered the ACLU's question about Siri, were they talking only about the voice data, or also about the associated data that goes with the request? Perhaps it's all kept together as one lump under the random id?

--

Another wag brought up an interesting lack of privacy involved with anyone keeping voice clips. The data is only anonymous if the voice clip contains no identifying details.

Imagine the voice command, "Take a note. My social security number is 123-45-6789." or "Remind me to call Jim Jones in Teaneck about the weed."

I can't answer this 100 % accurate. The only thing I can say is that I disabled Siri in order to test what's going on and it "forgot" the related information.

"Please say my wife that I', late" asked me then to specify who my wife is.

You have also to differentiate the data from the commands. Your example "Take a note. May social number is..." covers different types of data.

The command to take the note, the spoken sentence and the text data saved in the notes-app.

This data isn't linked together and stored the same way. In your example your voice command is wiped while the note with the security number is still saved in your notes app linked to your iCloud account. If you have enabled iCloud for notes this piece of data is stored on the servers.

And no there isn't a lack of privacy because those three types of data are stored separate and not directly linked together like other services do it.

kdarling · April 21, 2013 8:46AM

Quote:

Originally Posted by DominoXML

The command to take the note, the spoken sentence and the text data saved in the notes-app.

This data isn't linked together and stored the same way. In your example your voice command is wiped while the note with the security number is still saved in your notes app linked to your iCloud account. If you have enabled iCloud for notes this piece of data is stored on the servers.

Apple says that the voice clip (and perhaps associated context) is not wiped, but kept for years.

Quote:

And no there isn't a lack of privacy because those three types of data are stored separate and not directly linked together like other services do it

Right, however I'm not talking about info that needs associations, but data that has value by itself.

A voice clip containing valid SSN is valuable data. Knowledge that so-and-so is a drug dealer is valuable info. Or a voice note request to, "Remind me to tell Forstall that we're keeping the same iPhone body style for three more years."

Now, again, I don't see this as a problem for most people with the major voice services (Apple, Google, Nuance) simply because it's in those companies' best interest to keep this stuff very private.

However, military or government or consulate personnel should not use such services, simply because they're a possible leak vector and the support personnel with clip access have probably not been through a security clearance. (Or maybe they have. It would be interesting to find out.)

(Decades ago, I was a voice intercept operator for the military branch of NSA)

dominoxml · April 21, 2013 10:12AM

Quote:

Originally Posted by KDarling

Apple says that the voice clip (and perhaps associated context) is not wiped, but kept for years.

Right, however I'm not talking about info that needs associations, but data that has value by itself.

A voice clip containing valid SSN is valuable data. Knowledge that so-and-so is a drug dealer is valuable info. Or a voice note request to, "Remind me to tell Forstall that we're keeping the same iPhone body style for three more years."

Now, again, I don't see this as a problem for most people with the major voice services (Apple, Google, Nuance) simply because it's in those companies' best interest to keep this stuff very private.

However, military or government or consulate personnel should not use such services, simply because they're a possible leak vector and the support personnel with clip access have probably not been through a security clearance. (Or maybe they have. It would be interesting to find out.)

(Decades ago, I was a voice intercept operator for the military branch of NSA)

I'm not talking about the pieces of data that are stored for functional improvements. Again, privacy is to anonymize the data as good as possible.

What I hope that I got right is the fact that this data can be wiped through opt-out. We are talking about data the user has willingly provided. At this point there's the responsibility to judge the risk of those stored. I think the measures to insure unintended access is blocked are well in place.

I'm not sure whether I'm so bad in explaining or you are not willing to accept that your Forstall voice sniped is anonymized in a way that it can hardly be found.

It's like searching for a needle in a haystack when there are no user references. And even if you find this voice snippet it's hard to impossible to track it back to the user in a way it could be used e.g. in court because anonymized SSN data might be valuable but it's hard to use it in court.

At this point we are leaving the technical discussion and entering the political / legal discussion whether access by national authorities can be taken as unintended access in terms of privacy law or whether it's backed up by constitutional jurisdiction. That's no topic I'm involved.

I think it's clear that when you store data in a public place (cloud) there's obviously a higher risk that the data get's leaked.

In my opinion governmental authorities, health care organizations etc. shouldn't use cloud services at all because of the nature of their data. (exception is the so called private cloud, where the provider only provides the infrastructure, while the administration is bound to a NDA contract)

I'm a bit surprised about your standpoint now knowing that you have worked for the NSA.

I thought it was clear that there are legal terms of reference that bring up conflicts with privacy like I pointed out earlier.

lightknight · April 21, 2013 10:20AM

Quote:

Originally Posted by DominoXML

BTW: Are you sure that Dragon on iOS is fully functional in offline mode? As far as I know it also sends data to the servers in order to improve the software and at least Dragon Dictation needs a network connection.

Dragon on iOS is not offline (and I also don't use it). I was insufficiently clear, I guess. I use Dragon both on Mac and Windows ^^

dominoxml · April 21, 2013 10:28AM

Quote:

Originally Posted by lightknight

Dragon on iOS is not offline (and I also don't use it). I was insufficiently clear, I guess. I use Dragon both on Mac and Windows ^^

Thanks for your reply - makes sense. I think that the problem is that the current mobile devices don't have the computing power and storage (+ batterie issues) to run the full functionality offline. That's most likely the reason why parts of the processing are shifted to the cloud. Perhaps we will see on offline mode here in future generations.

solipsismx · April 21, 2013 11:52AM

dominoxml wrote: »

Thanks for your reply - makes sense. I think that the problem is that the current mobile devices don't have the computing power and storage (+ batterie issues) to run the full functionality offline. That's most likely the reason why parts of the processing are shifted to the cloud. Perhaps we will see on offline mode here in future generations.

It's all of the processing. I think all the local device does is capture the audio file, compress and send it, then upon return it's just the code that needs to be executed locally to display and speak the proper results.

You can turn off Siri and get the pre-iPhone 4S voice activated feature set back. This will let you make calls, find music, etc. It's limited enough that it doesn't need to "think" about an increasingly wider array of possible options.

I mention this because one of the downsides of Siri is that if you don't have an internet connection, or if it's not good, Siri can't work. Why should it if you can't get to its servers, right? But my issue is why isn't the local device smart enough to know that it can't access Siri to then default to the localized system until it comes back online, or why the local system can't be smart enough to do rudimentary processing to know that if the first waveform is "call' or "play' it knows it's a contact or music related and therefor be a local service.

From Apple's PoV they may like to think that the device is always connected but that simply isn't the case. Also, if you are having to go through their servers to make a call or play a song they then have a record of that simple task. I don't think for a second they would sell that data or use it for nefarious ways but I do wonder if they really think it's best to go their servers. LTE in my area is fast but it's still slower to go through Siri to make a call or play a song than it is with the previous system, and these are the two most common things I do with Siri.

dominoxml · April 21, 2013 12:07PM

Thank you very much for your information SolipsismX. I wasn't aware that local processing is still in place. I'll try to check it out.

My use of Siri / Dictation is also often hampered by bad network connections. A fallback strategy would be nice.

Let's see what iOS 7 has to offer.

kdarling · April 21, 2013 12:19PM

Quote:

Originally Posted by DominoXML

What I hope that I got right is the fact that this data can be wiped through opt-out.

Yes sir, you got that right, and I got it wrong. I was thinking about the fact that data older than six months was kept even without an id. However, yes, opting out will delete data younger than six months.

Quote:

I'm not sure whether I'm so bad in explaining or you are not willing to accept that your Forstall voice sniped is anonymized in a way that it can hardly be found. It's like searching for a needle in a haystack when there are no user references. And even if you find this voice snippet it's hard to impossible to track it back to the user in a way it could be used e.g. in court because anonymized SSN data might be valuable but it's hard to use it in court.

Oh, I wasn't necessarily talking about using anything in court. I was talking about data that would be useful all on its own... no source necessary... perhaps for corporate or government espionage.

Quote:

I'm a bit surprised about your standpoint now knowing that you have worked for the NSA.

I thought it was clear that there are legal terms of reference that bring up conflicts with privacy like I pointed out earlier.

Things change. For example, back in the mid 70s, NSA (*) was not allowed to intercept American citizens in the USA. We had to swear and sign a ton of documents about it.

(*) Insiders don't add "the" in front of the name "NSA". The private joke is that there is only one NSA, and therefore... just as when talking about God... there is no need to say "the", as it's automatically implied. For example, you would say "I prayed to God" and "He works for NSA", in the same way.

gatorguy · April 21, 2013 12:23PM

Quote:

Originally Posted by DominoXML

Thanks for your reply - makes sense. I think that the problem is that the current mobile devices don't have the computing power and storage (+ batterie issues) to run the full functionality offline. That's most likely the reason why parts of the processing are shifted to the cloud. Perhaps we will see on offline mode here in future generations.

That doesn't seem likely either since Google's voice processing is done on the device itself on any old Android phone running 4.x.

dominoxml · April 21, 2013 12:31PM

Quote:

Originally Posted by KDarling

Things change. For example, back in the mid 70s, NSA (*) was not allowed to intercept American citizens in the USA. We had to swear and sign a ton of documents about it.

(*) Insiders don't add "the" in front of the name "NSA". The private joke is that there is only one NSA, and therefore... just as when talking about God... there is no need to say "the", as it's automatically implied. For example, you would say "I prayed to God" and "He works for NSA", in the same way.

Thanks for this information. I was under the impression that these main rules are still in place.

(*) Yes, I'm no insider.

lightknight · April 22, 2013 10:11AM

Quote:

Originally Posted by SolipsismX

It's all of the processing. I think all the local device does is capture the audio file, compress and send it, then upon return it's just the code that needs to be executed locally to display and speak the proper results.

You can turn off Siri and get the pre-iPhone 4S voice activated feature set back. This will let you make calls, find music, etc. It's limited enough that it doesn't need to "think" about an increasingly wider array of possible options.

I mention this because one of the downsides of Siri is that if you don't have an internet connection, or if it's not good, Siri can't work. Why should it if you can't get to its servers, right? But my issue is why isn't the local device smart enough to know that it can't access Siri to then default to the localized system until it comes back online, or why the local system can't be smart enough to do rudimentary processing to know that if the first waveform is "call' or "play' it knows it's a contact or music related and therefor be a local service.

From Apple's PoV they may like to think that the device is always connected but that simply isn't the case. Also, if you are having to go through their servers to make a call or play a song they then have a record of that simple task. I don't think for a second they would sell that data or use it for nefarious ways but I do wonder if they really think it's best to go their servers. LTE in my area is fast but it's still slower to go through Siri to make a call or play a song than it is with the previous system, and these are the two most common things I do with Siri.

My guess: because Apple doesn't want people to have "varying behavior of Siri". Either Siri is there, or Siri is not there, but Siri able to answer complex questions or not depending on network would probably stress Average Joe, and Apple hates that!

lightknight · April 22, 2013 10:19AM

Quote:

Originally Posted by KDarling

Things change. For example, back in the mid 70s, NSA (*) was not allowed to intercept American citizens in the USA. We had to swear and sign a ton of documents about it.

(*) Insiders don't add "the" in front of the name "NSA". The private joke is that there is only one NSA, and therefore... just as when talking about God... there is no need to say "the", as it's automatically implied. For example, you would say "I prayed to God" and "He works for NSA", in the same way.

1- I guess it's also true for the KGB ^^

2- Kind of like spies referring to their guys as "agents" and the opponent as "spies"?

3- Back in 1910, such a thing as the FBI was an impossibility. Then it became possible, and the relationship between Nazi ideology support in America and the creation of the FBI is a bit scary. Hopefully, the FBI is now a perfectly functional body of pure, efficient agents of free thought and democracy protecting American citizens. (at least in reasonable borders... asking for perfection is unreasonable)

solipsismx · April 22, 2013 10:20AM

lightknight wrote: »

My guess: because Apple doesn't want people to have "varying behavior of Siri". Either Siri is there, or Siri is not there, but Siri able to answer complex questions or not depending on network would probably stress Average Joe, and Apple hates that!

That sounds reasonable to me.

Apple reveals it keeps anonymized Siri data for up to 2 years

Comments