Quote:
Your explanation is technically sound and very clear, but it confirms what we've been saying: the data is NOT fully anonymized, and it's unclear how it ever could be, since it's not random words but series of words forming sentences that are useful to you in particular.
I like Dragon, and I'd rather have such an offline Siri...
You are right that an offline Siri would improve privacy even more.
On the other hand, I'm satisfied with its implementation, because the major privacy rules are met (see the small sketch after the list):
1. Transparency about how the data is handled
2. Collection of only the technically necessary data
3. An anonymization strategy is in place
4. The possibility to delete connected data by opting out
5. No direct link to the user profile (= no merging of the data with other data sources or services into one profile)
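To make these rules tangible, here is a minimal sketch of how such a decoupling could look. The QueryStore service and all names are my own illustration under these assumptions, not Apple's actual implementation:

```python
import uuid

class QueryStore:
    """Hypothetical ingestion service illustrating the rules above:
    queries are keyed to a random identifier, never to the account."""

    def __init__(self):
        # In practice this mapping would live on the device, not the
        # server, so the server never sees the account at all (rule 5).
        self._pseudonyms = {}  # account -> random identifier
        self._queries = {}     # random identifier -> stored queries

    def pseudonym_for(self, account):
        # Rule 5: a random identifier with no link to the user profile
        if account not in self._pseudonyms:
            self._pseudonyms[account] = uuid.uuid4().hex
        return self._pseudonyms[account]

    def store(self, pseudonym, query_text):
        # Rule 2: store only the query text, nothing else
        self._queries.setdefault(pseudonym, []).append(query_text)

    def opt_out(self, pseudonym):
        # Rule 4: opting out deletes everything tied to the identifier
        self._queries.pop(pseudonym, None)
```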
If we want more, we have to disconnect our devices from the internet, which means no mail, messaging, web browsing, internet search, e-commerce or social networking.
I'm pretty careful with private data, running my own servers, but giving up communicating and sharing data is not something I'd even consider.
BTW: Are you sure that Dragon on iOS is fully functional in offline mode? As far as I know it also sends data to the servers in order to improve the software, and Dragon Dictation at least needs a network connection.
Edit: I think I have to clarify some points.
First of all, protecting privacy means taking responsibility for data. Part of this responsibility has to be taken by the user; the other part shifts automatically to a third party as soon as it stores or routes the data.
A simple rule: when you store data from someone else, you automatically accept your share of the responsibility. Transparency simply means that you provide information about how you handle it.
If you have to take on a lot of responsibility, you should do a risk analysis. Its result is your data-processing and anonymization strategy.
There are a couple of points to consider here:
1. Functional data
If your app or service simply needs private data in order to work, you will have to store it.
2. Financially and legally binding data
If you process this kind of data, you are forced to store it, and in most cases you are bound to fixed retention periods.
In both cases you have to think about how to protect it, taking your functional and legal obligations into account. In some cases you will find that those requirements are diametrically opposed to the privacy expectations of the data owners.
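To make the retention point concrete, here is a minimal sketch of enforcing fixed retention periods; the categories and periods are invented for the example and don't reflect any specific regulation:

```python
from datetime import datetime, timedelta

# Invented categories and periods, purely for illustration
RETENTION = {
    "functional": timedelta(days=90),    # keep only while needed
    "financial": timedelta(days=3650),   # fixed statutory period
}

def purge_expired(records, now=None):
    """Drop every record whose retention period has elapsed.
    Each record is a dict with 'category' and 'stored_at' keys."""
    now = now or datetime.utcnow()
    return [r for r in records
            if now - r["stored_at"] < RETENTION[r["category"]]]
```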
The common way to address the problems raised by the risk analysis is to develop a security and privacy strategy. Apart from covering legal issues, there are a lot of ways to differentiate here, based on company culture, the specific demands of the topic and the people involved, and, last but not least, your strategic decision on how to cover risks.
This strategic decision might lead to different implementations. The common approaches range from shifting the responsibility to the user while providing tools for adjustment, to taking the responsibility yourself by defining comparably strong presets or rules.
It's hard to judge which approach is better, because the first gives the user more flexibility to adjust privacy and security to his specific needs, while the second in most cases leads to a higher overall privacy and security level.
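A tiny hypothetical configuration sketch of the two approaches; the option names are mine, purely illustrative:

```python
# Approach 1: shift responsibility to the user and provide tools;
# every privacy-relevant option is exposed and adjustable.
user_settings = {
    "send_queries_to_server": True,  # the user may switch this off
    "retention_days": 365,           # the user may shorten this
}

# Approach 2: take the responsibility yourself;
# the same options exist but are fixed at conservative presets.
FIXED_PRESETS = {
    "send_queries_to_server": False,
    "retention_days": 30,
}
```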
I normally treat privacy and security together in discussions, because privacy is data security in the sense of preventing unwanted access to or use of personal data.
What's also worth mentioning is my opinion about risk analysis. It's a strong concept for avoiding damage, but when it crosses the line into paranoia, its positive effect is reversed.
You simply can't cover all risks. That's why we still have to mourn traffic deaths.
The other problem is that too many restrictions can lead to a lack of self-responsibility.
I think everyone who has children knows this conflict: trying to live up to your responsibility without dampening their ability to build their own identity.
Now back to my third point, anonymization. Anonymization is not just a preset but a process.
In order to clarify whether the stored Siri data is fully anonymized or not, I'd like to discuss it using a common example: an employee satisfaction survey.
You want to improve employee satisfaction, so you develop a survey form which is filled out in pencil.
If you only have five employees in your department and the department manager does the evaluation, the data isn't strongly anonymized, because the manager might recognize the employees by their handwriting or the way they respond to some critical questions.
The common approach is to delegate the evaluation to an independent third party, which also judges the impartiality of the questions in order to control what's stored and analyzed.
Another important point is the scale of the survey. If 1,000 employees are asked, even a highly critical remark can't necessarily be tracked down to a specific person.
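This "safety in numbers" effect is essentially a threshold rule: results are only released for groups large enough to hide the individual. A minimal sketch; the threshold and the function name are my own illustration:

```python
from collections import Counter

K = 5  # hypothetical minimum group size before an answer is released

def releasable_counts(answers, k=K):
    """Count survey answers and suppress any answer given by fewer
    than k respondents, so it can't be traced back to one person."""
    counts = Counter(answers)
    return {answer: n for answer, n in counts.items() if n >= k}

# With five respondents, every answer is too rare to release safely:
print(releasable_counts(["yes", "no", "no", "critical remark", "yes"]))
# -> {}  (at a scale of 1,000 respondents the common answers survive)
```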
And that's how I look at the Siri queries that are saved longer term. Your queries are among a couple of billion stored in a strongly anonymized way. I consider this "realizable full anonymization".
Does it cover all aspects of a risk analysis? No, because it's theoretically possible that someone who knows your voice steps through your recordings and finds something he might use in an unintended way. But because of all the other measures in place, this is highly unlikely.
Privacy measures can cover the common risk that someone who isn't entitled simply asks "give me all queries from John Doe from the year 2012".
The only way to cover the residual risk is to provide opt-out and deletion functionality.