Apple Watch 'black box' algorithms unreliable for medical research
Apple's use of algorithms to analyze data may be an issue for medical research, after a Harvard professor discovered inconsistencies in data from one Apple Watch accessed at different times.

An Apple Watch showing a blood oxygen reading.
One of the benefits of mobile and wearable devices like the Apple Watch is that improvements can be made in software. In medical research, however, this is not necessarily a good thing, and it has prompted one study to rethink its methodology.
According to JP Onnela, an associate professor of biostatistics at the Harvard T.H. Chan School of Public Health, these changes can produce inconsistencies in data collection. This can occur even when the same data is analyzed at different points in time.
While Onnela typically prefers research-grade devices for data collection in studies, The Verge reports that a collaboration with the department of neurosurgery at Brigham and Women's Hospital prompted an examination of consumer hardware. Specifically, the study's team wanted to check how accurate the results from commercial products like the Apple Watch could be.
Two exports of the same daily heart rate variability data were collected from one Apple Watch, covering the same period from December 2018 to September 2020. Although the exports were made on September 5, 2020, and April 15, 2021, the data should have been identical given it covered an identical timeframe, yet differences were discovered.
It is thought that tweaks Apple made to the Apple Watch's algorithms changed how the data was interpreted before it was exported.
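To illustrate the kind of discrepancy check the researchers describe, here is a minimal Python sketch comparing two exports of the same per-day data. The dates, values, and function name are illustrative assumptions, not taken from the study or from Apple's export format:

```python
# Illustrative sketch (not the study's actual code): detecting where two
# exports of the same per-day heart rate variability (HRV) data disagree.
# All dates and values below are made up for demonstration.

# Hypothetical export made September 5, 2020
export_2020 = {"2018-12-01": 42.1, "2018-12-02": 38.7, "2018-12-03": 45.0}

# Hypothetical export of the same days, made April 15, 2021
export_2021 = {"2018-12-01": 42.1, "2018-12-02": 40.2, "2018-12-03": 45.0}

def find_discrepancies(a, b, tolerance=0.0):
    """Return {date: (value_a, value_b)} where exports disagree beyond a tolerance."""
    shared = a.keys() & b.keys()
    return {day: (a[day], b[day])
            for day in sorted(shared)
            if abs(a[day] - b[day]) > tolerance}

for day, (old, new) in find_discrepancies(export_2020, export_2021).items():
    print(f"{day}: {old} -> {new}")  # prints "2018-12-02: 38.7 -> 40.2"
```

If algorithm changes were applied retroactively, a check like this would flag nonzero differences even though both exports cover identical dates, which is what the researchers report observing.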
"These algorithms are what we would call black boxes - they're not transparent. So it's impossible to know what's in them," said Onnela. "What was surprising was how different they are. This is probably the cleanest example that I have seen of this phenomenon."
The changes are a concern for scientific researchers, who want minimal variation in how devices report or record the same sets of data. Small changes may not be a problem for typical users, but for researchers who require consistency, Onnela says "that's the concern."
The findings caused the team to shift away from using consumer hardware and back to medical-grade devices. Onnela proposes that the Apple Watch and other wearable items should only be used if raw data is available or if researchers can be informed of when algorithm changes occur.
The Apple Watch and other Apple hardware have been used for medical studies in the past, sometimes as the primary device. In April, Apple partnered with the University of Washington to study how the Apple Watch could be used to predict illnesses like the flu or the coronavirus.
Stanford University also looked into whether an iPhone and Apple Watch could be used to remotely assess a heart disease patient's frailty, in a study funded by Apple. Researchers found there was a slight dip in accuracy in at-home testing versus in-clinic versions, though it was put down to "out-of-clinic variability" rather than Apple's sensors.
Update: Apple later told The Verge that algorithm changes are not retroactively applied to past data. The company had no explanation for the discrepancy found by Onnela, but suggested issues might arise when using third-party apps to export data.
Read on AppleInsider

Comments
If you're going to welcome someone to do their job with your data, and create an interface for that data, don't be so surprised if that person says, "yeah but... this data isn't what we need." So, where's the problem again?
Right, yes, of course, Anyone But Apple™. Sorry, I'll be better.
Apple should stop marketing ResearchKit as anything close to fit for purpose if they're going to dick around behind the scenes and without transparency. Hell, even HealthKit should have some alarm bells going off.
The opacity around this totally undermines the reliability of health data on the Apple Watch. Stupid own goal by Apple.
In any event, one possible solution would be for Apple to be transparent about changes in the algorithms. The researcher was having problems with the lack of transparency and not necessarily the fact that Apple was making changes to the algorithms.
I'm suggesting that this is one opinion from one professor, so maybe we should all calm down a little before making judgments that might look stupid down the line.
The report distinguishes reproducibility from replication. To reproduce is to take the original data and reanalyze it, sometimes using the same software as the original study. To replicate is to duplicate the original study -- different researchers, different conditions, etc.
The NASEM report notes that software used by researchers is subject to change (black boxes, if you will), and this can alter the results of studies. Software like R, SAS, and SPSS is often updated.
Frankly, I'm not at all clear what these paragraphs from the above article mean:
"Two sets of the same daily heart rate variability data collected from one Apple Watch were collected, covering the same period from December 2018 until September 2020. While the sets were collected on September 5, 2020, and April 15, 2021, the data should have been identical given they dealt with identical timeframes, but differences were discovered.
It is thought that tweaks by Apple to algorithms used in the Apple Watch changed how the data was interpreted before being collected. "
A set of data from December 2018, and another set from September 2020? Why should data have been identical? What am I not understanding? Did they really take some raw data from December 2018, and pass it through two separate algorithms and found a difference in how that data was interpreted?
In any case, I don't take my Apple Watch results that seriously. I expect a lot of variation: where on my wrist I wear it, software upgrades, skin changes, sweat, environment, a different watch with different sensors. Big-picture trends are the only thing I would expect to count, not absolute values. I have no clue as to the error bars of the Apple Watch. Using medical-quality devices would be ideal, but nobody wears them for 20 hours per day over years -- medical equipment is used over a few days or a few minutes -- quite limited in value even if perfectly accurate.