Apple Watch 'black box' algorithms unreliable for medical research
Apple's use of algorithms to analyze data may be an issue for medical research, after a Harvard professor discovered inconsistencies in data from one Apple Watch accessed at different times.

An Apple Watch showing a blood oxygen reading.
One of the benefits of mobile and wearable devices like the Apple Watch is that improvements can be made in software. In medical research, however, this is not necessarily a good thing, and it has prompted one study to rethink its methodology.
According to JP Onnela, an associate professor of biostatistics at the Harvard T.H. Chan School of Public Health, these changes can produce inconsistencies in data collection. This can occur even when the same data is analyzed at different points in time.
While Onnela typically prefers research-grade devices for data collection in studies, The Verge reports that a collaboration with the department of neurosurgery at Brigham and Women's Hospital prompted an examination of consumer hardware. Specifically, the study's team wanted to check how accurate the results from commercial products like the Apple Watch could be.
Two exports of the same daily heart rate variability data were collected from one Apple Watch, covering the same period from December 2018 to September 2020. Although the exports were made on September 5, 2020, and April 15, 2021, the data should have been identical given it covered an identical timeframe, yet differences were discovered.
It is thought that tweaks Apple made to the Apple Watch's algorithms changed how the data was interpreted before it was exported.
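To illustrate the kind of discrepancy check the researchers describe, here is a minimal Python sketch comparing two exports of the same per-day data. The dates, values, and function name are illustrative assumptions, not taken from the study or from Apple's export format:

```python
# Illustrative sketch (not the study's actual code): detecting where two
# exports of the same per-day heart rate variability (HRV) data disagree.
# All dates and values below are made up for demonstration.

# Hypothetical export made September 5, 2020
export_2020 = {"2018-12-01": 42.1, "2018-12-02": 38.7, "2018-12-03": 45.0}

# Hypothetical export of the same days, made April 15, 2021
export_2021 = {"2018-12-01": 42.1, "2018-12-02": 40.2, "2018-12-03": 45.0}

def find_discrepancies(a, b, tolerance=0.0):
    """Return {date: (value_a, value_b)} where exports disagree beyond a tolerance."""
    shared = a.keys() & b.keys()
    return {day: (a[day], b[day])
            for day in sorted(shared)
            if abs(a[day] - b[day]) > tolerance}

for day, (old, new) in find_discrepancies(export_2020, export_2021).items():
    print(f"{day}: {old} -> {new}")  # prints "2018-12-02: 38.7 -> 40.2"
```

If algorithm changes were applied retroactively, a check like this would flag nonzero differences even though both exports cover identical dates, which is what the researchers report observing.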
"These algorithms are what we would call black boxes - they're not transparent. So it's impossible to know what's in them," said Onnela. "What was surprising was how different they are. This is probably the cleanest example that I have seen of this phenomenon."
The changes are a concern for scientific researchers, who want minimal variation in how devices report or record the same sets of data. Small changes may not be a problem for typical users, but for researchers who require consistency, Onnela says "that's the concern."
The findings caused the team to shift away from using consumer hardware and back to medical-grade devices. Onnela proposes that the Apple Watch and other wearable items should only be used if raw data is available or if researchers can be informed of when algorithm changes occur.
The Apple Watch and other Apple hardware have been used for medical studies in the past, sometimes as the primary device. In April, Apple partnered with the University of Washington to study how the Apple Watch could be used to predict illnesses like the flu or the coronavirus.
Stanford University also looked into whether an iPhone and Apple Watch could be used to remotely assess a heart disease patient's frailty, in a study funded by Apple. Researchers found there was a slight dip in accuracy in at-home testing versus in-clinic versions, though it was put down to "out-of-clinic variability" rather than Apple's sensors.
Update: Apple later told The Verge that algorithm changes are not retroactively applied to past data. The company had no explanation for the discrepancy found by Onnela, but suggested issues might arise when using third-party apps to export data.
Read on AppleInsider

Comments
If you're going to welcome someone to do their job with your data, and create an interface for that data, don't be so surprised if that person says, "yeah but... this data isn't what we need." So, where's the problem again?
Right, yes, of course, Anyone But Apple™. Sorry, I'll be better.
Apple should stop marketing ResearchKit as anything close to fit for purpose if they're going to dick around behind the scenes and without transparency. Hell, even HealthKit should have some alarm bells going off.
The opacity around this totally undermines the reliability of health data on the Apple Watch. Stupid own goal by Apple.
In any event, one possible solution would be for Apple to be transparent about changes in the algorithms. The researcher was having problems with the lack of transparency and not necessarily the fact that Apple was making changes to the algorithms.
I'm suggesting that this is one opinion from one professor, so maybe we should all calm down a little before making judgments that might look stupid down the line.
The report distinguishes reproducibility from replication. To reproduce is to take the original data and reanalyze it, sometimes using the same software as the original study. To replicate is to duplicate the original study -- different researchers, different conditions, etc.
The NASEM report notes that software used by researchers is subject to change (black boxes, if you will), and this can alter the results of studies. Software like R, SAS, and SPSS is often updated.
Frankly, I'm not at all clear what these paragraphs from the above article mean:
"Two sets of the same daily heart rate variability data collected from one Apple Watch were collected, covering the same period from December 2018 until September 2020. While the sets were collected on September 5, 2020, and April 15, 2021, the data should have been identical given they dealt with identical timeframes, but differences were discovered.
It is thought that tweaks by Apple to algorithms used in the Apple Watch changed how the data was interpreted before being collected. "
A set of data from December 2018, and another set from September 2020? Why should data have been identical? What am I not understanding? Did they really take some raw data from December 2018, and pass it through two separate algorithms and found a difference in how that data was interpreted?
In any case, I don't take my Apple Watch results that seriously. I expect a lot of variation: where on my wrist I wear it, software upgrades, skin changes, sweat, environment, a different watch with different sensors. Big-picture trends are the only thing I would expect to count, not absolute values. I have no clue as to the error bars of the Apple Watch. Using medical-quality devices would be ideal, but nobody wears them for 20 hours per day over years -- medical equipment is used over a few days or a few minutes -- quite limited in value even if perfectly accurate.