Health metrics are badly undermined by a lack of proper context. Unsurprisingly, it turns out that you can't reliably take a concept as broad as health and reduce it to a number. We see the same arguments over and over with body fat percentages, VO2 max estimates, BMI, lactate thresholds, resting heart rate, HRV, and more. These are all useful metrics, but it's important to consider each of them in the context it deserves.
This article gave an LLM a bunch of health metrics, asked it to reduce them to a single score, didn't tell us any of the actual metric values, and then compared that score to a doctor's opinion. Why anyone would expect these to align is beyond my understanding.
The most obvious thing that jumps out at me is that doctors generally, for better or worse, think about "health" very differently than the fitness community does. It's different toolsets and different goals. If this person's VO2 max estimate was under 30, that's objectively a poor VO2 max by most standards, and an LLM trained on the internet's entire repository of fitness discussion is likely going to give this person a bad score in terms of cardio fitness. But a doctor who sees a person come in who isn't complaining about anything in particular, moves around fine, doesn't have risk factors like age or family history, and has good metrics on a blood test is probably going to say they're in fine cardio health, regardless of what their wearable says.
I'd go so far as to say this is probably the case for most people. Your average person is in really poor fitness-shape but just fine health-shape.
> Despite having access to my weight, blood pressure and cholesterol, ChatGPT based much of its negative assessment on an Apple Watch measurement known as VO2 max, the maximum amount of oxygen your body can consume during exercise. Apple says it collects an “estimate” of VO2 max, but the real thing requires a treadmill and a mask. Apple says its cardio fitness measures have been validated, but independent researchers have found those estimates can run low — by an average of 13 percent.
There's plenty of blame to go around for everyone, but at least for some of it (such as the above) I think the blame more rests on Apple for falsely representing the quality of their product (and TFA seems pretty clearly to be blasting OpenAI for this, not others like Apple).
What would you expect the behavior of the AI to be? Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all as you could never draw any conclusions from it. Even disregarding statistical outliers, it's not at all clear what part of the data is "good" vs "unreliable", especially when the company that collected that data claims that it's good data.
FWIW, Apple has published validation data showing the Apple Watch's estimate is within 1.2 ml/kg/min of a lab-measured Vo2Max.
Behind the scenes, it's using a pretty cool algorithm that combines deep learning with physiological ODEs: https://www.empirical.health/blog/how-apple-watch-cardio-fit...
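To give a flavor of what "deep learning plus physiological ODEs" can mean, here's a minimal toy sketch. This is emphatically not Apple's actual pipeline: the first-order ODE form, the parameter values, and the `hr_target` stub are all illustrative assumptions standing in for whatever learned components a real system would use.

```python
import numpy as np

# Toy first-order heart-rate response model:
#   dHR/dt = (HR_target(intensity) - HR) / tau
# In a real system, hr_target (and possibly tau) might be produced by a learned
# model conditioned on pace, grade, and user history. Here it is a simple stub.

def hr_target(intensity, resting_hr=60.0, max_hr=190.0):
    """Placeholder for a learned mapping from exercise intensity (0..1) to steady-state HR."""
    return resting_hr + intensity * (max_hr - resting_hr)

def simulate_hr(intensities, dt=1.0, tau=30.0, hr0=60.0):
    """Euler-integrate the ODE over a workout described by per-second intensity values."""
    hr = hr0
    trace = []
    for u in intensities:
        hr += dt * (hr_target(u) - hr) / tau
        trace.append(hr)
    return np.array(trace)

# Example: 5 minutes easy, then 5 minutes hard
workout = np.concatenate([np.full(300, 0.3), np.full(300, 0.8)])
print(simulate_hr(workout)[-1])  # approaches hr_target(0.8) = 164 bpm
```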
> I think the blame more rests on Apple for falsely representing the quality of their product
There was plenty of other concerning stuff in that article. And from a quick read, it wasn't suggested or implied that the VO2 max issue was the deciding factor for the original F score the author received. The article did suggest many times over that ChatGPT is really not equipped for the task of health diagnosis.
> There was another problem I discovered over time: When I tried asking the same heart longevity-grade question again, suddenly my score went up to a C. I asked again and again, watching the score swing between an F and a B.
The lack of self-consistency does seem like a sign of a deeper issue with reliability. In most fields of machine learning robustness to noise is something you need to "bake in" (often through data augmentation using knowledge of the domain) rather than get for free in training.
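As a concrete example of what "baking in" robustness through domain-aware augmentation can look like for wearable time series, here's a minimal sketch; the noise magnitudes are illustrative assumptions, not values from the article or any real device.

```python
import numpy as np

def augment_window(x, rng, jitter_std=0.02, scale_std=0.05):
    """Augment a (timesteps, channels) window of sensor data.

    - jitter: additive Gaussian noise, mimicking measurement noise
    - scaling: per-channel multiplicative drift, mimicking sensor/placement variation
    Magnitudes here are illustrative, not tuned for any real device.
    """
    jitter = rng.normal(0.0, jitter_std, size=x.shape)
    scale = rng.normal(1.0, scale_std, size=(1, x.shape[1]))
    return (x + jitter) * scale

rng = np.random.default_rng(0)
window = rng.normal(size=(300, 4))       # e.g. 5 minutes of 1 Hz data, 4 channels
augmented = augment_window(window, rng)  # train on many such perturbed copies
```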
> Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all as you could never draw any conclusions from it.
Yes. You, and every other reasoning system, should always challenge the data and assume it’s biased at a minimum.
This is better described as “critical thinking” in its formal form.
You could also call it skepticism.
That impossibility of drawing conclusions assumes there’s a correct answer and is called the “problem of induction.” I promise you a machine is better at avoiding it than a human.
Many people freeze up or fail with too much data: put someone with no experience in front of 500 people to give a speech if you want to watch this live.
> What would you expect the behavior of the AI to be? Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all as you could never draw any conclusions from it.
Well, I would expect the AI to provide the same response a real doctor did from the same information, which the article shows the doctors were able to do.
I also would expect the AI to provide the same answer every time for the same data, unlike what it did (swinging from an F to a B over multiple attempts in the article).
OpenAI is entirely to blame here when they put out faulty products (hallucinations, even on accurate data, are their fault).
I have been sitting and waiting for the day these trackers get exposed as just another health fad, optimized to deliver shareholder value and not serious enough for medical-grade applications.
I don't see how they are considered a health fad; they're extremely useful and accurate enough. There are plenty of studies and real-world data showing Garmin VO2 max readings coming within 1-2 points of a proper lab test.
There is constant debate about how accurately VO2 max is measured, and it's highly dependent on actually doing exercise with the watch on so it can estimate your VO2 max. But yes, if you want a lab-level, medically precise measure, you need a test that measures your actual oxygen uptake.
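As a rough illustration of how lightweight the non-exercise estimates can be, here's one published resting-heart-rate formula (Uth et al., 2004: VO2 max ≈ 15.3 × HRmax/HRrest). This is not what Garmin or Apple actually run, just a sketch of how far estimation goes without ever measuring oxygen uptake.

```python
def vo2max_resting_hr(hr_max, hr_rest):
    """Uth et al. (2004) non-exercise estimate: VO2max ~= 15.3 * HRmax / HRrest (ml/kg/min).

    Watches use far richer inputs (heart rate vs. pace during outdoor walks and runs),
    but even those remain estimates compared to a lab test with a metabolic cart.
    """
    return 15.3 * hr_max / hr_rest

# Example: HRmax 190 bpm, resting HR 55 bpm -> ~52.9 ml/kg/min
print(round(vo2max_resting_hr(190, 55), 1))
```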
A simple understanding of transformers should be enough to make someone see that using an LLM to analyze multi-variate time series data is a really stupid endeavor.
ChatGPT Health is a completely reckless and dangerous product; they should be sued into oblivion for even naming it "health".
A year or so ago, I fed my wife's blood work results into chatgpt and it came back with a terrifying diagnosis. Even after a lot of back and forth it stuck to its guns. We went to a specialist who performed some additional tests and explained that the condition cannot be diagnosed with just the original blood work and said that she did not have the condition. The whole thing was a borderline traumatic ordeal that I'm still pretty pissed about.
Please keep telling your story. This is the kind of shit that medical science has been dealing with for at least a century. When evaluating testing procedures, false positives can have serious consequences. A test that's positive every time will catch every single true positive, but it's also worthless. These LLMs don't have a goddamn clue about it. There should be consequences for these garbage fires giving medical advice.
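To put numbers on the "always positive" point: sensitivity alone says nothing about how believable a positive result is. Here's a minimal sketch of positive predictive value via Bayes' rule; the 1% prevalence and test characteristics are made-up illustrative values, not figures from the article.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test), via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A test that is always positive: perfect sensitivity, zero specificity.
print(positive_predictive_value(1.00, 0.00, 0.01))  # 0.01 -> a positive tells you nothing
# Even a decent test at 1% prevalence still produces mostly false alarms.
print(positive_predictive_value(0.95, 0.90, 0.01))  # ~0.088
```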
Part of the issue is taking its output as a conclusion rather than as a signal / lead.
I would never let an LLM make an amputate-or-not decision, but it could convince me to go talk with an expert who sees me in person and takes a holistic view.
I can't wait until it starts recommending signing me up for an OpenAI personalized multi-vitamin® subscription
We trained a foundation model specifically for wearable data: https://www.empirical.health/blog/wearable-foundation-model-...
The basic idea was to adapt JEPA (Yann LeCun's Joint-Embedding Predictive Architecture) to multivariate time series, in order to learn a latent space of human health from purely unlabeled data. Then we tested the model using supervised fine-tuning and evaluation on a bunch of downstream tasks, such as predicting a diagnosis of hypertension (~87% accuracy). In theory, this model could also be aligned to the latent space of an LLM, similar to how CLIP aligns a vision model to an LLM.
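For readers unfamiliar with JEPA, here's a heavily simplified sketch of the core training step for time series. The GRU encoder, tail-masking scheme, sizes, and EMA rate below are illustrative placeholders, not our actual architecture: a context encoder and predictor are trained to match, in latent space, the output of a slowly updated target encoder.

```python
import copy
import torch
import torch.nn as nn

# Simplified JEPA-style step for (batch, timesteps, channels) wearable windows.
# All sizes, the masking scheme, and the EMA rate are illustrative assumptions.
enc = nn.GRU(input_size=4, hidden_size=64, batch_first=True)  # context encoder
target_enc = copy.deepcopy(enc)                               # EMA target encoder (no grads)
for p in target_enc.parameters():
    p.requires_grad_(False)
predictor = nn.Linear(64, 64)
opt = torch.optim.AdamW(list(enc.parameters()) + list(predictor.parameters()), lr=1e-3)

def jepa_step(x, mask_frac=0.5, ema=0.99):
    # Mask the tail of each window; predict its latent from the visible head.
    split = int(x.shape[1] * (1 - mask_frac))
    _, h_ctx = enc(x[:, :split])          # latent of the visible context
    with torch.no_grad():
        _, h_tgt = target_enc(x)          # latent of the full window (prediction target)
    loss = nn.functional.mse_loss(predictor(h_ctx[-1]), h_tgt[-1])
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                 # EMA update of the target encoder
        for p_t, p in zip(target_enc.parameters(), enc.parameters()):
            p_t.mul_(ema).add_(p, alpha=1 - ema)
    return loss.item()

print(jepa_step(torch.randn(8, 288, 4)))  # e.g. a day of 5-minute samples, 4 channels
```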
IMO, this shows that accuracy in consumer health will require specialized models alongside standard LLMs.
The author is a healthy person but the computer program still gave him a failing grade of F. It is irresponsible for these companies to release broken tools that can cause so much fear in real people. They are treating serious medical advice like it is just a video game or a toy. Real users should not be the ones testing these dangerous products.
> It is irresponsible for these companies
I would claim that ignoring the "ChatGPT is AI and can make mistakes. Check important info." text, right under where they type their query in the client, is clearly more irresponsible.
I think that a disclaimer like that is the most useful and reasonable approach for AI.
"Here's a tool, and it's sometimes wrong." means the public can have access to LLMs and AI. The alternative, that you seem to be suggesting (correct me if I'm wrong), means the public can't have access to an LLM until they are near perfect, which means the public can't ever have access to an LLM, or any AI.
What do you see as a reasonable approach to letting the public access these imperfect models? Training? Popups/agreement after every question "I understand this might be BS"? What's the threshold for quality of information where it's no longer considered "broken"? Is that threshold as good as or better than humans/news orgs/doctors/etc?
> Popups/agreement after every question "I understand this might be BS"?
Considering the number of people who take LLM responses as authoritative Truth, that wouldn't be the worst thing in the world.
What LLM should the LLM turn to, to ask if what the user is asking is safe for the first LLM to answer?
The Apple Watch told me, based on VO2 max, that I'm almost dead, all the time. I went to the doctor, did a real test, and it was complete nonsense. I had the watch replaced 3 times but got the same results, so I returned it and will not try again. Scaring people with stuff you cannot actually shut off (at least you couldn't before) is not great.
Original article can be read at https://www.washingtonpost.com/technology/2026/01/26/chatgpt....
Paywall-free version at https://archive.ph/k4Rxt
Typical Western coverage: “How dare they call me unhealthy.” In reality, the doctor said it needs further investigation and that some data isn’t great. They didn’t say “unhealthy”; they said “needs more investigation.” What’s wrong with that? Is the real issue just a bruised Western ego?