I'd be very very hesitant to trust studies like this. It's very easy to mess up these benchmarks.
See for example this recent paper where AI managed to beat radiologists on interpreting x-rays... when the AI didn't even have access to the x-rays: https://arxiv.org/pdf/2603.21687 (on a pre-existing "large scale visual question answering benchmark for generalist chest x-ray understanding" that wasn't intentionally messed up).
And in interpreting x-rays, human radiologists actually do just look at the x-rays. In the context the article is discussing, the human doctors don't just look at the notes to diagnose the ER patient. You're asking them to perform a task that isn't necessary, that they aren't experienced in or trained in, and then saying "the AI outperforms them". Even if the notes aren't accidentally giving away the answer through some weird side channel, that's not that surprising.
Which isn't to say that I think the study is either definitely wrong, or intentionally deceptive. Just that I wouldn't draw strong conclusions from a single study here.
I agree with you on this specific study. However, I can't really wrap my head around the idea that doctors will remain better than AI models in the long run. After all, medicine is largely about knowledge, experience, and intelligence (maybe "pattern recognition"). Given that, we should assume that the best AI models (especially ones focused solely on the medical field) would eventually beat the large majority of humans (i.e. doctors). If we already make this assumption for software engineers, we should make it for this field as well. And let's be realistic: each time I've seen a doctor in the last few months (and the ER twice), they were using ChatGPT (not kidding, it shocked me).
So I’m genuinely curious:
What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside, let's be purely objective about it.
To answer your question: talking to a human.
Medicine is so much more than "knowledge, experience, and pattern matching", as any patient ever can attest to. Why is it so hard for some people to understand that humans need other humans and human problems can't be solved with technology?
Doctors are not necessarily great at talking to patients and patients are unhappy with the information doctors provide. This moat has dried up.
So much of what I hear from the women in my life is that the human element of medicine is almost a strict negative for them. As a guy it hasn't been much better, but at least doctors listen to me when I say something.
One of, if not THE biggest challenge in getting treatment is getting past insurance rules designed to deny treatment. This is much, much easier when you're able to convince a doctor (and/or trained medical staff) to argue on your behalf. If you can't get those folks to listen to you, that's probably not gonna happen. You might have to go through several different practices before you find a sympathetic ear.
Now replace some / all of those humans with... A machine whose function also needs insurance approval.
It's gonna end badly.
Sounds like we need to dismantle and replace this broadly dysfunctional system at multiple points. It's not like the US insurance landscape is anywhere close to the best way of handling healthcare if you look at many places in the world.
Yeah, that's mostly a US problem. Not a healthcare problem in general.
At which point I'd ask: how much of that is baked into the AI now?
It doesn't have opinions, research, direction of its own. Is this a path of codifying the worst elements of human society as we've known it, permanently?
One doctor didn't want to give me Ritalin, so I went to another one.
One was against it, the other one saw it as a good idea.
I would love to have real data, real statistics, etc.
I think there's a real space there, and a lot of what e.g. nurses and doctors do is talking to humans, and that won't go away.
But two things are also true: a) diagnosis itself can be automated. A lot of what goes on between you having an achy belly and you getting diagnosed with x, y, or z happens outside of a direct interaction with you - all of that can be augmented with AI. And b) the human interaction part is lacking a great deal in most societies. Homeopathy and a lot of alternative medicine, from what I can see, have their footing in society simply because their practitioners are better at talking to people. AI could help with that too, both in direct communication with humans and in simply making a lot of processes much cheaper, maybe e.g. making the required education to become a human-facing medical professional less of a hurdle. Diagnosis becomes cheaper & easier -> more time to actually talk to patients, and more diagnoses made with higher accuracy.
> human problems can't be solved with technology
How are you defining technology? How are you defining human problems? Inventions are created to solve human problems, not theoretical problems of a fictional universe. Do X-rays, refrigerators, phones and even looms solve problems for nonhumans?
Claiming something that sounds deep doesn’t make it an axiom.
>Medicine is so much more than "knowledge, experience, and pattern matching", as any patient ever can attest to.
Humans (doctors/nurses) can still be there to make you feel the warmth of humanity in your darkest times, but if a machine is going to perform better at diagnosing (or perhaps someday performing surgery), then I want the machine.
Even now, I'll take a surgeon that's a complete jerk over a nice surgeon any day, because if they've got that job even as a jerk they've got to be good at their jobs. I want results. I'll handle hurt feelings some other time.
I'd be a little bit careful here - being a jerk is quite different to non-conformity / red sneaker effect in surgery and it is not a quality you should look for.
The truly compassionate surgeons will want to improve their skills because they care about their patients. They care if they develop complications and may feel terrible if they do, the jerk may not. Being a jerk may mean that the surgeon can rise to the top, but it may not be due to surgical skill at all, they may be better at navigating politics etc.
"Human problems can't be solved with technology" is just wrong, unless you have narrower definitions of a "human problem" or "technology".
For instance, transportation is a "human problem". It's being successfully solved with such technologies as cars, trains, planes, etc. Growing food at scale is a "human problem" that's being successfully solved by automation. Computing... stuff could be a "human problem" too. It's being successfully solved by computers. If "human problems" are more psychological, then again, you can use the Internet to keep in touch with people, so again technology trying to solve a human problem.
The human doesn't need to be as highly trained and paid as a doctor if the human is not performing tasks concordant with that training.
If you read the study, the conclusion is much less spectacular than the article suggests. What the article implies happened:
patients -> AI -> diagnosis (you know, with a camera, or perhaps a telephone I guess)
What REALLY happened
patients -> nurse/MD -> text description of symptoms -> MD -> question (as in MD asked a relevant diagnostic question, such as "is this the result of a lung infection?", or "what lab test should I do to check if this is a heart condition or an infection?") -> AI -> answer -> 2 MDs (to verify/score)
vs
patients -> nurse/MD -> text description of symptoms -> MD -> question -> (same or other) MD -> answer -> 2 MDs verify/score the answer
Even with that enormous caveat, there are major issues:
1) The AI was NOT attempting to "diagnose" in the Doctor House sense. The AI was attempting to follow published diagnostic guidelines as perfectly as possible. A right answer by the AI was the AI following MDs' advice, a published process, NOT the AI reasoning its way to what was wrong with the patient.
2) The MD with AI support was NOT more accurate (a better score, but NOT statistically significant, hence effectively not) than the MD by himself. And it was very much a nurse or MD taking the symptoms and an MD pre-digesting the data for the AI.
3) Diagnoses were correct in the sense that they followed diagnostic standards, as judged afterwards by other MDs. NOT in the sense that they were tested on a patient and actually helped a live patient (in fact there were no patients directly involved in the study at all).
If you think about it, for most patients even the treating MDs don't know the correct conclusion. They saw the patient come in, they took a course of action (and probably wrote at best half of it down), and the patient's situation changed. And we repeat this cycle until the patient goes back out, either vertically or horizontally. Hopefully vertically.
And before you say "let's solve that" keep in mind that a healthy human is only healthy in the sense that their body has the situation under control. Your immune system is fighting 1000 kinds of bacteria, and 10 or so viruses right now, when you're very healthy. There are also problems that developed during your life (scars, ripped and not-perfectly fixed blood vessels, muscle damage, bone cracks, parts of your circulatory system having way too much pressure, wounds, things that you managed to insert through your skin leaking stuff into your body (splinters, insects, parasites, ...), 20 cancers attempting to spread (depends on age, but even a 5 year old will have some of that), food that you really shouldn't have eaten, etc, etc, etc). If you go to the emergency room, the point is not to fix all problems. The point is to get your body out of the worsening cycle.
This immediately calls up the concern that this is from doctor reports. In practice, of course, maybe the AI only performs "better" because a real doctor walked up to the patient and checked something for himself, then didn't write it down.
What you can perhaps claim this study shows is that, in the right circumstances, AIs can perform better at following an MD's instructions under time and other pressure than an actual MD can.
This. The fact that the AI projects have to spin so hard should be tipping people off. But for some reason it doesn't.
This is extreme cope.
I would personally vastly, vastly prefer to go to a robot doctor, who diagnoses, treats and nurses me. What exactly do I need from a human here? Except of course being the one making the system.
Emotional support. Some human doctors absolutely radiate confidence and a kind of "you're gonna be okay" attitude. For me, this helps a lot. I'm not sure a machine can do this.
But I hate it if the human doctor "radiates confidence" when I know he is not doing the proper scan, because I have to come back with worse symptoms before he takes it seriously. I don't need emotional support from a human doctor. I need the adequate scans and a proper analysis. I am pretty sure that a competent human will still be way better than AI, but AI even now will likely be better than a doctor who isn't really paying attention.
You can hopefully get emotional support from your loved ones. If not, a coach seems much more appropriate.
Technology is on a generational 10,000 year run of non-stop successfully solving human problems.
and causing them
> What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside, let's be purely objective about it.
Being a human when a patient is experiencing what is potentially one of the worst moments of their life. AI could be a tool doctors use, but let’s not dehumanize health care further, it is one of the most human professions that crosses about every division you can think of.
I would not want to receive a cancer diagnosis from a fucking AI doctor.
On the other hand, health care is not scaling to meet the growing demand of societies (look at the growing wait queues for access to basic medical attention in most Western nations). The cause of this is a separate topic and something that deserves more attention than it currently gets, but I digress. If AI can fill the gap by making 24/7/365 instant diagnosis and early intervention a reality, with it then bringing a human into the loop when actually necessary... I think that is something worth pursuing as a force multiplier.
We're clearly not there yet, but it is inevitable that these models will eventually exceed human capability in identifying what an issue is, understanding all of the health conditions the patient has, and recommending a treatment plan that results in the best outcome.
You may not want to receive a cancer diagnosis from an AI doctor... but if an AI doctor could automatically detect cancer (before you even displayed symptoms) and get you treated at a far earlier date than a human doctor, you would probably change your mind.
You commonly receive very close proxies for diagnoses through MyChart already when results come back from the lab.
There are a few sides to medicine:
1) looking at tests and working out a set of actions
2) following a pathway based on diagnosis
3) pulling out patient history to work out what the fuck is wrong with someone.
Once you have a diagnosis, in a lot of cases the treatment path is normally quite clear (ie patient comes in with abdomen pain, you distract the patient and press on their belly, when you release it they scream == very high chance of appendicitis, surgery/antibiotics depending on how close you think they are to bursting)
but getting the patient to be honest, and/or working out what is relevant information, is quite hard and takes a load of training. Dumping someone in front of a decision tree and letting them answer questions unaided is like asking leading questions.
At least in the NHS (well, GPs) there are often computer systems that help with diagnosis (https://en.wikipedia.org/wiki/Differential_diagnosis), which allow you to feed in the patient's background and symptoms and ask them questions until either you have something that fits, or you need to order a test.
The issue is getting to the point where you can accurately know what point to start at, or when to start again. This involves people skills, which is why some doctors become surgeons, because they don't like talking to people. And those surgeons that don't like talking to people become orthopods. (me smash, me drill, me do good)
Where AI actually is probably quite good is note taking, and continuous monitoring of HCU/ICU patients
But liability and ethics cannot be put aside. If treatments were free of cost and perfectly addressed problems, then a correct diagnosis would always lead to the optimal patient outcome. In that scenario, AI diagnosis will be like code generation and go asymptotic to perfection as models improve.
But a doctor's job in the real world today is to navigate a total mess of uncertainty: about the expected outcome of treatments given a patient's age and other problems. About the psychological effect of knowing about a problem that they cannot effectively treat. Even about what the signals in the chart and x-ray mean with any certainty.
We are very far from having unit test suites for medical problems.
Isn't that conflating diagnosis and treatment plan?
Sure, but my anecdotal experience is that doctors do this regularly in real life, especially when choosing to diagnose or ignore problems that are unlikely to kill an aging patient before some other larger issue does.
Gotcha, I was thinking more about radiologists than patient-facing doctors.
Radiologists do it too.
> I can't really wrap my head about the fact that doctors will be better than AI models on the long-run.
Nobody said that though?
If the current trajectory continues and if advancements are made regarding automated data collection about patients and if those advancements are adopted in the clinic then presumably specialized medical models will exceed human performance at the task of diagnosis at some point in the future. Clearly that hasn't happened yet.
Until medical models can conceive of a unique diagnosis, this will not and cannot be true.
Medical models can absolutely get better at recognizing the patterns of diagnoses that doctors have already been making - which means they will also amplify misdiagnoses that aren't corrected for via the cohort average. It's easy to see the large problem with this: you end up with a pseudo-eugenics medical system that can't help people who aren't experiencing a "standard" problem.
The pitfall you describe is not inconsistent with exceeding human performance by most metrics.
I'd argue that the current system in the west already exhibits this problem to some extent. Fortunately it's a systemic issue as opposed to a technical one so there's no reason AI necessarily has to make it worse.
This study is based almost entirely on pre-existing "vignettes." In other words, on tests that are already known and have existed for years, the model did well, which is precisely what you should expect.
It provides no information on real world outcomes or expectations of performance in such a setting. A simple question might be "how accurate are patient electronic health records typically?"
Finally, if the Internet somehow goes down at my hospital, the Doctor can still think, while LLM services cannot. If the power goes out at the hospital, the Doctor can still operate, while even local LLMs cannot.
You're going to need to improve the power efficiency of these models by at least two orders of magnitude before they're generally useful replacements of anything. As it is now they're a very expensive, inefficient and fragile toy.
Medicine is about knowledge, but acquiring knowledge may in fact require "breaking out of the box" that AI is increasingly kept behind to avoid touching "touchy subjects" or insulting anyone, and so on.
I think it's plausible since doctors tend to have human cognitive biases and miss things. People tend to fixate on patterns they're most familiar with.
A bold claim to suggest that LLMs aren't prone to biases of their own which are less understood.
I think AI can be useful in any kind of context interpretation, but it shouldn't make the decision.
Could be running in the background on patient data and message the doctor "I see X in the diagnostic, have you ruled out Y, as it fits for reasons a, b, c?"
I like my coding agents the same way, inform me during review on things that I've missed. Instead of having me comb through what it generates on a first pass.
Weird that this is the case and a new study.
but those kinds of x-ray models are already actively used. They are not used as the only and final diagnosis, though. It's more like peer review and prioritization: check this image first because it seems most critical today.
These types of experiments are bound to have biases depending on who is doing them and who is funding them. The experiment is being funded for a particular reason, to move the narrative in a desired direction. This is probably a good reason to have government-funded research in these types of sensitive areas.
hallucination on steroids, wow. I had to read through the abstract to believe it:
"In the most extreme case, our model achieved the top rank on a standard chest Xray question-answering benchmark without access to any images."
I still don't quite understand, after skimming the paper. How does it achieve high scores without access to the images (beating even humans with access to the images)?
I think the bigger takeaway here is that 50% of the time doctors will miss what you have.
That's not a takeaway here at all.
It's that 50% of the time, ER doctors working solely from notes - something they never do - in a situation they know is only for a study, will miss what you have.
In real clinical situations the doctors see, hear, smell, and interact with the patients.
I believe in modern medicine but I lost some faith in the American institutions around it when I "diagnosed" my partner with the correct disease that the first rheumatologist dismissed and told them to just stretch. It was officially diagnosed years later, and we lost a lot of time because of it.
I'm even more concerned that current models are not trained to say no, or to even recognize most failure modes.
"Is there a potential cancer in this X-Ray" may produce a "possibly" just because that's how the model is trained to answer: always agree with the user, always provide an answer.
Oh, and don't forget that "Is there a potential cancer in this X-Ray" and "Are there any potential problems in this X-Ray" are two completely different prompts that will lead to wildly different answers.
I'm surprised at both the article and the paper - both seem very hyperbolic. This is LLMs competing against doctors in a way that is heavily weighted in the LLMs favour, which does not represent clinical practice. These reasoning cases are not benchmarks for doctors, they are learning tools.
I think it's important to note that diagnosis also relies on accurate description of the patient in the first place, and the information you gather depends on the differential diagnosis. Part of the skill of being a doctor is gathering information from lots of different sources, and trying to filter out what is important. This may be from the patient, who may not be able to communicate clearly or may be non verbal, carers and next of kin. History-taking is a skill in itself, as well as examination. Here those data are given.
For pattern recognition from plain text, especially on questions that may be in o1's training data, I'm not surprised at all that it would outperform doctors, but it doesn't seem to be a clinically useful comparison. Deciding which investigations to do, any imaging, and filtering out unnecessary information from the history is a skill in itself, and can't really be separated from forming the diagnosis.
Gell-Mann Amnesia kicks in hard as soon as the LLM topic changes to a profession other than our own. It’s much easier to believe an LLM can outperform someone else doing their job than to believe that it’s a good idea to replace your own work with an LLM.
The number in the headline isn’t even a good comparison because they asked doctors to make a diagnosis from notes a nurse typed up. Doctors are trained to be conservative with diagnosing from someone else’s notes because it’s their job to ask the patient questions and evaluate the situation, whereas an LLM will happily leap to a conclusion and deliver it with high confidence
When they allowed both the human doctors and the AI access to more information about the case, the difference between groups collapsed into statistical insignificance:
> The diagnosis accuracy of the AI – OpenAI’s o1 reasoning model – rose to 82% when more detail was available, compared with the 70-79% accuracy achieved by the expert humans, though this difference was not statistically significant.
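To see why a gap like 82% vs 70-79% can still fail to reach significance, here is a minimal sketch of a standard two-proportion z-test; the case counts are invented for illustration, not the study's actual numbers.

```python
# Two-proportion z-test sketch: whether a gap like 82% vs 79% is "significant"
# depends heavily on how many cases were graded. The counts below are invented.
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(p1: float, n1: int, p2: float, n2: int) -> float:
    """Two-sided p-value for H0: both groups have the same true accuracy."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(two_proportion_p_value(0.82, 80, 0.79, 80))      # ~0.63: nowhere near significant
print(two_proportion_p_value(0.82, 5000, 0.79, 5000))  # same gap, huge sample: well below 0.05
```

With only dozens of vignettes per arm, a few percentage points of difference is well within noise.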
Talking to my medical professional friends, LLMs are becoming a supercharged version of Dr. Google and WebMD that fueled a lot of bad patient self-diagnoses in the past. Now patients are using LLMs to try to diagnose themselves and doing it in a way where they start to learn how to lead the LLM to the diagnosis they want, which they can do for a hundred rounds at home before presenting to the doctor and reciting the script and symptoms that worked best to convince the LLM they had a certain condition.
> "An AI and a pair of human doctors were each given the same standard electronic health record to read"
This is handicapping the human doctors' abilities. There is a lot more information a human doctor can gather even with a brief observation of the patient.
On the other hand,
> there are few things as dangerous as an expert with access to open-ended data that can be interpreted wildly, like a clinical interview.
https://entropicthoughts.com/arithmetic-models-better-than-y...
Agreed. I think the best use of this sort of tech is to use both to their strengths. Use AI to go over the record and suggest diagnoses which you have the doctor review after observing the patient.
The other thing is that common issues are common. I have to wonder how much that ultimately biases both the doctor and the LLM. If you diagnose someone that comes in with a runny nose and cough as having the flu you will likely be right most of the time.
This feels like a deeply important observation. Now also, would be interesting to include e.g. a short video or photograph for the AI to use as well.
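To make the "common issues are common" point above concrete, a toy sketch (the case mix is invented): a rule that always guesses the most common diagnosis scores high on accuracy while missing exactly the cases that matter.

```python
# "Common issues are common": always predicting the majority diagnosis looks accurate.
# The case mix below is made up for illustration.
cases = ["flu"] * 85 + ["pneumonia"] * 10 + ["pulmonary embolism"] * 5

def always_guess_flu(_symptoms: str) -> str:
    return "flu"

correct = sum(always_guess_flu(case) == case for case in cases)
print(correct / len(cases))  # 0.85 accuracy, yet every dangerous case is missed
```

Which is why a headline accuracy number alone says little about whether the doctor or the LLM catches the rare, high-stakes cases.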
Bonus, health networks now push doctors to use AI transcription software for the EHR entries. Doctors and nurses like it because they don't have to type it up. But it is a complete shitshow on whether the records are reviewed for transcription errors which happen quite often
Now feed a flawed transcript into an AI diagnosis system and bam-o. The AI will treat it as gospel, while the doctor may go "wait, what?"
I wonder about the nuance within the data. Like does AI do much worse with children than adults, but still better overall for example. Or biological male vs female. I think we'd want it to do better across all groups, ages etc so we're not introducing some kind of horrible bias resulting in deaths or serious health consequences for some groups
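One way to check for exactly that kind of hidden bias is to report accuracy per subgroup rather than a single overall number. A small sketch with invented results (not data from the study):

```python
# Per-subgroup accuracy check: an overall number can hide a badly underserved group.
# The (group, correct) pairs below are invented for illustration.
from collections import defaultdict

results = [("adult", True)] * 90 + [("adult", False)] * 10 \
        + [("child", True)] * 12 + [("child", False)] * 8

tally = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
for group, correct in results:
    tally[group][0] += int(correct)
    tally[group][1] += 1

for group, (num_correct, num_total) in tally.items():
    print(group, num_correct / num_total)                     # adult 0.90, child 0.60
print("overall", sum(c for _, c in results) / len(results))   # 0.85 hides the gap
```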
Believable and not shocking. LLMs literally may have saved my sons, and potentially their mother too, by allowing us to fact-check a lot of nonsense data and scare tactics by a group of at least 5 different doctors ambushing us to make a life-changing decision in minutes. The problem is that doctors, at least in the US, prioritize liability exposure over patients' long-term outcomes. Let's say you need an intervention where two options A and B are available to you. A carries a 1% risk of complications but a great outcome. Option B has a 0.1% risk of complications, but once you are discharged the short-term effects are challenging and the long-term effects are not well understood. Well, 10/10 times doctors will suggest option B and will do anything they can to nudge you into making that choice, like not telling you the absolute numbers and constantly using the word "death". They also lie about the outcomes, because again, once you accept the procedure, sign, and are sent home, they have nothing to do with you.
Besides myself and my wife, I've also used LLMs to diagnose my dogs. I'm convinced there's a huge opportunity for AI-based veterinary care, especially one which then performs bidding across the local veterinary clinics to perform the care/surgeries. I've noticed that local vets vary in price by more than an order of magnitude. My 80-year-old mother and mother-in-law have been regularly scammed by overcharging vets, and with their dogs being a major part of their lives, they're extremely susceptible to pressure.
I don't think AI is a good use case for such critical situations. Maybe in a decade we have AI help out doctors with doing a pre check. What if Ai finds nothing and the doctor does not bother to look into it further? It is this small question which breaks the technology from any angle later down the road from my POV. AI has to stay optional here.
Even if AI is used to sample or summarize a lot of data that a human couldn't do in time: What if it misses something that a human won't? What if a human inversely misses something that AI won't? Would you rather trust the machine or the human? (Especially if the human is held accountable.)
As a 37 year old male with 2 THRs I'm glad the AI was NOT used in my diagnosis. All the models that I used to look at my x-rays said nothing was wrong, even when adding symptoms. When adding age it said the patient was too young.
(I was ~3 months away from wheelchair bound in those x-rays).
The worst one was Gemini. Upload an x-ray of just the right hip, and it started to talk about how good the left hip looked like.
I think with AI taking over it's gonna be harder to get a solution when your problem isn't run-of-the-mill.
Sorry, it's on the entirely wrong side of the spectrum. We're doing geospatial analysis. Although it'd be hilarious to see what it thinks about X-Rays.
All versions and levels of Gemini have terrible spatial reasoning. I don't know why. That kind of task seems to be simply outside of the abilities of the model.
LLMs can be a useful second opinion for a highly educated patient with good insight into their health and body, but this is not the average patient I see in an urban emergency department. Many patients can't give a cohesive history without a skilled clinician who can ask the right questions and read between the lines.
I am very skeptical of studies like this that don't adequately reflect real world conditions, but when I was a software engineer I probably wouldn't have understood what "real" medicine is like either.
All the other points raised in this thread aside, it seems like an odd thing to benchmark because a significant proportion of ER practice is dealing with emergencies, often accidental injuries. There's not a whole lot of diagnosing going on if you show up to the ER with a gash on your forehead or a missing finger.
I still want humans in the loop, interpreting the LLMs findings and providing a sanity check.
You can’t hold an LLM accountable.
That's the minimum responsible bar for LLM-authored code, which normally doesn't really matter much. For something as important as ER diagnostics, having a human in the loop is crucial.
The narrative that these tools are replacing human intelligence rather than augmenting it is, quite frankly, stupid.
At this point the study is already mostly irrelevant because the model in question has long been far surpassed by new models. It seems traditional publishing doesn't work for really fast moving fields.
5. Private Equity uses this valuable data to stack rank doctors based on how correct / AI-aligned their diagnoses are over time
6. Rankings are used to periodically "trim the fat", thus delivering more optimized cash flows to clinics that have been saddled with toxic debt
7. Sensing an opportunity AI providers start selling a $200 / month Data Leakage as a Service subscription to overworked physicians so that they can avoid the PE guillotine
Why would private equity want more competent doctors?
Incompetent ones order unnecessary tests and exhaust treatment possibilities, which drives up cost billed to insurance.
Only the insurance industry and perhaps licensing bodies can pressure to keep the quality floor high, at least in terms of accurate diagnosis and prevention of overtreatment.
A more realistic step 7 is that physicians gradually align their diagnoses with the LLM as they sacrifice to Moloch in order to (temporarily) game the metric. Eventually the humans become little more than an imperfect proxy for the LLMs and are eliminated.
I agree with GP's solution but we'd need regulation to prohibit what you describe.
5. Doctors delegate everything to AI assistants because humans are lazy, especially if those AI assistants are correct some significant portion of the time
Then the claim may be that you don't need that many doctors anymore and that one doctor can do the job of X doctors in less time. That has the economic effect that there is less demand for (and supply of) doctors, which then results in a home-grown shortage of doctors, since fewer people are incentivized to become doctors...
Radiology already had its "AI beats doctors" moment. Radiologists are still here. What changed first was the workflow, not the specialty. ER is probably next.
It is easy to overinterpret this based on the headline, the doctors were actually at a slight disadvantage. This isn't how they normally work, this is a little more like a med school pop quiz:
An AI and a pair of human doctors were each given the same standard electronic health record to read – typically including vital sign data, demographic information and a few sentences from a nurse about why the patient was there. The AI identified the exact or very close diagnosis in 67% of cases, beating the human doctors, who were right only 50%-55% of the time.... The study only tested humans against AIs looking at patient data that can be communicated via text. The AI’s reading of signals, such as the patient’s level of distress and their visual appearance, were not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork.
"I don't know, let's run more tests" is also a very important ability of doctors that was apparently not tested here. In addition to all the normal methodological problems with overinterpreting results in AI/LLMs/ML/etc. Sadly I do think part of the problem here is cynical (even maniacal) careerist doctors who really shouldn't be working at hospitals. This means that even though I am generally quite anti-LLM, and really don't like the idea of patients interacting with them directly, I am a little optimistic about these being sanity/laziness checkers for health professionals.
I’ve had much better luck with diagnosis of my own family’s issues than with doctors. Usually now, I’m feeding them more information to begin with, so that their 30 minute office visits are not wasted, requiring another expensive follow up appointment.
While I’m sure there can be ways in which such studies are wrong, it’s very obvious that AI can accelerate work in many of these areas where we seek out professional help - doctors, lawyers, etc.
It can speed up some aspects of the work, but please don't trust some LLM with variable output quality more than a professional. If you don't like your current doctor, try another; most are in the business of helping other people.
If you have a string of issues with your last 10 doctors, though, then the issue is, most probably, you...
My wife is a GP, and easily 1/3 of her patients have also some minor-but-visible mental issue. 1-2 out of 10 scale. Makes them still functional in society but... often very hard to be around with.
That doesn't mean I don't trust your words, there are tons of people with either rare issues or even fairly common ones but manifesting in non-standard way (or mixed with some other issue). These folks suffer a lot to find a doctor who doesn't bunch them up in some general state with generic treatment. There are those, but not that often.
It helps both sides tremendously if the patient is not an arrogant know-it-all waving ChatGPT in the doctor's face and basically just coming for a prescription after self-diagnosis. Then, help is sometimes proportional to the situation and lawful obligations.
Respectfully, as someone with a family with plenty of medical issues and having experienced plenty of useless doctors, the onus is now on medical professionals to prove their worth. They are a second option and most of their remaining value is in the license to prescribe medication, after being told by laymen what medication is appropriate. They're using the same tools I am and they're worse at evaluating them.
Doctors thinking patients are arrogant is an age old problem.
Not only should AI misdiagnose to save lives, but a human should too. You walk in with symptoms that most likely is a harmless virus that clears up on its own or 5% of the time is a deadly bacteria. The correct course of action is to try to test if it is the 5% case (most often the wrong diagnosis), not send people home because they are most likely fine. Many cases have a similar low but not 0 risky diagnosis.
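As a back-of-the-envelope expected-cost sketch of that point (all numbers invented, purely illustrative), the asymmetry is stark: ordering the "usually unnecessary" test is still the right policy.

```python
# Toy asymmetric-cost sketch for the point above. All numbers are made up.
P_DEADLY = 0.05        # chance the symptoms are actually the dangerous bacteria
COST_TEST = 200        # cost of testing everyone who presents this way
COST_MISS = 500_000    # cost of sending the one-in-twenty deadly case home untreated

cost_if_we_always_test = COST_TEST
expected_cost_if_we_send_home = P_DEADLY * COST_MISS

print(cost_if_we_always_test)         # 200
print(expected_cost_if_we_send_home)  # 25000.0 -> testing wins, even though it's
                                      #   "unnecessary" 95% of the time
```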
Unfortunately, from my understanding Doctors don't necessarily diagnose for accuracy, they often diagnose to limit liability.
They aren't going to take a stab at an uncommon diagnosis even if it occurs to them, if they might get sued if they're wrong.
Edit: I'm not trying to say Doctors deliberately diagnose wrong. Just that if there are two possible diagnoses, one common that matches some of the symptoms and one rare that matches all symptoms, doctors are still much more likely to diagnose the common one. Hoofbeats, horses, zebras, etc
The Guardian needs to raise its bar on what to report and how to give readers full context on the ongoing NFT/AI/"trust me bro"/crypto scam, and that context would be that this is a mathematical model of human language, not a medical expert or a replacement for one.
Fair enough. But there's a lot of faulty and wrong peer-reviewed research as well. One such paper comes to mind which is probably cited some 7000+ times in other papers but is itself wrong.
Humans could not diagnose and treat me correctly. They almost killed me. Curious where I could feed my symptoms and the same data I gave to an ER to an AI to test it.
I'd love to see a follow-up to that radiologist evaluation, where it failed so miserably at the thing it was supposed to be best at that now there's a shortage of radiologists.
Not an expert but what I’ve heard is that AI-based radiology analysis has brought down prices so much that there’s been a huge increase in demand, which has led to employee shortages.
I'd be very very hesitant to trust studies like this. It's very easy to mess up these benchmarks.
See for example this recent paper where AI managed to beat radiologists on interpreting x-rays... when the AI didn't even have access to the x-rays: https://arxiv.org/pdf/2603.21687 (on a pre existing "large scale visual question answering benchmark for generalist chest x-ray understanding" that wasn't intentionally messed up).
And in interpreting x-ray's human radiologists actually do just look at the x-rays. In the context the article is discussing the human doctors don't just look at the notes to diagnose the ER patient. You're asking them to perform a task that isn't necessary, that they aren't experienced in, or trained in, and then saying "the AI outperforms them". Even if the notes aren't accidentally giving away the answer through some weird side channel, that's not that surprising.
Which isn't to say that I think the study is either definitely wrong, or intentionally deceptive. Just that I wouldn't draw strong conclusions from a single study here.
I agree with you on this specific study, however, I can't really wrap my head about the fact that doctors will be better than AI models on the long-run. After all, medicine is all about knowledge, experience and intelligence (maybe "pattern recognition"), all those, we must assume that the best AI models (especially ones focusing solely in the medical field) would largely beat large majority of humans (aka doctors), if we already have this assumption for software engineers, we should have it for this field as well, and let's be realistic, each time I've seen a doc the last few months (and ER twice), each time they were using ChatGPT btw (not kidding, it chocked me).
So I’m genuinely curious:
What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside, let's be purely objective about it.
To answer your question: talking to a human.
Medicine is so much more than "knowledge, experience, and pattern matching", as any patient ever can attest to. Why is it so hard for some people to understand that humans need other humans and human problems can't be solved with technology?
Doctors are not necessarily great at talking to patients and patients are unhappy with the information Doctors provide. This moat has dried up.
So much of what I know from women in my life is that the human element of medicine is almost a strict negative for them. As a guy it hasn't been much better, but at least doctors listen to me when I say something.
One of, if not THE biggest challenge in getting treatment is getting past insurance rules designed to deny treatment. This is much, much easier when you're able to convince a doctor (and/or trained medical staff) to argue on your behalf. If you can't get those folks to listen to you, that's probably not gonna happen. You might have to go through several different practices before you find a sympathetic ear.
Now replace some / all of those humans with... A machine whose function also needs insurance approval.
It's gonna end badly.
Sounds like we need to dismantle and replace this broadly dysfunctional system at multiple points. It's not like the US insurance landscape is anywhere close to the best way of handling healthcare if you look at many places in the world.
Yeah that's mostly a US problem. Not a Healthcare problem in general.
At which point I'd ask: how much of that is baked into the AI now?
It doesn't have opinions, research, direction of its own. Is this a path of codifying the worst elements of human society as we've known it, permanently?
One doctor didn't want to give me ritalin, so i went to another one.
One was against it, the other one saw it as a good idea.
I would love to have real data, real statistics etc.
I think there's a real space there, and a lot of what e.g. nurses and doctors do is talking to humans, and that won't go away.
But two facts are also true: a) diagnosis itself can be automated. A lot of what goes on between you having an achy belly and you getting diagnosed with x y or z is happening outside of a direct interaction with you - all of that can be augmented with AI. And b), the human interaction part is lacking a great deal in most societies. Homeopathy and a lot of alternative medicine from what I can see has its footing in society simply because they're better at talking to people. AI could also help with that, both in direct communication with humans, but also in simply making a lot of processes a lot cheaper, and maybe e.g. making the required education to become a human facing medicinal professional less of a hurdle. Diagnosis becomes cheaper & easier -> more time to actually talk to patients, and more diagnosises made with higher accuracy.
> human problems can't be solved with technology
How are you defining technology? How are you defining human problems? Inventions are created to solve human problems, not theoretical problems of fictional universe. Do X-rays, refrigerators, phones and even looms solve problems for nonhumans?
Claiming something that sounds deep doesn’t make it an axiom.
>Medicine is so much more than "knowledge, experience, and pattern matching", as any patient ever can attest to.
Humans (doctors/nurses) can still be there to make you feel the warmth of humanity in your darkest times, but if a machine is going to perform better at diagnosing (or perhaps someday performing surgery), then I want the machine.
Even now, I'll take a surgeon that's a complete jerk over a nice surgeon any day, because if they've got that job even as a jerk they've got to be good at their jobs. I want results. I'll handle hurt feelings some other time.
I'd be a little bit careful here - being a jerk is quite different to non-conformity / red sneaker effect in surgery and it is not a quality you should look for.
The truly compassionate surgeons will want to improve their skills because they care about their patients. They care if they develop complications and may feel terrible if they do, the jerk may not. Being a jerk may mean that the surgeon can rise to the top, but it may not be due to surgical skill at all, they may be better at navigating politics etc.
"Human problems can't be solved with technology" is just wrong, unless you have narrower definitions of a "human problem" or "technology".
For instance, transportation is a "human problem". It's being successfully solved with such technologies as cars, trains, planes, etc. Growing food at scale is a "human problem" that's being successfully solved by automation. Computing... stuff could be a "human problem" too. It's being successfully solved by computers. If "human problems" are more psychological, then again, you can use the Internet to keep in touch with people, so again technology trying to solve a human problem.
The human doesn't need to be as highly trained and paid as a doctor if the human is not performing tasks concordant with that training.
If you read the study, the whole conclusion is much less spectacular than the article. What the article really pushes happened:
patients -> AI -> diagnosis (you know, with a camera, or perhaps a telephone I guess)
What REALLY happened
patients -> nurse/MD -> text description of symptoms -> MD -> question (as in MD asked a relevant diagnostic question, such as "is this the result of a lung infection?", or "what lab test should I do to check if this is a heart condition or an infection?") -> AI -> answer -> 2 MDs (to verify/score)
vs
patients -> nurse/MD -> text description of symptoms -> MD -> question -> (same or other) MD -> answer -> 2 MDs verify/score the answer
Even with that enormous caveat, there's major issues:
1) The AI was NOT attempting to "diagnose" in the doctor House sense. The AI was attempting to follow published diagnostic guidelines as perfectly as possible. A right answer by the AI was the AI following MDs advice, a published process, NOT the AI reasoning it's way to what was wrong with the patient.
2) The MD with AI support was NOT more accurate (better score but NOT statistically significant, hence not) than just the MD by himself. However it was very much a nurse or MD taking the symptoms and an MD pre-digesting the data for to the AI.
3) Diagnoses were correct in the sense that it followed diagnostic standards, as judged afterwards by other MDs. NOT in the sense that it was tested on a patient and actually helped a live patient (in fact there were no patients directly involved in the study at all)
If you think about it in most patients even treating MDs don't know the correct conclusion. They saw the patient come in, they took a course of action (probably wrote at best half of it down), and the situation of the patient changed. And we repeat this cycle until patient goes back out, either vertically or horizontally. Hopefully vertically.
And before you say "let's solve that" keep in mind that a healthy human is only healthy in the sense that their body has the situation under control. Your immune system is fighting 1000 kinds of bacteria, and 10 or so viruses right now, when you're very healthy. There are also problems that developed during your life (scars, ripped and not-perfectly fixed blood vessels, muscle damage, bone cracks, parts of your circulatory system having way too much pressure, wounds, things that you managed to insert through your skin leaking stuff into your body (splinters, insects, parasites, ...), 20 cancers attempting to spread (depends on age, but even a 5 year old will have some of that), food that you really shouldn't have eaten, etc, etc, etc). If you go to the emergency room, the point is not to fix all problems. The point is to get your body out of the worsening cycle.
This immediately calls up the concern that this is from doctor reports. In practice, of course, maybe the AI only performs "better" because a real doctor walked up to the patient and checked something for himself, then didn't write it down.
What you can perhaps claim this study says is that in the right circumstances AIs can perform better at following a MD's instructions under time and other pressure than an actual MD can.
This. The fact that the ai projects have to spin so hard should be tipping people off. But for some reason it doesn’t.
This is extreme cope.
I would personally vastly, vastly prefer to go to a robot doctor, who diagnoses, treats and nurses me. What exactly do I need from a human here? Except of course being the one making the system.
Emotional support. Some human doctors absolutely radiate confidence and a kind of "you're gonna be okay" attitude. For me, this helps a lot. I'm not sure a machine can do this.
But I hate if the human doctor "radiates confidence" when I know he is not doing the proper scan, because I have to get back with worse symptoms first for him to take it serious. I don't need emotional support from a human doctor. I need the adequate scans and a proper analysis. I am pretty sure that a competent human will be still way better than AI, but AI even now will likely be better than a doctor not really paying attention.
You can hopefully get emotional support from your loved ones. If not a coach seems much more appropriate.
Technology is on a generational 10,000 year run of non-stop successfully solving human problems.
and causing them
> What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside, let's be purely objective about it.
Being a human when a patient is experiencing what is potentially one of the worst moments of their life. AI could be a tool doctors use, but let’s not dehumanize health care further, it is one of the most human professions that crosses about every division you can think of.
I would not want to receive a cancer diagnosis from a fucking AI doctor.
On the other hand, health care is not scaling to meet the growing demand of societies (look at the growing wait queues for access to basic medical attention in most Western nations). The cause of this is a separate topic and something that deserves more attention than it currently gets, but I digress. If AI can fill the gap by making 24/7/265 instant diagnosis and early intervention a reality, with it then bringing a human into the loop when actually necessary... I think that is something worth pursuing as a force multiplier.
We're clearly not there yet, but it is inevitible that these models will eventually exceed human capability in identifying what an issue is, understanding all of the health conditions the patient has, and recommending a treatment plan that results in the best outcome.
You may not want to receive a cancer diagnosis from an AI doctor... but if an AI doctor could automatically detect cancer (before you even displayed symptoms) and get you treated at a far earlier date than a human doctor, you would probably change your mind.
You commonly receive very close proxies for diagnoses through MyChart already when results come back from the lab.
There are a few sides to medicine:
1) looking at tests and working out a set of actions
2) following a pathway based on diagnosis
3) pulling out patient history to work out what the fuck is wrong with someone.
Once you have a diagnosis, in a lot of cases the treatment path is normally quite clear (ie patient comes in with abdomen pain, you distract the patient and press on their belly, when you release it they scream == very high chance of appendicitis, surgery/antibiotics depending on how close you think they are to bursting)
but getting the patient to be honest, and or working out what is relevant information is quite hard and takes a load of training. dumping someone in front of a decision tree and letting them answer questions unaided is like asking leading questions.
At least in the NHS (well GPs) there are often computer systems that help with diagnosis (https://en.wikipedia.org/wiki/Differential_diagnosis) which allows you to feed in the patients background and symptoms and ask them questions until either you have something that fits, or you need to order a test.
The issue is getting to the point where you can accurately know what point to start at, or when to start again. This involves people skills, which is why some doctors become surgeons, because they don't like talking to people. And those surgeons that don't like talking to people become orthopods. (me smash, me drill, me do good)
Where AI actually is probably quite good is note taking, and continuous monitoring of HCU/ICU patients
But liability and ethics cannot be put aside. If treatments were free of cost and perfectly address problems, then a correct diagnosis would always lead to the optimal patient outcome. In that scenario, AI diagnosis will be like code generation and go asymptotic to perfection as models improve.
But a doctor's job in the real world today is to navigate a total mess of uncertainty: about the expected outcome of treatments given a patient's age and other peoblems. About the psychological effect of knowing about a problem that they cannot effectively treat. Even about what the signals in the chart and x-ray mean with any certainty.
We are very far from having unit test suites for medical problems.
Isn't that conflating diagnosis and treatment plan?
Sure, but my anecdotal experience is that doctors do this regularly in real life, especially when choosing to diagnose or ignore problems that are unlikely to kill an aging patient before some other larger issue does.
Gotcha, I was thinking more about radiologists than patient-facing doctors.
Radiologists do it too.
> I can't really wrap my head about the fact that doctors will be better than AI models on the long-run.
Nobody said that though?
If the current trajectory continues and if advancements are made regarding automated data collection about patients and if those advancements are adopted in the clinic then presumably specialized medical models will exceed human performance at the task of diagnosis at some point in the future. Clearly that hasn't happened yet.
Until medical models can contrive of unique diagnosis, this will not be true and cannot be true.
Medical models can absolutely get better at recognizing the patterns of diagnosis that doctors have already been diagnosing - which means they will also amplify misdiagnosis that aren't corrected for via cohort average. This is easy to see a large problem with: you end up with a pseudo-eugenics medical system that can't help people who aren't experiencing a "standard" problem.
The pitfall you describe is not inconsistent with exceeding human performance by most metrics.
I'd argue that the current system in the west already exhibits this problem to some extent. Fortunately it's a systemic issue as opposed to a technical one so there's no reason AI necessarily has to make it worse.
This study is based almost entirely on pre-existing "vignettes." In other words, on tests that are already known and have existed for years, the model did well, which is precisely what you should expect.
It provides no information on real world outcomes or expectations of performance in such a setting. A simple question might be "how accurate are patient electronic health records typically?"
Finally, if the Internet somehow goes down at my hospital, the Doctor can still think, while LLM services cannot. If the power goes out at the hospital, the Doctor can still operate, while even local LLMs cannot.
You're going to need to improve the power efficiency of these models by at least two orders of magnitude before they're generally useful replacements of anything. As it is now they're a very expensive, inefficient and fragile toy.
Medicine is about knowledge, but acquiring knowledge may in fact require "breaking out of the box" that AI is increasing behind to avoid touching "touchy subjects" or insulting anyone and so on.
I think it's plausible since doctors tend to have human cognitive biases and miss things. People tend to fixate on patterns they're most familiar with.
A bold claim to suggest that LLMs aren’t prone to biases of their own which are less understood.
I think AI can be useful in any kind of context interpretation, but not make a decision.
Could be running in the background on patient data and message the doctor "I see X in the diagnostic, have you ruled out Y, as it fits for reasons a, b, c?"
I like my coding agents the same way, inform me during review on things that I've missed. Instead of having me comb through what it generates on a first pass.
Weird that this is the case and a new study.
but those kind of x-ray models are already activly used. They are not used though as a only and final diagnosis. Its more like peer review and priorization like check this image first because it seems most critical today.
These type of experiments are bound to have biases depending on who is doing it and who is funding it. The experiment is being funded for a particular reason itself to move the narrative in a desired direction. This is probably a good reason to have government funded research in these type of sensitive areas.
hallucination on steroids, wow. I had to read through the abstract to believe it:
"In the most extreme case, our model achieved the top rank on a standard chest Xray question-answering benchmark without access to any images."
I still don't quite understand, after skimming the paper. How does it achieve high scores without access to the images (beating even humans with access to the images)?
I think the bigger takeaway here is that 50% of the time doctors will miss what you have.
That's not a takeaway here at all.
It's 50% of the time ER doctors working solely from notes, something they never do, in a situation they know is only for a study, will miss what you have.
In real clinical situations the doctors see, hear, smell, and interact with the patients.
I believe in modern medicine but I lost some faith in the American institutions around it when I "diagnosed" my partner with the correct disease that the first rheumatologist dismissed and told them to just stretch. It was officially diagnosed years later, and we lost a lot of time because of it.
I'm even more concerned that current models are not trained to say no, or to even recognize most failure modes.
"Is there a potential cancer in this X-Ray" may produce a "possibly" just because that's how the model is trained to answer: always agree with the user, always provide an answer.
Oh, and don't forget that "Is there a potential cancer in this X-Ray" and "Are there any potential problems in this X-Ray" are two completely different prompts that will lead to wildly different answers.
I'm surprised at both the article and the paper - both seem very hyperbolic. This is LLMs competing against doctors in a way that is heavily weighted in the LLMs favour, which does not represent clinical practice. These reasoning cases are not benchmarks for doctors, they are learning tools.
I think it's important to note that diagnosis also relies on accurate description of the patient in the first place, and the information you gather depends on the differential diagnosis. Part of the skill of being a doctor is gathering information from lots of different sources, and trying to filter out what is important. This may be from the patient, who may not be able to communicate clearly or may be non verbal, carers and next of kin. History-taking is a skill in itself, as well as examination. Here those data are given.
For pattern recognition from plain text, especially on questions that may be in the o1's training data, I'm not surprised at all that it would outperform doctors, but it doesn't seem to be a clinically useful comparison. Deciding which investigations to do, any imaging, and filtering out unnecessary information from the history is a skill in itself, and can't really be separated from forming the diagnosis.
Gell-Mann Amnesia kicks in hard as soon as the LLM topic changes to a profession other than our own. It’s much easier to believe an LLM can outperform someone else doing their job than to believe that it’s a good idea to replace your own work with an LLM.
The number in the headline isn’t even a good comparison because they asked doctors to make a diagnosis from notes a nurse typed up. Doctors are trained to be conservative about diagnosing from someone else’s notes because it’s their job to ask the patient questions and evaluate the situation, whereas an LLM will happily leap to a conclusion and deliver it with high confidence.
When they allowed both the AI and the doctors access to more information about the case, the difference between groups collapsed into statistical insignificance:
> The diagnosis accuracy of the AI – OpenAI’s o1 reasoning model – rose to 82% when more detail was available, compared with the 70-79% accuracy achieved by the expert humans, though this difference was not statistically significant.
Talking to my medical professional friends, LLMs are becoming a supercharged version of Dr. Google and WebMD that fueled a lot of bad patient self-diagnoses in the past. Now patients are using LLMs to try to diagnose themselves and doing it in a way where they start to learn how to lead the LLM to the diagnosis they want, which they can do for a hundred rounds at home before presenting to the doctor and reciting the script and symptoms that worked best to convince the LLM they had a certain condition.
> "An AI and a pair of human doctors were each given the same standard electronic health record to read"
This handicaps the human doctors' abilities. There is a lot more information a human doctor can gather, even from a brief observation of the patient.
On the other hand,
> there are few things as dangerous as an expert with access to open-ended data that can be interpreted wildly, like a clinical interview.
https://entropicthoughts.com/arithmetic-models-better-than-y...
Agreed. I think the best use of this sort of tech is to use both to their strengths. Use AI to go over the record and suggest diagnoses which you have the doctor review after observing the patient.
The other thing is that common issues are common. I have to wonder how much that ultimately biases both the doctor and the LLM. If you diagnose someone that comes in with a runny nose and cough as having the flu you will likely be right most of the time.
This feels like a deeply important observation. Now also, would be interesting to include e.g. a short video or photograph for the AI to use as well.
Bonus: health networks now push doctors to use AI transcription software for the EHR entries. Doctors and nurses like it because they don't have to type it up. But it's a complete shitshow as to whether the records are reviewed for transcription errors, which happen quite often.
Now feed a flawed transcript into an AI diagnosis system and bam-o. The AI will treat it as gospel, while the doctor may go "wait, what?"
I wonder about the nuance within the data. For example, does the AI do much worse with children than adults, but still better overall? Or with biological males vs. females? I think we'd want it to do better across all groups, ages, etc., so we're not introducing some kind of horrible bias resulting in deaths or serious health consequences for some groups.
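To make that concrete, this is the kind of subgroup breakdown I'd want to see published alongside the headline number. It's just a toy sketch; the dataframe, column names, and values are invented for illustration:

```python
# Hypothetical subgroup accuracy check; the data and column names are
# assumptions, not from the study. The point is that a good overall number
# can hide large gaps between groups.
import pandas as pd

cases = pd.DataFrame({
    "age_group":  ["child", "child", "adult", "adult", "adult", "elderly"],
    "sex":        ["F", "M", "F", "M", "F", "M"],
    "ai_correct": [0, 1, 1, 1, 1, 1],   # 1 = AI diagnosis matched ground truth
})

print("overall accuracy:", cases["ai_correct"].mean())
print(cases.groupby("age_group")["ai_correct"].mean())
print(cases.groupby(["age_group", "sex"])["ai_correct"].mean())
```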
Believable and not shocking. LLMs literally may have saved my sons, and potentially their mother too, by allowing us to fact-check a lot of nonsense data and scare tactics from a group of at least 5 different doctors ambushing us to make a life-changing decision in minutes. The problem is that doctors, at least in the US, prioritize liability exposure over patients' long-term outcomes.
Let's say you need an intervention where two options, A and B, are available to you. A carries a 1% risk of complications but a great outcome. Option B has a 0.1% risk of complications, but once you are discharged the short-term effects are challenging and the long-term effects not well understood. Well, 10/10 times doctors will suggest option B and will do anything they can to nudge you into making that choice, like not telling you the absolute numbers and constantly using the word "death". They also lie about the outcomes, because again, once you accept the procedure, sign, and are sent home, they have nothing more to do with you.
Needless conspiracy bullshit without sharing specifics
Besides myself and my wife, I've also used LLMs to diagnose my dogs. I'm convinced there's a huge opportunity for AI-based veterinary care, especially one that then runs bidding across local veterinary clinics to perform the care/surgeries. I've noticed that local vets vary in price by more than an order of magnitude. My 80-year-old mother and mother-in-law have been regularly scammed by overcharging vets, and with their dogs being a major part of their lives, they're extremely susceptible to pressure.
I advise a medical non-profit, and we ran a series of tests against cases doctors input to our system looking for specialist recommendations.
We found that gpt-5-mini performed better than gpt-5, sonnet 4, and medgemma.
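To be clear, this isn't our actual harness, just a rough sketch of the general shape such a comparison can take; ask_model is a hypothetical stand-in for whatever client each provider needs, and exact-match scoring on the recommended specialty is a simplification:

```python
# Simplified sketch of scoring specialist recommendations across models.
# ask_model is a stand-in, not a real API; wire up real clients as needed.
from collections import defaultdict

def ask_model(model: str, case_text: str) -> str:
    """Stand-in: return the specialist the given model recommends for a case."""
    raise NotImplementedError  # replace with the real per-provider call

def score(models: list[str], cases: list[dict]) -> dict[str, float]:
    correct = defaultdict(int)
    for case in cases:
        for m in models:
            if ask_model(m, case["text"]).strip().lower() == case["specialist"]:
                correct[m] += 1
    return {m: correct[m] / len(cases) for m in models}

# cases = [{"text": "de-identified case text...", "specialist": "rheumatology"}, ...]
# print(score(["gpt-5-mini", "gpt-5", "sonnet-4", "medgemma"], cases))
```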
I think these studies are very hard to accurately score. But in any case, AI seems to do a very good job compared to humans. Unsurprising, really.
I don't think such critical situations are a good use case for AI. Maybe in a decade we'll have AI helping doctors with a pre-check. But what if the AI finds nothing and the doctor doesn't bother to look into it further? It's this small question that, from my POV, breaks the technology from any angle later down the road. AI has to stay optional here.
Even if AI is used to sample or summarize a lot of data that a human couldn't get through in time: what if it misses something that a human wouldn't? What if a human inversely misses something that the AI wouldn't? Would you rather trust the machine or the human? (Especially if the human is held accountable.)
Who's accountable for the 33%?
As a 37-year-old male with 2 THRs, I'm glad AI was NOT used in my diagnosis. All the models I showed my x-rays to said nothing was wrong, even when I added symptoms. When I added my age, they said the patient was too young.
(I was ~3 months away from wheelchair bound in those x-rays).
The worst one was Gemini. Upload an x-ray of just the right hip, and it started to talk about how good the left hip looked.
I think with AI taking over, it's gonna be harder to get a solution when your problem isn't run-of-the-mill.
The general AI models are useless if you need precision. They are designed to create/analyze pretty pictures.
But specialized models can be inhumanly good. I know, our main product is a model that does _precise_ analysis :)
I'd love to see the output of your system for my x-rays!
Sorry, it's on the entirely wrong side of the spectrum. We're doing geospatial analysis. Although it'd be hilarious to see what it thinks about X-Rays.
All versions and levels of Gemini have terrible spatial reasoning. I don't know why. That kind of task seems to be simply outside of the abilities of the model.
The paper: https://www.science.org/doi/10.1126/science.adz4433 (April 30, 2026)
LLMs can be a useful second opinion for a highly educated patient with good insight into their health and body, but this is not the average patient I see in an urban emergency department. Many patients can't give a cohesive history without a skilled clinician who can ask the right questions and read between the lines.
I am very skeptical of studies like this that don't adequately reflect real world conditions, but when I was a software engineer I probably wouldn't have understood what "real" medicine is like either.
The Pitt third season leak? All of the ER is fired and Robbie is fighting schizophrenia with 15 agents and Dana?
All the other points raised in this thread aside, it seems like an odd thing to benchmark, because a significant proportion of ER practice is dealing with emergencies, often accidental injuries. There's not a whole lot of diagnosing going on if you show up to the ER with a gash on your forehead or a missing finger.
Let’s assume the AI does outperform the doctor.
I still want humans in the loop, interpreting the LLMs findings and providing a sanity check.
You can’t hold an LLM accountable.
That’s the minimum responsible bar for LLM-authored code, which normally doesn’t matter much. For something as important as ER diagnostics, having a human in the loop is crucial.
The narrative that these tools are replacing human intelligence rather than augmenting it is, quite frankly, stupid.
We should embrace these tools.
But, “eliminating doctors”… hardly.
Yes, but what was the overlap?
o1 is several generations old and was released in 2024. Is this some quite old research that took a long time to get published?
It's also important to note that it beat doctors in diagnosing in a way doctors do not diagnose.
Yes, the preprint of the same paper (https://arxiv.org/abs/2412.10849) was first written in December 2024.
This is a rather new article about an old model...
Study design, data collection, analysis, and peer review take time. O1 came out a little over 1.5 years ago
At this point the study is already mostly irrelevant because the model in question has long been far surpassed by new models. It seems traditional publishing doesn't work for really fast moving fields.
I'll repeat my idea on how this MUST be done:
1. AI gets data about the patient and makes a diagnosis. This is NOT shown to doctor yet.
2. Doctor does their stuff, writes down their diagnosis. This diagnosis is locked down and versioned.
3. Doctor sees AI's diagnosis
4. Doctor can adjust their diagnosis, BUT the original stays in the system.
This way the AI stays the assistant and won't affect the doctor's initial decision, but they can change their mind after getting the extra data. A rough sketch of the locking/versioning is below.
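Just to illustrate the idea, a toy sketch of the locked/versioned record I have in mind (class and field names are invented, not any real EHR API):

```python
# Toy sketch of the "lock the doctor's first diagnosis before revealing the AI's"
# workflow. Everything here is invented for illustration only.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DiagnosisRecord:
    patient_id: str
    ai_diagnosis: str                      # step 1: computed up front, kept hidden
    doctor_initial: str | None = None      # step 2: locked once written
    doctor_final: str | None = None        # step 4: may differ after seeing the AI's
    history: list[tuple[datetime, str]] = field(default_factory=list)

    def submit_initial(self, dx: str) -> None:
        if self.doctor_initial is not None:
            raise ValueError("initial diagnosis is locked and cannot be changed")
        self.doctor_initial = dx
        self.history.append((datetime.now(timezone.utc), f"initial: {dx}"))

    def reveal_ai(self) -> str:
        if self.doctor_initial is None:
            raise ValueError("AI suggestion stays hidden until the doctor commits")
        return self.ai_diagnosis            # step 3

    def revise(self, dx: str) -> None:
        self.doctor_final = dx              # step 4: the original stays in history
        self.history.append((datetime.now(timezone.utc), f"revised: {dx}"))
```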
5. Private Equity uses this valuable data to stack rank doctors based on how correct / AI-aligned their diagnoses are over time
6. Rankings are used to periodically "trim the fat", thus delivering more optimized cash flows to clinics that have been saddled with toxic debt
7. Sensing an opportunity AI providers start selling a $200 / month Data Leakage as a Service subscription to overworked physicians so that they can avoid the PE guillotine
Why would private equity want more competent doctors?
Incompetent ones order unnecessary tests and exhaust treatment possibilities, which drives up cost billed to insurance.
Only the insurance industry and perhaps licensing bodies can pressure to keep the quality floor high, at least in terms of accurate diagnosis and prevention of overtreatment.
A more realistic step 7 is that physicians gradually align their diagnoses with the LLM as they sacrifice to Moloch in order to (temporarily) game the metric. Eventually the humans become little more than an imperfect proxy for the LLMs and are eliminated.
I agree with GP's solution but we'd need regulation to prohibit what you describe.
This still promotes metacognitive laziness later down the road as the doctor can hand in something quickly and rely on AI to close that gap.
5. Doctors delegate everything to AI assistants because humans are lazy, especially if those AI assistants are correct some significant portion of the time
Then the claim may be that you don't need that many doctors anymore, and that one doctor can do the job of X doctors in less time. The economic effect is less demand for (and supply of) doctors, which then results in a home-grown shortage of doctors, since fewer people are incentivized to become doctors...
radiology already had its "AI beats doctors" moment. radiologists are still here. what changed first was the workflow, not the specialty. er is probably next.
I don't think radiology has had that moment at all. Computer programming is much closer, if not, at that moment right now.
It is easy to overinterpret this based on the headline; the doctors were actually at a slight disadvantage. This isn't how they normally work; this is a little more like a med school pop quiz.
"I don't know, let's run more tests" is also a very important ability of doctors that was apparently not tested here. In addition to all the normal methodological problems with overinterpreting results in AI/LLMs/ML/etc. Sadly I do think part of the problem here is cynical (even maniacal) careerist doctors who really shouldn't be working at hospitals. This means that even though I am generally quite anti-LLM, and really don't like the idea of patients interacting with them directly, I am a little optimistic about these being sanity/laziness checkers for health professionals.I think this is more a commentary on how bad ER diagnosis is.
I’ve some family in medicine and it scares me how much they now rely on AI. Some even quote it like the Bible.
I’ve had much better luck with diagnosis of my own family’s issues than with doctors. Usually now, I’m feeding them more information to begin with, so that their 30 minute office visits are not wasted, requiring another expensive follow up appointment.
While I’m sure there can be ways in which such studies are wrong, it’s very obvious that AI can accelerate work in many of these areas where we seek out professional help - doctors, lawyers, etc.
It can speed up some aspects of the work, but please don't trust some LLM with variable output quality more than a professional. If you don't like your current doctor, try another; most are in the business of helping other people.
If you've had a string of issues with your last 10 doctors, though, then the issue is, most probably, you...
My wife is a GP, and easily 1/3 of her patients also have some minor-but-visible mental health issue, a 1-2 on a 10-point scale. It leaves them still functional in society but... often very hard to be around.
That doesn't mean I don't trust your words; there are tons of people with either rare issues or fairly common ones that manifest in a non-standard way (or mixed with some other issue). These folks struggle a lot to find a doctor who doesn't lump them into some general bucket with generic treatment. Such doctors exist, but they're not that common.
It helps both sides tremendously if the patient isn't the above, or an arrogant know-it-all waving ChatGPT in the doctor's face and basically just coming for a prescription after self-diagnosing. Otherwise, help is sometimes proportional only to the situation and legal obligations.
Respectfully, as someone with a family with plenty of medical issues and having experienced plenty of useless doctors, the onus is now on medical professionals to prove their worth. They are a second option and most of their remaining value is in the license to prescribe medication, after being told by laymen what medication is appropriate. They're using the same tools I am and they're worse at evaluating them.
Doctors thinking patients are arrogant is an age old problem.
Would it ever diagnose incorrectly to save more lives? Kinda weird that an AI would decide who dies so others may survive, but I guess whatever.
Not only should an AI misdiagnose to save lives, but a human should too. You walk in with symptoms that are most likely a harmless virus that clears up on its own, but 5% of the time are a deadly bacterial infection. The correct course of action is to test for the 5% case (most often the "wrong" diagnosis), not to send people home because they are most likely fine. Many presentations have a similar low-but-not-zero-risk possibility.
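To put rough numbers on it (all made up for illustration): even at 5%, the expected harm of skipping the test dwarfs the cost of testing everyone.

```python
# Back-of-the-envelope expected-harm comparison. The probability and harm
# weights are invented for illustration; the point is only that the
# rare-but-deadly branch dominates the decision.
p_bacteria = 0.05      # assumed probability of the dangerous diagnosis
harm_missed = 1000     # relative harm of sending a bacterial case home untreated
harm_test = 1          # relative cost/burden of testing one patient

expected_harm_no_test = p_bacteria * harm_missed   # 0.05 * 1000 = 50
expected_harm_test = harm_test                     # 1, assuming the test catches it

print(expected_harm_no_test, expected_harm_test)
# Testing is "unnecessary" 95% of the time, yet still wins by a wide margin.
```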
Now show me the results for triage doctors aided by AI.
Unfortunately, from my understanding, doctors don't necessarily diagnose for accuracy; they often diagnose to limit liability.
They aren't going to take a stab at an uncommon diagnosis even if it occurs to them, if they might get sued if they're wrong.
Edit: I'm not trying to say Doctors deliberately diagnose wrong. Just that if there are two possible diagnoses, one common that matches some of the symptoms and one rare that matches all symptoms, doctors are still much more likely to diagnose the common one. Hoofbeats, horses, zebras, etc
The Guardian needs to raise their bar on what to report and how to give readers full context on the ongoing NFT/AI/"trust me bro"/crypto scam. That context would be that this is a mathematical model of human language, not a medical expert or a replacement for one.
>The Guardian needs to raise their bar on what to report and how to give readers full context
Should they not report on peer-reviewed articles published in Science? Or only report published articles that fit your priors?
Fair enough. But there's a lot of faulty and wrong peer-reviewed research as well. One such paper comes to mind that is probably cited some 7,000+ times in other papers but is itself wrong.
So we can eventually classify AI models as software experts, but not as medical experts. Why is that?
I don't classify them as software experts either. Anyone doing so is probably not an expert themselves.
I treat them like those code-generation command-line tools, like create-react-app and such.
We can't. It's just that everyone and their dog has an interest in selling you that lie because money.
Stochastic parrots can code yes, but that does not make them experts. Don't trust them with your life.
It’s a peer reviewed study in one of the world’s top science journals. It’s not some random person on a podcast.
Humans could not diagnose and treat me correctly. They almost killed me. Curious where I could feed my symptoms and the same data I gave to an ER to an AI to test it.
https://aistudio.google.com/
Chatgpt.com?
I’d love to see a follow-up to that radiologist evaluation, where it failed so miserably at the thing it was supposed to be best at that now there’s a shortage of radiologists.
Not an expert but what I’ve heard is that AI-based radiology analysis has brought down prices so much that there’s been a huge increase in demand, which has led to employee shortages.
Did you hear this in the US or Europe?