50 comments

  • rainsford 2 hours ago

    I have generally moved from bearish to bullish on the future of current AI technology, but the continued inaccuracy with basic facts, even as the models otherwise improve significantly, still gives me pause.

    As an example, creating recipes with Claude Opus based on flavor profiles and preferences feels magical, right up until the point at which it can't accurately convert between tablespoons and teaspoons. It's like the point in the movie where a character is acting nearly right but something is a bit off, and then it turns out they're a zombie and going to try to eat your brain. This note-taking example feels similar. It nearly works in some pretty impressive ways and then fails at the important details in a way that something able to do the things AI can allegedly do really shouldn't.
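    The conversion it fumbles is fully deterministic (in US customary units, 1 tablespoon = 3 teaspoons, exactly), which is what makes the failure so jarring. A minimal sketch of the arithmetic involved (function names are just illustrative):

    ```python
    # US customary units: 1 tablespoon = 3 teaspoons (exact, by definition)
    TSP_PER_TBSP = 3

    def tbsp_to_tsp(tablespoons: float) -> float:
        """Convert tablespoons to teaspoons exactly."""
        return tablespoons * TSP_PER_TBSP

    def tsp_to_tbsp(teaspoons: float) -> float:
        """Convert teaspoons to tablespoons exactly."""
        return teaspoons / TSP_PER_TBSP
    ```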

    It's these failures that make me more and more convinced that while current generation AI can do some pretty cool things if you manage it right, we're not actually on the right track to achieve real intelligence. The persistence of these incredibly basic failure modes even as models advance makes it fairly obvious that continued advancement isn't going to actually address those problems.

    • Brian_K_White 2 hours ago

      I hate to help provide possible solutions to an entire process I don't approve of, but maybe the fuzzy tools need old-style deterministic tools the same way and for the same reasons we do.

      So instead of an LLM trying to answer a math or reasoning question by finding a statistical match with other similar groups of words it found on 4chan, the All-In podcast, and a terrible recipe for soup written by a terrible cook, it can use a calculator when it needs a calculator answer.
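      Tool calling is roughly this idea in practice: the model emits a structured request, and deterministic code produces the answer. A minimal sketch (the tool-call format here is made up, not any particular vendor's API):

      ```python
      import ast
      import operator

      # Safe, deterministic evaluator for basic arithmetic (no eval()).
      OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
             ast.Mult: operator.mul, ast.Div: operator.truediv}

      def calc(expr: str) -> float:
          """Evaluate an arithmetic expression by walking its syntax tree."""
          def ev(node):
              if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                  return node.value
              if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                  return OPS[type(node.op)](ev(node.left), ev(node.right))
              raise ValueError("unsupported expression")
          return ev(ast.parse(expr, mode="eval").body)

      # Hypothetical structured output from the model, instead of a guessed answer:
      tool_call = {"tool": "calculator", "expression": "3 * 4 + 2"}
      if tool_call["tool"] == "calculator":
          answer = calc(tool_call["expression"])  # computed exactly, not predicted
      ```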

      • stevula 13 minutes ago

        I think that is how the smarter agents do things? Just like Claude/ChatGPT sometimes does a web search they can do other tool calls instead of just making a statistical guess. Of course it doesn’t always make the bright choice between those options though…

        • WalterBright 7 minutes ago

          > it doesn’t always make the bright choice

          I'm available for a small fee.

      • colechristensen an hour ago

        No, they just need to be trained to have adversarial self review "thinking" processes.

        You ask an LLM "What's wrong with your answer?" and you get pretty good results.

        • binary0010 an hour ago

          Or the original output was perfect and the adversarial "rethinking" switches it to an incorrect result.

          • byzantinegene 10 minutes ago

            this seems to happen far more than i would like

    • themafia 2 hours ago

      > we're not actually on the right track to achieve real intelligence.

      Real intelligence means saying "I don't know" when you don't know, or asking for help, or even just refusing to help, with the subtext being that you don't want to appear stupid.

      The models could ostensibly do this when they have low confidence in their own results, but they don't. What I don't know is whether that's because it would be very computationally difficult or because it would harm the reputation of the companies charging a good sum to use them.

      • wagwang 38 minutes ago

        You can just tell the agent to do exactly that

        • tempest_ 7 minutes ago

          I've had various agents backed by various models ignore the shit out of various rules and requests, at varying rates, but they all do it.

          When you point it out "Oh yes, I did do that which is contrary to the rules, request <whatever>.. Anyway..."

        • alterom 14 minutes ago

          >You can just tell the agent to do exactly that

          You can.

          It just won't do it.

      • cmrdporcupine an hour ago

        That's just not how they work, really. They don't know what they don't know and their process requires an output.

        I think they're getting better at it, but it's likely just the number of parameters getting bigger and bigger in the SOTA models more than anything.

        • adastra22 29 minutes ago

          They do know what they don't know. There's a probability distribution for outputs that they are sampling from. That just isn't being used for that purpose.
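          Concretely, the model exposes next-token logits, and the entropy of that distribution is one crude uncertainty signal, though (as the replies note) low token-level entropy is not the same thing as factual confidence. A toy sketch with made-up logits:

          ```python
          import math

          def softmax(logits):
              """Turn raw logits into a probability distribution."""
              m = max(logits)
              exps = [math.exp(x - m) for x in logits]
              total = sum(exps)
              return [e / total for e in exps]

          def entropy(probs):
              """Shannon entropy in nats; higher means more uncertainty."""
              return -sum(p * math.log(p) for p in probs if p > 0)

          # Made-up logits: one sharply peaked ("confident") next-token
          # distribution, one nearly flat ("uncertain") one.
          peaked = softmax([8.0, 1.0, 0.5, 0.2])
          flat = softmax([1.1, 1.0, 0.9, 1.0])
          assert entropy(peaked) < entropy(flat)
          ```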

          • raddan a minute ago

            I’m not clear what you mean by “know.” If you mean “the information is in the model” then I mostly agree, distributional information is represented somewhere. But if you mean that a model can actually access this information in a meaningful and accurate way—say, to state its confidence level—I don’t think that’s true. There is a stochastic process sampling from those distributions, but can the process introspect? That would be a very surprising capability.

          • Isamu 14 minutes ago

            Oh, you mean somewhere it is tracking the statistical likelihood of the output. Yeah I buy that, although I think it just tends towards the most likely output given the context that it is dragging along. I mean it wouldn’t deliberately choose something really statistically unlikely, that’s like a non sequitur.

            • tempest_ 3 minutes ago

              From its point of view, what does it mean "to know"?

              Is it the token (or set of tokens) that are strictly > 50% probable or is it just the highest probability in a set of probabilities?

              While generating bullshit is not ideal for a lot of use cases, you don't want your premier chatbot to say "I don't know" to the general public half the time. The investment in these things requires wide adoption, so they are always going to favour the "guesses".

      • bluefirebrand an hour ago

        My theory is because the people building the models and in charge of directing where they go love the sycophantic yes-man behavior the models display

        They don't like hearing "I don't know"

      • colechristensen an hour ago

        You can TELL the models to do this and they'll follow your prompt.

        "Give me your answer and rate each part of it for certainty by percentage" or similar.

        • mylifeandtimes an hour ago

          could you please tell me how it generates that certainty score?

          • adastra22 29 minutes ago

            Vibes.

          • colechristensen 44 minutes ago

            The whole thing is a statistical model, that's just what it is. No, I cannot in a reasonable way dissect how an LLM works to a satisfactory level to a skeptic.

            • fc417fc802 10 minutes ago

              He's not a skeptic; he's asking you to explicitly state your reasoning, with the expectation that either the readers will learn something or (more likely) you will realize that your thought and speech pattern there was the equivalent of an LLM hallucinating. Yes, you can prompt it as you suggested, and yes, you will generally receive a convincing answer, but it is not doing what you seem to think it is doing, i.e. the generated rating is complete bullshit that the model pulled out of its proverbial ass.

            • skydhash 37 minutes ago

              It's a statistical model of words and sentences, not knowledge. What does the LLM know about having a pebble in your shoe, or drinking a nice cup of coffee?

  • zOneLetter 3 hours ago

    Anecdotally, we use an LLM note-taker at work for meetings. I had to intervene recently because our CIO was VERY angry at our vendor for something they promised to do and never did. He wasn't at the meeting where the "promise" was made. I was. They never promised anything, and the discussion was significantly more nuanced than what the LLM wrote in the detailed summary.

    In other cases, I have seen it miss the mark when the discussion is not very linear. For example, if I am going back and forth with the SOC team about their response to a recent alert/incident. It'll get the gist of it right, but if you're relying on it for accuracy, holy hell does it miss the mark.

    I can see the LLM take great notes for that initial nurse visit when you're at the hospital: summarize your main issue, weight, height, recent changes, etc. I would not trust it when it comes to a detailed and technical back-and-forth with the doctor. I would think for compliance reasons hospitals would not want to alter the records and only go by transcripts, but what do I know...

    • toraway 7 minutes ago

      I recently left my mom a voicemail saying happy Mother’s Day and generic human boilerplate of sorry I missed you, feel free to give me a call back tonight or we can talk tomorrow, either is fine by me whatever works best for you, hope we can talk soon, love you, bye.

      She called me back later that night and we were chatting normally and then she paused and sort of uncertainly was like “So… was there something you were needing to tell me?” And I was completely baffled and was like “Uhhhh I don’t think so…?”

      She then explained the notification she got about my call and apparently the LLM summary of my voicemail converted a message consisting of 75% well-meaning but insignificant interpersonal human filler (like most voicemails) into this stilted, overly formal business-y speak with a somewhat ominous tone. Assigning way too much significance to each of the individual statements in the message about wanting to talk (to say happy Mother’s Day), inquiring about her availability ASAP (to say happy Mother’s Day) etc. Plus grossly exaggerating the information density of the call making it sound like I left this rambling, detailed message about needing to tell her something that was left completely vague, but possibly important and also time critical.

      Added up, it made her a little worried and left me a bit pissed that this was the end result of my wishing her well. Because apparently everything needs a half-baked LLM summary crammed into it now.

    • fc417fc802 16 minutes ago

      > I would think for compliance reasons hospitals would not want to alter the records and only go by transcripts, but what do I know...

      I'm puzzled by this as well. Why not just generate a transcript and be done with it? If it's a particularly long transcript that's being referenced repeatedly for whatever reason let the humans manually mark it up with a side by side summary when and where they feel the need. At least my experience is that usually these sort of interactions don't have a lot of extraneous data that can be casually filtered out to begin with. The details tend to matter quite a lot!

    • Ferret7446 16 minutes ago

      Transcription works pretty well in my experience, and the transcripts should be treated as the ground truth in such cases.

  • Hobadee 3 hours ago

    The AI note taker we use at work records the meeting as well, and each note it takes about the meeting has a timestamp link that takes you directly to that point in the recording so you can check it yourself. While I'm sure a solution like this is more complicated in a HIPAA environment, something like this is critical for things as important as healthcare.

    • TonyAlicea10 2 hours ago

      When designing AI-based user experiences I refer to this as provenance. It’s a vital aspect of trust, reliability, compliance and more. If a software system includes LLM output like this but doesn’t surface the provenance of its output for human evaluation and verification then it’s at best poor user experience, and at worst a dangerous one.

      • autoexec an hour ago

        At the same time, do you really want every conversation you have with your doctor recorded, handed over to third party companies, and stored forever with your medical file? Plus what doctor has time to sit down and re-listen to your visit to check to make sure the AI didn't screw up at some point in the future anyway? If your doctor isn't going to be verifying the accuracy from those recordings who would? Overseas contractors? At what point does it become a larger waste of time and money to babysit an incompetent AI than just not using one in the first place?

        There are some good uses for AI, but I'm not convinced that this (or many other cases where accuracy matters) is one of them.

    • AlienRobot 6 minutes ago

      That doesn't sound like a "note taker," that sounds like an audio sample search engine. You still need to listen to everything if you want accuracy.

    • 2 hours ago
      [deleted]
      • lostmsu 2 hours ago

        You can check the summary immediately after the meeting, that gives some extra confidence that the notes were recognized correctly.

    • alterom 9 minutes ago

      Yeah, what you're saying requires either:

      - some human checking all the notes by listening to the entire meeting recording (takes a lot of time and man-hours)

      - attendees checking notes from memory (prone to error unless they take notes)

      - attendees cross-checking with their own notes (defeats the point of having the AI note taker)

      The reality is that AI usage is not acceptable in any form in any context where accuracy is critical, but good luck getting anyone to acknowledge that.

  • dmix 32 minutes ago

    > They specifically address the AI Scribe program, the Ontario Ministry of Health initiated for physicians, nurse practitioners, and other healthcare professionals across the broader health sector.

    Makes me wonder what quality of software the ministry would push (probably vetted mostly on qualifications like SOC).

    This is apparently the list of approved vendors:

    https://www.supplyontario.ca/vor/software/tender-20123-artif...

  • ceejayoz 3 hours ago

    > 60% of evaluated AI Scribe systems mixed up prescribed drugs in patient notes, auditors say

    Not mentioned, as far as I can see: the comparative human mistake rate.

    Having seen a lot of medical records, 60% sounds about normal lol.

    • BrokenCogs 5 minutes ago

      Outlandish claim; you'd better show some evidence. I've reviewed several medical charts too, and the error rate is much lower than that - typically everything is dictated and transcribed, which are fairly mature and accurate technologies.

    • autoexec an hour ago

      Even if you had the same 60% error rate with humans, the types of errors would be vastly different. Humans might make typos, or forget to include something, or even occasionally misremember some minor detail, but that's very different from the BS an AI just hallucinates out of nowhere. AI makes the kinds of mistakes no human ever would, which means they can either be extremely confusing but easy to catch, or be something no human would even think to question or be looking out for, because it makes no sense why an AI would randomly (and confidently) say something so wrong.

    • thepotatodude 2 hours ago

      60% is insanely high and absolutely not the performance of human mistake rate. What charts are you reading?

    • vor_ 18 minutes ago

      60% is a normal human mistake rate? You can't be serious.

    • Arodex 3 hours ago

      But who is responsible is different.

      (And if you already see 60% error rates in standard, pre-AI note taking, how does that not translate into many deaths and injury? At least one country's health system in the world should have caught that)

      • tredre3 2 hours ago

        > And if you already see 60% error rates in standard, pre-AI note taking, how does that not translate into many deaths and injury?

        Presumably most doctor's visits are a one-problem-one-solution-one-doctor type of thing. Done deal, notes are never read again. That alone would explain why high rates of errors don't result in injuries or death very often.

        Any injury or death caused by poor notes would have to occur when mistakes are made while you're being followed for a serious chronic condition, or when you're handled by a team where effective communication is required.

      • ceejayoz 2 hours ago

        > how does that not translate into many deaths and injury?

        Because most of it is just written down and never looked at again until there’s a lawsuit or something.

      • cyanydeez 3 hours ago

        Yeah, the problem is the health system has no sacrificial goat if the AI note taker provides the wrong detail. The last thing we want is the CTO being responsible!

        • bluefirebrand 2 hours ago

          I'm not convinced the CTO would be held accountable either.

          I do wonder if people would be pushing AI so hard if their organizations were planning to hold them accountable for mistakes the AI made

          I bet if that were the case we'd see a lot slower rollout of AI systems

    • jmward01 2 hours ago

      This is not a popular view - "AI sucks at X, but so do humans" - but I think it is valid, and we should take wins where we can, especially in healthcare. It is pretty clear that initial accuracy issues will become less and less of a problem as these technologies mature. This focus on accuracy now as a "see, it's bad" talking point, though, misses the real danger.

      Medical note takers have an exceptionally high chance of being hijacked for money, and that is the issue we need to bring attention to now. They provide a real-time feed into a trillion-dollar industry. Just roll that around in your head for a second. Insurance companies are going to want to tap that feed in real time so they can squeeze out more money. Drug makers are going to want to tap into that feed so they can abuse the data. Hospitals will want to tap into that feed to wring more out of doctors and boost the number of billable codes for each encounter. Very few entities are looking to tap into that feed to, you guessed it, help the patient.

      I am for these systems (and I have been involved in building them in the past), but the feeding frenzy of business interests that will obviously get involved with them is the thing we should be yelling and screaming about, not short-term accuracy issues.

      • NateEag 17 minutes ago

        > It is pretty clear that initial accuracy issues will become less and less of a problem as these technologies mature.

        What do you base this on?

        As someone who can both see the amazing things genAI can do, and who sees how utterly flawed most genAI output is, it's not obvious to me.

        I'm working with Claude every day, Opus 4.7, and reviewing a steady stream of PRs from coworkers who are all-in, not just using it due to a corporate mandate like me, and I find an unending stream of stupidity and incomprehension from these bots that just astonishes me.

        Claude recently output this to me:

        "I've made those changes in three files:

        - File 1

        - File 2"

        That is a vintage hallucination that could've come right out of GPT 2.0.

      • mcphage an hour ago

        > It is pretty clear that initial accuracy issues will become less and less of a problem as these technologies mature.

        Does it?

  • nothinkjustai 2 hours ago

    People will eventually figure out that LLMs have no capacity for intent and are fundamentally unreliable for tasks such as summarization, note taking, etc.

  • jqpabc123 an hour ago

    And once again, we have an example of how AI is a liability issue waiting to happen.