6 comments

  • motohagiography 2 hours ago

    > We find strong evidence that accuracy on today's medical benchmarks is not the most significant factor when analyzing real-world patient data, an insight with implications for future medical LLMs.

    I interpreted this as challenging whether answering PubMedQA questions as well as a physician is correlated with recommending successful care paths based on the results (and other outcomes) shown in the sample corpus of medical records.

    The analogy is a joke I used to make about ML where it made for crappy self-driving cars but surprisingly good pedestrian and cyclist hunter-killer robots.

    Really, LLMs aren't expert-system reasoners (yet), and if the medical records all contain the same meta-errors that ultimately kill patients, there's a GIGO problem: the failure mode of AI medical opinions is making the same errors faster and at greater scale. LLMs may be really good at finding how internally consistent an ontology made of language is, where the quality of their results reflects that internal logical consistency.

    There's probably a Pareto distribution of cases where AI is amazing for basic stuff like "see a doctor" and then conspicuously terrible in cases where a human is obviously better.

    • kkielhofner 14 minutes ago

      An often-ignored/forgotten/unknown fact about utilizing LLMs is that you really need to develop your own benchmark for your specific application/use-case. It’s step 1.

      “This model scores higher on MMLU” or some other off-the-shelf benchmark may (likely?) have essentially nothing to do with performance on a given specific use-case, especially when it’s highly specialized.

      They can give you a general idea of a model's capabilities, but if you don't have a benchmark for what you're trying to do, in the end you're flying blind.
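
      Even something minimal goes a long way. A rough sketch of what that can look like (illustrative only; `call_model`, the cases, and the grader stand in for whatever your real workload and scoring actually are):

        # Minimal sketch of a use-case-specific benchmark harness.
        # `call_model` is whatever inference function you actually use; the
        # cases and grader stand in for real labeled examples and scoring.
        from dataclasses import dataclass
        from typing import Callable

        @dataclass
        class Case:
            prompt: str    # input drawn from your real workload
            expected: str  # reference answer agreed on by domain experts

        def exact_match(prediction: str, expected: str) -> bool:
            # Swap in whatever grading fits the task: structured-field
            # comparison, rubric scoring, expert review, etc.
            return prediction.strip().lower() == expected.strip().lower()

        def run_benchmark(call_model: Callable[[str], str],
                          cases: list[Case],
                          grade: Callable[[str, str], bool] = exact_match) -> float:
            correct = sum(grade(call_model(c.prompt), c.expected) for c in cases)
            return correct / len(cases)

        # Compare candidate models on this number, not on MMLU:
        # score = run_benchmark(my_model_fn, my_cases)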

  • nmitchko 3 hours ago

    Interesting that they don't compare to OpenBioLLM. The page 7 charts are quite weak.

    https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B

  • infamouscow an hour ago

    As someone who built an EMR that sold to Epic, I think I can say with some authority that these studies don't suggest this is ready for the real world.

    While tech workers are unregulated, clinicians are highly regulated. Ultimately the clinician takes on the responsibility and risk of relying on these computer systems to treat a patient; tech workers and their employers don't. Clinicians do not take risks with patients because they have to contend with malpractice lawsuits and licensing boards.

    In my experience, anything that is slightly inaccurate permanently reduces a clinician's trust in the system. This matters when it comes time to renew your contracts in one, three, or five years.

    You can train the clinicians on your software and modify your UI to make it clear that a heuristic should only be taken as a suggestion, but that will also result in a support request every time. Those support requests have to be resolved pretty quickly because they're part of the SLA.

    I just can't imagine any hospital renewing a contract when the answer to their support requests is some form of "LLMs hallucinate sometimes." I used to hire engineers from failed companies that built non-deterministic healthcare software.

    • troyastorino an hour ago

      (Co-founder of PicnicHealth here; we trained LLMD)

      Accuracy and deploying in appropriate use cases are key for real-world use. Building guardrails, validation, continuous auditing, etc. is more work than the model training itself.
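
      As a toy example of the guardrail side (not our actual code; the schema and checks here are made up), even simple validation of structured output catches a lot before it ever reaches a workflow:

        # Toy guardrail: validate a model's structured extraction before
        # accepting it. The required fields and rules are illustrative only.
        from datetime import datetime

        REQUIRED_FIELDS = {"medication", "dose", "date"}

        def passes_guardrails(record: dict, source_text: str) -> bool:
            # Reject output that is missing required fields.
            if not REQUIRED_FIELDS.issubset(record):
                return False
            # Reject dates that don't parse; hallucinated values often fail here.
            try:
                datetime.strptime(record["date"], "%Y-%m-%d")
            except (ValueError, TypeError):
                return False
            # Require the extracted medication to actually appear in the note.
            return record["medication"].lower() in source_text.lower()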

      We don't deploy in EHRs or sell to physicians or health systems. That is a very challenging environment, and I agree that it would be very difficult to appropriately deploy LLMs that way today. I know Epic is working on it, and they say it's live in some places, but I don't know if that's true.

      Our main production use case for LLMD at PicnicHealth is to improve and replace human clinical abstraction internally. We've done extensive testing (only alluded to in the paper) comparing and calibrating LLMD performance against trained human annotators, and for many structuring tasks LLMD outperforms them. For the production abstraction tasks where LLMD does not outperform humans (or where regulations require human review), we use LLMD to improve the workflow of our human annotators. It is much easier to make sure that clinical abstractors, who are our employees doing well-defined tasks, understand the limitations of LLM performance than it would be to ensure that users in a hospital setting do.
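
      The calibration itself is conceptually simple; something along these lines (simplified sketch, not our actual pipeline, and the field structure is hypothetical):

        # Compare model abstraction to trained human annotators, per field.
        # Each record is a dict of extracted fields for the same document.
        from collections import defaultdict

        def field_agreement(model_outputs: list[dict], human_outputs: list[dict]) -> dict:
            counts = defaultdict(lambda: [0, 0])  # field -> [matches, total]
            for model_rec, human_rec in zip(model_outputs, human_outputs):
                for field, human_val in human_rec.items():
                    counts[field][0] += int(model_rec.get(field) == human_val)
                    counts[field][1] += 1
            return {field: m / t for field, (m, t) in counts.items()}

        # Fields where the model meets or beats the audited human baseline can
        # be automated; everything else stays in the human-review workflow.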

    • briandear an hour ago

      > Clinicians do not take risks with patients

      Some nuance here — they absolutely take risks, but with informed consent.