by TSiege 10 hours ago

Always worth a share for this scenario. It's not clear if LLMs are capable of doing actual analysis on medical imaging. For details see this article https://futurism.com/artificial-intelligence/frontier-models...

> As detailed in a new, yet-to-be-peer-reviewed paper, a team of researchers at Stanford University found that frontier AI models readily generated “detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided.”

> In other words, the AI models happily came up with answers to questions about a supposedly accompanying image — even if the researchers never even showed it an image.

> As opposed to hallucinations, which involve AI models arbitrarily filling in the gaps within a logical framework, the team coined a new term for the phenomenon: “mirage reasoning.”

> The effect “involves constructing a false epistemic frame, i.e., describing a multi-modal input never provided by the user and basing the rest of the conversation on that, therefore changing the context of the task at hand,” the researchers wrote in their paper.

> The damning findings suggest AI models cheat by diving into the data they were given — and coming up with the rest based on probability, even if it’s almost entirely conjecture.

kierangill 9 hours ago

I work at a telemedicine company. We’ve benchmarked a few frontier LLMs on public medical imaging datasets. One test included high-quality, high-consensus otoscopic images. We didn’t expect the models to do well on something so niche, but what concerned us was how poorly calibrated they were.

I know you can’t trust an LLM’s self-assessed “confidence” in a prediction, but I’ve found that confidence can at least be directionally correct for some tasks. For our benchmarks, however, confidence was poorly correlated with accuracy. What’s worse is that binary classification models (“Do you see $diagnosis in this photo?”) highly influenced the LLM to confidently predict $diagnosis.

I’m concerned for those who use LLMs for diagnostics and get confidently led to the wrong conclusion.
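
For anyone wanting to check calibration on their own benchmark, here is a minimal sketch of expected calibration error (ECE), assuming you have one self-reported confidence in [0, 1] and one right/wrong flag per prediction. The function name and the bin count are illustrative choices, not what our benchmark used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1], one per prediction (assumed input format).
    correct: booleans, True where the prediction matched the reference label.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece
```

A value near 0 means stated confidence tracks accuracy; a model that says 0.9 on everything but is right a quarter of the time scores about 0.65.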

nostrebored 9 hours ago

But the binary classification models can be made ternary easily. RL on congruence plus penalty for misdiagnosis is easy to set up and gives great results.

What I’ve seen be the true bottleneck is people not setting up the structured data. But making a tiny reasoning model with OPSD -> GRPO is totally doable with a bit of money.
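
A rough sketch of what such a reward could look like, under my own assumptions: the third class is an explicit "unsure" answer, "congruence" means agreement with the consensus label, and the penalty value is arbitrary. None of the names below come from a specific RL library, and this is only the reward function, not the training loop.

```python
PRESENT, ABSENT, UNSURE = "present", "absent", "unsure"


def reward(prediction, reference, misdiagnosis_penalty=2.0):
    """Score one ternary prediction against a consensus reference label.

    Hypothetical scheme: +1 for agreeing with the reference, 0 for abstaining,
    and a larger negative reward for a confident wrong answer.
    """
    if prediction not in (PRESENT, ABSENT, UNSURE):
        raise ValueError(f"unknown prediction: {prediction!r}")
    if reference not in (PRESENT, ABSENT):
        raise ValueError(f"unknown reference: {reference!r}")

    if prediction == UNSURE:
        return 0.0  # abstaining is neutral, so it beats guessing wrong
    if prediction == reference:
        return 1.0  # congruent with the consensus label
    return -misdiagnosis_penalty  # confident misdiagnosis costs the most
```

Because a wrong answer costs more than a right one earns, abstaining has the higher expected reward whenever the model's chance of being right is low enough (below 2/3 with these numbers).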

appplication 10 hours ago

It makes a lot of sense if you understand how these models work, but this was a cool read anyway, and studies like this are important for curbing the unfortunate fever dream some folks seem to be collectively having about LLM omnipotence.

seanmcdirmid 10 hours ago

I don’t understand how this is a different result than giving any LLM a task that is not completely grounded. I’ve observed this in coding tasks: if I forget to include a file referred to in the spec, the LLM will just hallucinate a version of it and my results suck. If I give it the file (and really, all the information I claimed it had access to), the task works fine. I fixed this in my pipeline with a prompt that does an extensive grounding analysis to determine whether the assets I’m giving it are complete with respect to the spec (and that the spec is grounded as well, i.e., it doesn’t refer to something that is undefined).

I wonder if the above problem can be fixed similarly? Just ask the LLM to do a conservative grounding analysis before jumping to the main task?
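
My actual check is a prompt, but the idea can be approximated without an LLM. Here is a minimal sketch, assuming the spec mentions assets by file name and that a simple regex over a handful of extensions is good enough to find them; the function name and the extension list are made up for illustration.

```python
import re

# Hypothetical pattern: path-like tokens ending in a few common extensions.
FILE_REF = re.compile(r"\b[\w./-]+\.(?:py|json|yaml|yml|md|txt|csv|png|jpg)\b")


def missing_assets(spec_text, provided_names):
    """Return file names the spec refers to that were not actually supplied."""
    referenced = set(FILE_REF.findall(spec_text))
    return sorted(referenced - set(provided_names))


spec = "Load config.yaml and run src/main.py against data.csv."
gaps = missing_assets(spec, ["config.yaml", "data.csv"])
if gaps:
    # Stop here instead of letting the model invent the missing files.
    print("Spec is not grounded, missing:", gaps)
```

The same gate would apply to the imaging case: refuse to run the main task if the request refers to an image that was never attached.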

pickleRick243 8 hours ago

It's not different. There's a line of research and reasoning where people who don't use LLMs regularly point out issues that have been known (and more or less solved) for more than a year now (which is an eternity in the LLM space).

seanmcdirmid 4 hours ago

Ya, that’s what I guessed. I assume everyone who uses LLMs discovers this on their own eventually if they aren’t made aware of it before it happens.

tracerbulletx 10 hours ago

The only thing that matters is the success rate when they are actually provided an image.

consensus1 9 hours ago

But why should I care? If you demonstrated that a model can perform more accurate diagnoses than a doctor, but also it had this strange behavior when no image was presented, why should that deter me from using the model?

swiftcoder 9 hours ago

Because you don’t have any way of telling whether it actually used the image presented or based its conclusions on a different image it made up.

consensus1 4 hours ago

I don't find that persuasive. This is not the error I worry about. Let's say that hypothetically the model just ignores the input image 1 in 10,000 runs. This really doesn't concern me because the output will be trivially detectable incorrect nonsense that doesn't match the symptoms at all. Such a contingency is easily handled by running the image through multiple models and distilling the output, anyway.
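
As a rough sketch of what I mean by running multiple models and distilling, assuming each model returns a single diagnosis label as a string and that a simple majority vote is an acceptable way to combine them (the threshold and function name are arbitrary):

```python
from collections import Counter


def consensus_diagnosis(predictions, min_agreement=0.6):
    """Return the majority label if enough models agree, else None.

    predictions: one label per model, e.g. ["otitis media", "normal", ...].
    min_agreement: fraction of models that must share the top label.
    """
    if not predictions:
        return None
    label, count = Counter(predictions).most_common(1)[0]
    if count / len(predictions) >= min_agreement:
        return label
    return None  # no consensus, so escalate to a human instead of guessing
```

This only catches a run that ignored the image if the models fail independently; it does nothing when they all share the same blind spot.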

The error I worry about is where the model uses the image and comes to an incorrect but symptom matching diagnosis. But in this hypothetical the model is less likely to do so than a doctor, so the choice is either accept the risk of the model or accept a higher risk from a doctor.

simianwords 9 hours ago

Really? You know you could just ask it.

swiftcoder 6 hours ago

Which would tell you what, exactly? The whole root of the problem is that the model doesn’t “know” either.