> There's something incredibly peaceful about being in the hands of an expert you trust. [...] AI can absolutely shatter that feeling in an uncomfortable way [...] but I don't know if I can fully trust AI either.
This really is key. We know we can't trust the AI, but at the same time we're also more comfortable asking the AI for clarifications or confronting it. Not having a time-bound appointment or paying by the hour helps a lot. But even then, more information doesn't necessarily help!
I once brought my 11-year-old car, a Civic with 150k miles, to multiple garages. I figured I'd play the "second opinion" game to correlate what the garages recommended to decide on what needed to be done...
I got 3 completely unrelated recommendations, including one that I knew was invalid! I felt worse off than when I started!
The solution to uncertain information isn't more information, which the AI can certainly provide, it's better information, and AI cannot currently provide that.
I have multiple LLM subscriptions at any given time, plus an array of local models.
When I ask a question outside of my domain of expertise I like to ask all of the LLMs I have access to. I also create separate sessions and ask the same question multiple ways.
It’s revealing to see how many different and contradictory answers I get, most of which are presented confidently.
The last time I ran a medical question through Claude I couldn’t even get consistent answers between sessions.
It’s also scary how easily you can lead each LLM to the answer you have in mind. When I would start asking questions about different options that other LLMs had presented, each session would drift toward that explanation.
In my day job we tried creating a credit assessor tool using an LLM as the credit assessor.
It did great, generated a report on the assessed business that was incredibly detailed and plausible.
Then I started running tests and getting into the details, and found that if you ran the same report on the same data, it generated completely different, still very plausible, results. I could run the same source data through the assessment process 10 times and get 10 very different results. We had to can the project and go a different route.
LLMs are designed to produce plausible results, not factual results. We can fix this when using them for software dev by using linters and tests (though we've all had the experience where the LLM invents an API endpoint). I would not trust raw LLM output in any situation where that kind of testing and verification capability isn't present.
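For anyone who wants to run the same kind of repeatability check, here is a minimal sketch. It assumes the assessor can be wrapped in a function that returns a single numeric score. The names `repeatability`, `is_stable`, and `assess` are made up for illustration, and this is not the tooling we actually used.

```python
import statistics
from typing import Callable


def repeatability(assess: Callable[[dict], float], data: dict, runs: int = 10) -> dict:
    """Call the same assessor on identical input `runs` times and summarize the spread.

    `assess` is a hypothetical wrapper around whatever produces the score
    (for example, an LLM call that returns a credit rating as a number).
    """
    scores = [assess(data) for _ in range(runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),  # population stdev over these runs
        "range": max(scores) - min(scores),
    }


def is_stable(report: dict, max_range: float) -> bool:
    """Treat the assessor as repeatable only if all runs fall within `max_range`."""
    return report["range"] <= max_range
```

What counts as an acceptable `max_range` is a judgment call for the domain. The point is only that identical input should be tested repeatedly before anyone trusts a single plausible-looking report.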
Have you ever let the LLMs “discuss” with each other to see if that would give better answers?
You might end up with the answer from the most persuasive LLM, but you might also end up with better results.
Wonder if there is a paper out there on this.
The problem is how do you know whether the answer is just the most persuasive or actually the most accurate one? It's hard to figure this out without domain knowledge.
I dunno, I could see it working.
I do something similar with reviewing code: I have one agent write the code and another reviews it, then they go back and forth for a bit improving the code. Seems to yield better results than one agent alone.
Seems like a similar principle.
Worse is that LLMs are trained to be persuasive by default. The "you're absolutely right..." stereotype exists because these things are A/B tested on response quality, and we know from studies that people reliably rate vibes better than anything else. For example, while the quality of hospital accommodations likely has some impact on patient outcomes, the view and decor of the room certainly do not fundamentally change the quality of the care provided, yet they are the largest determinant of how well people rate that care.
The problem with trying to write a paper is the results depend on RNG.
That doesn't make it different from any other problem measured by statistical significance when averaged over a big enough series of comparisons, no?
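As a rough illustration of that point: if you ran many paired comparisons (say, "LLMs debating" vs. a single LLM on the same question) and counted which setup gave the better answer each time, an exact sign test tells you how likely a split that lopsided would be under pure chance. This is only a sketch. The function name `sign_test_p_value` is made up, and it assumes ties are dropped and that someone with domain knowledge judged each comparison.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test.

    Returns the probability of a win/loss split at least this lopsided
    if each comparison were a fair coin flip. Ties must be dropped
    before counting.
    """
    n = wins + losses
    if n == 0:
        return 1.0  # no evidence either way
    k = max(wins, losses)
    # Probability of k or more successes out of n fair coin flips.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double the one-sided tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


print(sign_test_p_value(8, 2))   # 0.109375
print(sign_test_p_value(10, 0))  # 0.001953125
```

So an 8-2 split over ten questions is still quite compatible with noise, which is why the series of comparisons has to be big enough.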
There's a big difference between a _puzzle_ and a _mystery_. In a puzzle, the goal state is known, and as more pieces - data - appears, the goal gets closer. You know how far you are from the goal.
A mystery is worse. With each additional piece of data, the goal gets farther away. Everything is more and more confusing.
(Popularized by Malcolm Gladwell)
Maybe I am missing something but I just find this wrong.
Everything is a puzzle: there is one "Truth" or one diagnosis. You (a smart human) should be able to converge on it by cross-examining your LLMs. By themselves, they have no interest in revealing this, no stakes, which makes them tools only useful at the hands of a capable investigator.
The problem is that the diagnosis might not be known for a while. There are a few conditions and diseases that require an autopsy for a guaranteed diagnosis and are therefore diagnosed based on symptoms in clinical settings.
> You (a smart human) should be able to converge on it by cross-examining your LLMs.
What makes you think this is fundamentally different from cross-examining ELIZA? There is no guarantee that the LLM will help you converge on anything. Indeed actually calling out an LLM on BS tends to eventually produce an "I don't know and can't help you further" answer (as it should).
> There is no guarantee that the LLM will help you converge on anything.
Absolutely. The guarantee does not come from the LLM. The LLM is simply an improved version of Google Search.
The guarantee can only come from a systemic application of epistemic discipline and reasoning, which is very much (smart) human territory.
Put it another way, I could make good decisions with/without LLMs, with some uncertain diagnostics as input. I would have to trawl through 50 papers myself, and it is possible that my decision arrives 5 years too late as a result. LLMs enable trawling and do some of the legwork in connecting the dots, but are ultimately only as capable as the orchestrating human.
The same goes for a human expert. There's no guarantee of convergence and you could eventually end up at "I don't know".
> The solution to uncertain information isn't more information, which the AI can certainly provide, it's better information, and AI cannot currently provide that.
I'd argue that AI _can_ currently provide that, but that it can't do it _reliably_, and that to non-experts it's impossible to differentiate, which makes it all the more dangerous.
Isn't that the case with human "experts"? If you had encounters with doctors, mechanics, etc. you'll know you can get a completely different diagnosis for the same problem which obviously means (in most cases) that the person you thought an expert is wrong.
What is needed are studies that take a cold look at the actual results, because AI seems to be required to be perfect or else it is considered useless. It just needs to be as good as a human for most stuff, but in the long run it will be much better. At least that's what extrapolating current reality shows us.
We have systems around humans that exist to manage expertise gaps, credibility signals, and accountability. This is part of what makes humans as good as they are, along with specialized training and some measure of meritocratic selection. We license and regulate and account and litigate to make a system that responds and improves.
Some of this might be applicable to LLMs, but some isn’t and much of it would be resisted. This is one reason we’re not likely to get “as good as a human” because at some level we’re not optimizing for the outcomes; we’re optimizing for speed, convenience, some participant’s economics, and underlying beliefs.
I've been going through PT for a hypermobility disorder related injury, and I've used an AI to help me figure out "interview questions" to see if a PT knows anything about hypermobility or is willing to learn. I found it helpful in selecting a new PT after the first PT I trusted made things worse by prescribing stretches and no load progression from rest and recovery back to deadlifts.
People put a lot of faith in human “guardrails”, standards, etc. But the same argument could be made that trusting human experts without discernment is as dangerous as trusting AI or Google or whatever other non-human source. It’s always been the case.
To provide a competing point of anecdata: A Gemini diagnosis saved me $3,000 in unnecessary repairs on my Civic.
YouTube has saved me at least that much in appliance repairs... and it doesn't even have an AI. It's amazing how valuable access to information can be.
I would love to hear more about this
Saved me $2000 on a koi pond pump and filtration system
The soothing sound of ChatGPT telling us how right and clever we are…how could it possibly hallucinate, certainly not 5.5
You’ve really honed in on the key issue. This is exactly how keen Hacker News commenters approach this.
These tools can’t reliably fix a 4px misalignment on my icon, better ask them about a medical report… but honestly, I would do the same.
Tbh LLMs pulling data out of medical documents that are in their training set and searchable online is likely a much easier task than fixing some weird CSS alignment issue.
You only got 3 opinions on your car? Why not 50? You could have found a more useful signal by getting more information.
I get it - getting an opinion from a mechanic is time consuming. Not true of AI though.
> There's something incredibly peaceful about being in the hands of an expert you trust
This is the primary business model of enterprise IT and is why companies pay so much for 4 hour disk replacement.