by Aurornis 8 hours ago

I have multiple LLM subscriptions at any given time, plus an array of local models.

When I ask a question outside of my domain of expertise I like to ask all of the LLMs I have access to. I also create separate sessions and ask the same question multiple ways.

It’s revealing to see how many different and contradictory answers I get, most of which are presented confidently.

The last time I ran a medical question through Claude I couldn’t even get consistent answers between sessions.

It’s also scary how easily you can lead each LLM to the answer you have in mind. When I would start asking questions about different options that other LLMs had presented, each session would drift toward that explanation.

marcus_holmes 14 minutes ago

In my day job we tried building a credit assessment tool that used an LLM as the credit assessor.

It did great at first: it generated a report on the assessed business that was incredibly detailed and plausible.

Then I started running tests and getting into the details, and found that if you ran the same report on the same data, it generated completely different, still very plausible, results. I could run the same source data through the assessment process 10 times and get 10 very different results. We had to can the project and go a different route.

LLMs are designed to produce plausible results, not factual results. We can fix this when using them for software dev by using linters and tests (though we've all had the experience where the LLM invents an API endpoint). I would not trust raw LLM output in any situation where that kind of testing and verification capability isn't present.
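The repeat-run check described above is easy to automate. Here is a minimal sketch, assuming the assessment can be reduced to a single numeric score. `assess`, `repeatability`, `is_stable`, and `make_noisy_assessor` are hypothetical names, and the noisy stub only stands in for a real LLM call; it is not the tool from the comment.

```python
import random
import statistics
from typing import Callable, Dict


def repeatability(assess: Callable[[dict], float], data: dict, runs: int = 10) -> Dict:
    """Run the same assessment on the same input several times and summarize the spread."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    scores = [assess(data) for _ in range(runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # stdev needs at least two samples
        "stdev": statistics.stdev(scores) if runs > 1 else 0.0,
        "range": max(scores) - min(scores),
    }


def is_stable(report: Dict, max_range: float) -> bool:
    """Treat the assessor as usable only if repeated runs stay within max_range."""
    return report["range"] <= max_range


def make_noisy_assessor(seed: int, noise: float) -> Callable[[dict], float]:
    """Hypothetical stand-in for an LLM call: a base score plus random jitter."""
    rng = random.Random(seed)

    def assess(data: dict) -> float:
        return data["base_score"] + rng.uniform(-noise, noise)

    return assess


if __name__ == "__main__":
    report = repeatability(make_noisy_assessor(seed=1, noise=30.0), {"base_score": 70.0})
    print(report["range"], is_stable(report, max_range=5.0))
```

The acceptable `max_range` is a business decision, not something the code can pick. For free-text reports you would first need some way to extract a comparable score or label, which is its own source of error.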

Esophagus4 6 hours ago

Have you ever let the LLMs “discuss” with each other to see if that would give better answers?

You might end up with the answer from the most persuasive LLM, but you might also end up with better results.

Wonder if there is a paper out there on this.

scheme271 6 hours ago

The problem is how do you know whether the answer is just the most persuasive or actually the most accurate one? It's hard to figure this out without domain knowledge.

Esophagus4 14 minutes ago

I dunno, I could see it working.

I do something similar with reviewing code: I have one agent write the code and another reviews it, then they go back and forth for a bit improving the code. Seems to yield better results than one agent alone.

Seems like a similar principle.
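For anyone curious what that back-and-forth looks like in code, here is a rough sketch. `write` and `review` are hypothetical callables standing in for the two agents (no real agent framework or API is assumed), and the stubs at the bottom exist only to make the loop runnable.

```python
from typing import Callable, Optional, Tuple

# write(task, previous_draft, feedback) -> new draft
Writer = Callable[[str, Optional[str], Optional[str]], str]
# review(task, draft) -> feedback text, or None to approve
Reviewer = Callable[[str, str], Optional[str]]


def write_review_loop(write: Writer, review: Reviewer, task: str,
                      max_rounds: int = 3) -> Tuple[str, int]:
    """Alternate between a writer and a reviewer until approval or max_rounds."""
    draft = write(task, None, None)
    for round_no in range(1, max_rounds + 1):
        feedback = review(task, draft)
        if feedback is None:
            return draft, round_no  # reviewer approved this draft
        draft = write(task, draft, feedback)
    return draft, max_rounds  # gave up without approval


# Hypothetical stubs in place of real agents.
def stub_writer(task: str, previous: Optional[str], feedback: Optional[str]) -> str:
    if previous is None:
        return "def add(a, b): return a + b"
    return previous + "\nassert add(1, 2) == 3"


def stub_reviewer(task: str, draft: str) -> Optional[str]:
    return None if "assert" in draft else "Please add a test."


if __name__ == "__main__":
    final_draft, rounds = write_review_loop(stub_writer, stub_reviewer, "write add()")
    print(rounds)
    print(final_draft)
```

The round cap matters: without it, two agents can keep trading revisions indefinitely. Note that reviewer approval only means the reviewer stopped objecting, which is the persuasive-versus-accurate problem raised above.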

XorNot 4 hours ago

Worse, LLMs are trained to be persuasive by default. The "you're absolutely right..." stereotype exists because these things are A/B tested on response quality, and we know from studies that people reliably rate vibes better than anything else. For example, while the quality of hospital accommodations likely has some impact on patient outcomes, the view and decor of the room certainly do not fundamentally change the quality of the care provided, yet they are the largest determinant of how well people rate that care.

cadamsdotcom 5 hours ago

The problem with trying to write a paper is that the results depend on RNG.

NonHyloMorph 5 hours ago

That doesn't make it different from any other problem where you measure statistical significance, averaged over a big enough series of comparisons, no?
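One way to make that concrete, as a sketch only: treat each question as a paired comparison between two setups (say, a single model versus a multi-model discussion), count which setup gave the better answer, drop ties, and run an exact two-sided sign test. `sign_test_p` is a hypothetical helper name, and judging which answer is "better" is assumed to happen elsewhere.

```python
from math import comb


def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test p-value for paired comparisons (ties excluded).

    Null hypothesis: each setup is equally likely to win any given comparison.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no evidence either way
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under a fair coin
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(sign_test_p(10, 0))
    print(sign_test_p(5, 5))
```

Randomness in individual runs then shows up as noise in the win counts rather than invalidating the comparison; the cost is needing enough questions for the test to have any power.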