Have you ever let the LLMs “discuss” with each other to see if that would give better answers?
You might end up with the answer from the most persuasive LLM, but you might also end up with better results.
Wonder if there is a paper out there on this.
The problem is how do you know whether the answer is just the most persuasive or actually the most accurate one? It's hard to figure this out without domain knowledge.
I dunno, I could see it working.
I do something similar with reviewing code: I have one agent write the code and another review it, then they go back and forth for a bit, improving the code. Seems to yield better results than one agent alone.
Seems like a similar principle.
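For anyone who wants to try the write/review loop, here is a minimal sketch of the control flow. The names `review_loop`, `toy_writer`, and `toy_reviewer` are made up for illustration, and the two toy functions are plain Python stand-ins so it runs without any model behind it. In practice each would wrap a call to whatever agent you use, and "empty feedback means approved" is just a convention assumed here.

```python
from typing import Callable


def review_loop(
    write: Callable[[str, str], str],
    review: Callable[[str, str], str],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Alternate writer and reviewer until the reviewer approves or rounds run out."""
    feedback = ""
    code = write(task, feedback)  # first draft, no feedback yet
    for _ in range(max_rounds):
        feedback = review(task, code)
        if feedback == "":  # assumption: empty feedback means approval
            break
        code = write(task, feedback)  # revise using the reviewer's notes
    return code


# Hypothetical stand-ins for the two agents.
def toy_writer(task: str, feedback: str) -> str:
    if "docstring" in feedback:
        return 'def add(a, b):\n    """Return a + b."""\n    return a + b'
    return "def add(a, b):\n    return a + b"


def toy_reviewer(task: str, code: str) -> str:
    return "" if '"""' in code else "add a docstring"


final_code = review_loop(toy_writer, toy_reviewer, "write add(a, b)")
print(final_code)
```

The `max_rounds` cap matters: without it, two agents that never agree would go back and forth forever.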
Worse, LLMs are trained to be persuasive by default. The "you're absolutely right..." stereotype exists because these things are A/B tested on response quality, and we know from studies that people reliably rate vibes better than anything else. For example, while the quality of hospital accommodations likely has some impact on patient outcomes, the view and decor of the room certainly do not fundamentally change the quality of the care provided, yet they are the largest determinant of how well people rate that care.
The problem with trying to write a paper is that the results depend on RNG.
That doesn't make it different from any other problem where you measure statistical significance, averaged over a big enough series of comparisons, no?
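To make the "big enough series of comparisons" point concrete, here is a rough sketch of one way to do it: a paired sign test over many questions, using only the standard library. The accuracies (0.6 for a single model, 0.7 for the debate setup) are invented purely so the simulation has something to compare; they are not measurements of anything. `sign_test_p_value` is a name made up for this example.

```python
import math
import random


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test under the null that a win and a loss are equally likely."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of a result at least this lopsided in one direction.
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_questions = 500
wins = losses = 0
for _ in range(n_questions):
    single_ok = rng.random() < 0.6  # assumed accuracy, illustration only
    debate_ok = rng.random() < 0.7  # assumed accuracy, illustration only
    if debate_ok and not single_ok:
        wins += 1
    elif single_ok and not debate_ok:
        losses += 1
    # Questions where both are right or both are wrong carry no signal.

p = sign_test_p_value(wins, losses)
print(f"debate better on {wins}, worse on {losses}, p = {p:.4g}")
```

Any single question is noisy, but the count of wins versus losses across hundreds of them is what the test looks at, which is the same as any other experiment with a random component.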