by gertlabs 2 hours ago

We've spent some time trying to understand this anomaly, even re-running Sonnet 4.6 through our evaluations to see if that would bring down its scores... and it didn't. I don't know what they did differently, but it's basically Opus 4.6 with more temperature variability (some great responses, some less great, with an approximately frontier median response in agentic work specifically). It is smart, methodical and excellent at tool calling in our custom environments.

We now use Sonnet 4.6 for a number of internal use cases we wouldn't have considered otherwise.

hedora 2 hours ago

That tracks with my experience.

4.7 was so bad, I locked a bunch of my machines to 4.6.

I haven’t bothered locking the 4.8 machines to 4.6. There was an HN thread a while back about someone running SWE-bench a few times a day and measuring success rate and latency. It showed Opus getting significantly dumber in the week before a recent launch.

It wouldn’t surprise me if they’re quantizing to improve margins or to hype models in comparative testing in order to defraud investors at IPO.

Or maybe QA is hard. Anyway, I think they hit a performance wall sometime at or before 4.6.