by gertlabs 4 hours ago

GLM 5.2 is a great model, but if you only want to use the best model available, it isn't there yet. Every lab releases models that memorize benchmark answers, both intentionally and unintentionally. But we consistently find that models from Chinese labs have a wider gap between public benchmarks and our evaluations, which we designed to be less vulnerable to benchmaxxing.

In multi-agent coding environments, GLM 5.2 is just shy of Opus 4.6 on average. Data at https://gertlabs.com/rankings

But when factoring in performance/cost, GLM 5.2 is the frontier model.

jfaat an hour ago | [-4 more]

> but if you only want to use the best model available, it isn't there yet

I'm trying to wrap my head around exactly why so many people seem to want the best model available when it has recently become clear that most halfway decent models can write damn good code for a fraction of the price. And the frontier models get nerfed constantly, so with open weights you can get something slightly less performant but way more stable. It's almost like buying a Ferrari for your daily commute instead of a Toyota or even a Mercedes.

I think there are several factors. Certainly marketing, which is rampant online and which very smart people think they aren't susceptible to, makes us think we need the shiny thing. There's a lot of really odd 'I trust Anthropic/OpenAI more than DeepSeek', which tends to ignore, for starters, that you can choose your provider and still save a ton. I also think there's some amount of addiction and brand loyalty: a Ferrari is one hell of a drive, so you turn your nose up at that sensible Toyota. Oh, the other one I see is 'only Fable can one-shot updating my embedded systems thing from 1975 to Rust', which is great, but let's recognize how niche that is.

And it ends up coming across as people getting SO reliant on the tools so fast. Maybe it's OK to think, read a few lines of code, and work with these agents to convert your thing to Rust or center your div. Even if coding is over, which in some sense it certainly is, don't turn your mind into the WALL-E people yet. I found myself guilty of this so often. It took way more time and effort to do things via prompt, but I wouldn't just open the editor and fix it, because the dopamine hit of the magic the abstraction provided was so strong.

So I'm pretty much done using the 'best' (on benchmarks, if money isn't an object, etc etc) models available. After a year on Sonnet/Opus/GPT5x I'm having way better results with open weights models that don't get lobotomized weekly. I'm finding ways to do the crafting part of building software by focusing on honing my harness and workflow. I'm enjoying changing the oil on my Toyota after a year of almost flying off cliffs in my Ferrari and if I can check my ego it's a purely positive thing.

andai 4 minutes ago | [-0 more]

Yeah, the funniest thing about everyone freaking out about Fable's capabilities recently was that for most of the stuff they were amazed by, you could get roughly the same result from DeepSeek Flash.

I used to be obsessed with finding the best model. Then a while back, when the new best model came out, I tested it on a task. I also tested its little brother (a much smaller model from the same company).

They both completed the task perfectly except the "best" model (the bigger one) cost 5x more and took 3x longer...

ssk42 an hour ago | [-2 more]

What is your favorite harness for the open weights?

jfaat a minute ago | [-0 more]

We built our own and aren't done open sourcing it, but before that I got to a really good place with opencode plus some custom agents. The pi family is good too, although I haven't used it as much. We made an agent to design a spec, one to implement by dispatching subagents, one to validate against the plan, things like that. All of this helps Claude/GPT too, IME. For open models it has helped them stay out of loops (e.g. Kimi's "but WAIT"), and for frontier models it helps them stay on task and not invent bloated patterns.
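A minimal sketch of that spec / implement / validate split, assuming each agent is just a function from prompt text to response text. This is not the harness described above: `run_pipeline`, the `PASS` verdict convention, and `max_rounds` are all hypothetical, and subagent dispatch is collapsed into a single `implement_agent` call. The round cap is one simple way to keep a model from looping forever.

```python
from typing import Callable

# Assumption: an "agent" is any callable that maps a prompt to a response.
Agent = Callable[[str], str]


def run_pipeline(
    task: str,
    spec_agent: Agent,
    implement_agent: Agent,
    validate_agent: Agent,
    max_rounds: int = 3,
) -> dict:
    """Write a spec once, then implement and validate until PASS or the cap."""
    spec = spec_agent(task)
    feedback = ""
    result = ""
    for round_no in range(1, max_rounds + 1):
        # The implementer always sees the same spec plus the latest feedback.
        result = implement_agent(spec + feedback)
        verdict = validate_agent(f"SPEC:\n{spec}\n\nRESULT:\n{result}")
        # Hypothetical convention: the validator starts its reply with PASS.
        if verdict.strip().upper().startswith("PASS"):
            return {"spec": spec, "result": result, "rounds": round_no, "passed": True}
        feedback = f"\n\nValidator feedback from round {round_no}:\n{verdict}"
    # Hard stop: report failure instead of retrying indefinitely.
    return {"spec": spec, "result": result, "rounds": max_rounds, "passed": False}
```

Because the validator only ever compares the result against the fixed spec, a drifting implementer gets pulled back to the plan each round rather than to its own previous output.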

NamlchakKhandro 42 minutes ago | [-0 more]

pi-mono

hedora 13 minutes ago | [-0 more]

In your box plots, Sonnet 4.6 beats everything (even Opus 4.6, the 4.8’s and Fable).

That’s not super surprising to me, but, given the apparent randomness of the stack ranking, is GLM actually worse than any of the Anthropic models? This looks like a 10-way tie to me.

neya an hour ago | [-1 more]

What is the methodology of your benchmark?

On the contrary, I personally think these broader benchmarks are meaningless. I think personalized benchmarks are the way to go. They should answer "How does this model perform for MY use-case?" rather than trying to answer "How does this model perform across all coding environments?"

Case in point: I use Elixir, which is not as popular as Python and is hit or miss with most of the SOTA models at the top of these benchmarks. Meanwhile, the ones in the middle of the benchmarks (like GLM) almost always outperform even the SOTA models from Google / Anthropic on it. However, this is relevant only for my use case, and I wouldn't advocate a model for everyone based on my use case alone.
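A personalized benchmark can be very small. The sketch below is only an illustration: `Case`, `pass_rate`, and `rank_models` are made-up names, and each model is assumed to be a plain function from prompt to output that you would wire up to your provider of choice.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: a "model" is any callable that maps a prompt to its output text.
Model = Callable[[str], str]


@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # True if the output is acceptable for MY use case


def pass_rate(model: Model, cases: list[Case]) -> float:
    """Fraction of cases whose output passes its own check."""
    if not cases:
        raise ValueError("need at least one case")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def rank_models(models: dict[str, Model], cases: list[Case]) -> list[tuple[str, float]]:
    """Return (name, pass_rate) pairs, best first."""
    scores = {name: pass_rate(fn, cases) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In practice each `check` would compile and run the generated Elixir (or whatever you actually ship) against your own tests, which is the part no public leaderboard can do for you.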

gertlabs 15 minutes ago | [-0 more]

We use a rotating pool of ~100 games for the coding parts of the benchmark, and submissions are scored objectively using ratings similar to Elo. Models write code submissions to interact with the environment, and those submissions are then evaluated in large batches against other submissions.
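For readers unfamiliar with Elo-style ratings, here is a minimal sketch of the textbook Elo update. It is not the benchmark's actual scoring code: the K-factor of 32, the 400-point scale, and the function names are standard defaults or illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one game.

    score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def rate_batch(ratings: dict, results: list, k: float = 32.0) -> dict:
    """Apply a batch of (name_a, name_b, score_a) results in order."""
    ratings = dict(ratings)  # leave the caller's dict untouched
    for name_a, name_b, score_a in results:
        ratings[name_a], ratings[name_b] = update_ratings(
            ratings[name_a], ratings[name_b], score_a, k
        )
    return ratings
```

An upset moves ratings more than an expected result does, which is why a rating built from many head-to-head matches is harder to inflate than a fixed answer key.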

We test 11 popular/interesting languages (you can see the Languages chart to filter), but not Elixir -- although other evaluations have found that many LLMs solve more problems when working with Elixir [0]. Why models write code well in some languages over others seems to go beyond pre-training data (Python scores quite low for most models) and we don't fully understand it.

[0] https://elixirforum.com/t/llm-coding-benchmark-by-language/7...

ronsor 2 hours ago | [-2 more]

Opus 4.6 is still my preferred model for work, so this is great to hear.

echelon an hour ago | [-1 more]

I can't wait for open models to take over in all categories.

Sounds like this is the year for coding.

pizzly an hour ago | [-0 more]

It looks possible open models will. I never expected the reason would be political/legal rather than technical.

bjourne 3 hours ago | [-1 more]

Man, there is exactly zero information on your site about how your benchmarks work. Why should one trust your numbers when there is no way to verify them?

gertlabs 2 hours ago | [-0 more]

Scroll to the bottom for the methodology (sorry, this should be linkable)

skeptic_ai 3 hours ago | [-7 more]

Why is DeepSeek V4 Flash better than Pro in your benchmarks?

gertlabs 2 hours ago | [-0 more]

It's 100% due to tool use -- Flash adapts much better to our custom harness with tool names that are not identical to what models were likely trained on. DeepSeek V4 Pro performs much worse in that aspect than almost all other recent releases, for whatever reason.

rockwotj 3 hours ago | [-5 more]

I have also found DeepSeek Flash beating Pro in some of my own internal evals for tasklet.ai. It’s really surprising and I don’t understand it.

freakynit 2 hours ago | [-0 more]

Same. It's rare, but I have observed it twice to date.

Some blog post I read a few weeks back said that DSV4Flash at xHigh effort beats even the Pro model at xHigh effort.

xbmcuser 41 minutes ago | [-0 more]

Maybe they distilled Claude for the Flash version and not for the other, hence the better tool use and programming benchmarks.

onoesworkacct 2 hours ago | [-2 more]

The rumour is that it's trained on Opus, but who knows

rockwotj 2 hours ago | [-1 more]

Oh, of course all the DeepSeek and GLM models are. Multiple people have seen GLM self-report that it is Claude, which makes it super obvious.

I think the surprising thing is that I expected Flash to be a pure distillation and strictly worse in quality, but clearly it’s more nuanced than that.

kennywinker 2 hours ago | [-0 more]

jchw 4 hours ago | [-0 more]

After having used GLM 5.2 and Opus 4.8 for enough time, I'm very unconvinced of the benchmark maxxing claims - if anything, GLM 5.2's rather lackluster performance on benchmarks compared to Opus 4.8 paints the opposite picture when compared to the subjective experience.

When I first used Opus 4.8, I threw several different workloads I had at it. I have Claude doing a lot of misc projects whose primary purpose is pretty much just studying what AI agents can do, for my own curiosity and no other reason. Opus 4.8 was one of the first models I ever snuck in there that basically ran out of control. No previous Opus or Sonnet model I had used ever did this. Within hours every agent I had running was writing nonsense tool calls that echoed pretend commands that didn't exist, like 10 in a row, and talking about the "tool channel" being dirty. I switched back to Opus 4.7 and assumed Opus 4.8 was legitimately just broken.

I did come back to Opus 4.8 and found that it was, indeed, pretty powerful. But that initial experience has stuck with me as a reminder of just how narrow a perspective any given test or benchmark is guaranteed to have. LLMs are too broad: it really doesn't matter what you try to do in your benchmark, you will necessarily get a limited view of what the model is capable of and of its shortcomings. This will remain true for at least as long as models are susceptible to massive swings in performance based on randomness, minor differences in prompts, and other environmental factors.

I'm not saying benchmarks are useless or that your benchmarks are not possibly closer to the truth either. All evidence at least points to the idea that Chinese models perform very well in coding but often have more mixed results on other tasks. I'm just saying that at this point, benchmarks feel like they have limited connection to my actual real experiences. GLM 5.2 actually scored kinda meh on a lot of benchmarks (compared to closed frontier models) but my actual experience using it does not match this.

And I'm definitely not saying GLM 5.2 is better than the frontier LLMs here, just that the race is close. I still prefer GPT 5.5 right now for code review, I think, and Opus clearly has some advantages depending on the task. It's just no longer a given that Opus 4.8 will perform better than GLM 5.2 on any given task, so to me the calculus behind "using the best model available" is getting complex and you might need to get a feel for what models have what strengths to really figure it out.

I do feel like the "use the best model available" mentality is not going to die any time soon, but if it does die, it will be gradual and start soon for programming. Modern LLMs are still not a full superset of what human programmers can do, but still, larger models are definitely starting to hit diminishing returns for tasks at the lower end of complexity, and that is a big deal. It's a weird world where, for some tasks, you can feel kinda confident just throwing Gemma 4 at it and not sweating whether you should use a better model; I've certainly done it for some quick Python scripts or for getting an overview of some code I'm unfamiliar with.

Madmallard 3 hours ago | [-0 more]

Notice the website URL has the same name as the commenter.

Notice he's using "trust me bro" benchmarks.

Can we just remove all the motivated speech on HN? This is just not trustworthy information at all and obviously is incentivized.

Everyone is grinding and marketing; nobody is actually discussing anything for real.