I added GLM 5.2 to my security bug hunting benchmark when it came out, and found it to be a good performer, but not the best open model. The benchmark tests whether models can find bugs Mythos found. The best open models in the initial benchmark were DeepSeek V4 Pro and MiMo 2.5 Pro. But it turned out MiMo got lucky: it's performed worse on almost every test I've done since, while DeepSeek has consistently been among the best performers, and its extreme caching performance makes it cheaper than just about anything, including much smaller models.
https://swelljoe.com/post/will-it-mythos/
Also of note, I found giving models access to the open source semgrep as a tool makes some perform worse and none perform better. It's plausible there's a way to wire it up in a harness that presents useful information to the model without the model having to know how to use it. My theory is that semgrep isn't heavily represented in the training data, so you're asking the model to do two things at once: figure out how to use semgrep and find security bugs. Both tasks suffer for the lack of focus, and most small models, and some big models, can't do that well.
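To make the "harness presents the information" idea concrete, here's a minimal sketch, not what my harness actually does. It assumes the harness has already run semgrep itself with JSON output and parsed the report, then boils the results down to a few plain-text hint lines to prepend to the prompt. The function name, the hint wording, and the sample finding are all made up for illustration.

```python
import json


def summarize_semgrep(report: dict, max_findings: int = 20) -> str:
    """Turn a parsed semgrep JSON report into a short plain-text hint block.

    The model never sees semgrep itself, only these lines.
    """
    lines = []
    for result in report.get("results", [])[:max_findings]:
        path = result.get("path", "?")
        line = result.get("start", {}).get("line", "?")
        rule = result.get("check_id", "?")
        message = result.get("extra", {}).get("message", "").strip()
        lines.append(f"{path}:{line} [{rule}] {message}")
    if not lines:
        return "Static analysis hints: none."
    return "Static analysis hints (may be false positives):\n" + "\n".join(lines)


# Hypothetical report with one finding, shaped like semgrep's JSON output.
sample = json.loads(
    '{"results": [{"check_id": "c.buffer.strcpy", "path": "src/parse.c",'
    ' "start": {"line": 42}, "extra": {"message": "strcpy without bounds check"}}]}'
)
print(summarize_semgrep(sample))
```

Whether hints like this actually help is untested; the point is only that the model gets findings to verify instead of a tool to learn.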
Edit: But, also, more testing is ongoing. I suspect GLM 5.2 will also be a consistently strong performer. It seems to excel at most things I've tested on it.
GLM 5.2 and DeepSeek v4 Pro seem to approach security research differently. This benchmark was with GLM 5.1, but the patterns are similar: https://dualuse.dev/posts/deepseek-v4-thinks-different
Overall, I still think GLM 5.2 is the much stronger performer. It's hard to tell the difference between GLM 5.2 and Opus at <120k tokens.
I have found that some models consistently find or miss specific bugs, and which bugs are hard doesn't completely line up across all models, so I believe that. I just completely refactored the security bug-finding harness I've been working on (not checked in yet, testing it currently) to strongly encourage "multi-model, multi-pass" scans and make them easy to orchestrate, with de-dupe and false-positive weeding by a strong model, rather than using one model or doing just one pass over each file. Giving a model a second attempt increases its findings by 20-30%, and giving it a third adds another 10-15%.
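Roughly, the "multi-model, multi-pass" loop with de-dupe looks like the sketch below. This is an illustration under assumptions, not the harness code (which isn't checked in yet): the `Finding` fields, the "same file, same bug kind, within a few lines" de-dupe rule, and the fake scanners are all hypothetical, and the final false-positive review by a strong model is left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    kind: str  # e.g. "overflow"


# A scanner takes (file path, pass number) and returns findings.
Scanner = Callable[[str, int], Iterable[Finding]]


def multi_pass_scan(path: str, scanners: dict[str, Scanner],
                    passes: int = 3, line_slack: int = 3):
    """Run every scanner `passes` times and merge near-duplicate findings.

    Returns a list of (finding, set of scanner names that reported it).
    """
    merged: list[tuple[Finding, set[str]]] = []
    for name, scan in scanners.items():
        for pass_no in range(passes):
            for f in scan(path, pass_no):
                for seen, sources in merged:
                    same_spot = abs(seen.line - f.line) <= line_slack
                    if seen.path == f.path and seen.kind == f.kind and same_spot:
                        sources.add(name)  # duplicate: just record who found it
                        break
                else:
                    merged.append((f, {name}))
    return merged


# Fake scanners standing in for real model calls.
def fake_a(path: str, pass_no: int):
    found = [Finding(path, 10, "overflow")]
    if pass_no >= 1:  # second pass turns up something new
        found.append(Finding(path, 88, "use-after-free"))
    return found


def fake_b(path: str, pass_no: int):
    return [Finding(path, 11, "overflow")]  # same bug, off by one line


results = multi_pass_scan("src/parse.c", {"model_a": fake_a, "model_b": fake_b}, passes=2)
for finding, sources in results:
    print(finding, sorted(sources))
```

Keeping track of which models reported each finding is handy for the weeding step: something only one model found once is a good candidate for a closer look by the strong model.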
I'm inclined to use DeepSeek V4 Pro the most, because it is consistently extremely strong, it's very fast, and it's very cheap, with excellent caching and cheap-as-free cached input tokens (something like 80% of token usage is cached when I'm using it for security scanning). So, my probable "pair" of frontline security researchers will be DeepSeek V4 Pro and Gemma 4 31B self-hosted (another shockingly strong contender, competitive with the best models once you let it loop on the same file a few times). But I won't be surprised if GLM 5.2 turns out better than DeepSeek V4 Pro, though it costs quite a bit more.
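To show why the cache hit rate matters so much for cost, here's a back-of-the-envelope sketch. The 80% figure is the one I see in practice; the two prices are placeholders I made up for the arithmetic, not any provider's real pricing.

```python
def blended_input_price(price_uncached: float, price_cached: float,
                        cached_fraction: float) -> float:
    """Effective price per million input tokens given a cache hit fraction."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return (1 - cached_fraction) * price_uncached + cached_fraction * price_cached


# Placeholder prices (per million input tokens), NOT real pricing.
uncached = 1.00
cached = 0.10

effective = blended_input_price(uncached, cached, cached_fraction=0.80)
print(f"effective input price: {effective:.2f} per million tokens")
```

With those placeholder numbers, 80% cached input pays a bit over a quarter of the list input price, which is how a big model can end up cheaper to run than a much smaller one without good caching.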
We need a benchmark of independent community sourced benchmarks!
…probably already is one
It's not super scientific, but I really like to watch Bijan Bowen's videos on YouTube. I think he's pretty fair about the way he compares them, and it's enough for what I'm doing.
Actually doing something normal but challenging with a model is generally enough for me. I do a quick (an hour or two) project, and see how it holds up. If I'm feeling like it's harder than it should be, I switch to a comparable model I know is good. E.g. I most recently tested Gemini Flash 3.5 for making a web app. It shit the bed...kinda worked, but was ugly and needed several bugfixes right off the bat. I tried the same app in Opus 4.8, which aced it with barely any extra conversation. It looked great (basic but clean, like it was intentional) without any effort.
I like reading benchmarks, but I take them all with a grain of salt. They're just to tell me if the model is worth even trying for my task. I've heavily used self-hosted Qwen 3.6 and Gemma 4 on a bunch of different tasks, and while the benchmarks consistently say Qwen is the better model, I simply don't find that to be the case for anything I do. I think Qwen is tuned for benchmarks, while Google couldn't give two shits about most of the benchmarks; they're just busy making unusually smart tiny models.
I don't know how you'd judge benchmarks beyond "did it test and measure what it says it tests and measures". And I guess there have been instances where the benchmark failed to do that: the models could cheat in some way, and it just tested the model's ability to find the answer key. In the case of my benchmarks, no model has network access (other than Claude models running in Claude Code), and all information from after the bug was discovered has been removed from the repository the model can see.
But, there are benchmarks for so many different kinds of ability, I don't know how to compare them directly against one another. Like, models that do well on terminal and agentic coding benchmarks tend to do well on finding security bugs, but it's not a 1:1 correlation, there are surprises.