by SwellJoe 2 hours ago

I have found that some models consistently find or miss specific bugs, and which bugs are hard don't completely line up across all models, so I believe that. I just refactored the security bug-finding harness I've been working on completely (not checked in yet, testing it currently) to strongly encourage "multi-model, multi-pass" scans and make them easy to orchestrate with de-dupe and weeding false positives with a strong model, rather than one model or doing just one pass over each file. Giving a model a second attempt increases their findings by 20-30%, and giving them a third, adds another 10-15%.

I'm inclined to use DeepSeek V4 Pro the most, because it is consistently extremely strong, it's very fast, it's very cheap and has excellent caching and cheap-as-free cached input tokens (something like 80% of token usage is cached when I'm using it for security scanning). So, my probably "pair" of frontline security researchers will probably be DeepSeek V4 Pro and Gemma 4 31B self-hosted (another shockingly strong contender, competitive with the best models once you let it loop on the same file a couple/few times). But, I won't be surprised if GLM 5.2 turns out better than DeepSeek V4 Pro...it costs quite a bit more.