I have taken another look at these open models after the fiasco of Fable and GPT 5.6 this weekend and... GLM-5.2 truly is a good workhorse model for daily programming. I consider myself a heavy user of LLMs and a seasoned developer. A typical session for me with GPT is usually over a hundred dollars...
This weekend I programmed a Matrix bot with encryption and a Rust agent with some tools. Because I need one and OpenClaw just felt... not what I wanted. Two days later and 20 dollars poorer I have what I need: a multimodal agent written in Rust that has access to my homelab.
Nothing felt off with GLM. It did what I wanted, was fast, had a decent, not very annoying personality and was much cheaper than Opus or GPT.
I used it unquantized through Fireworks, but there are multiple other providers too.
GLM 5.2 is a great model, but if you only want to use the best model available, it isn't there yet. Every lab releases models that memorize benchmark answers, both intentionally and unintentionally. But we consistently find that models from Chinese labs have a wider gap between public benchmarks and our evaluations, which we designed to be less vulnerable to benchmaxxing.
In multi-agent coding environments, GLM 5.2 is just shy of Opus 4.6 on average. Data at https://gertlabs.com/rankings
But when factoring in performance/cost, GLM 5.2 is the frontier model.
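To make the performance/cost framing concrete, here is a rough sketch of how one could rank models by score per dollar. The model names, scores, and costs below are made-up placeholders for illustration, not numbers from the rankings page, and score per dollar is just one assumed way to combine the two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    score: float     # benchmark score, higher is better (hypothetical)
    cost_usd: float  # total spend to run the benchmark (hypothetical)


def score_per_dollar(result: ModelResult) -> float:
    """Return benchmark points earned per dollar spent."""
    if result.cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return result.score / result.cost_usd


def rank_by_value(results: list[ModelResult]) -> list[ModelResult]:
    """Sort models from best to worst score per dollar."""
    return sorted(results, key=score_per_dollar, reverse=True)


# Placeholder inputs for illustration only.
RESULTS = [
    ModelResult("model-a", score=80.0, cost_usd=40.0),
    ModelResult("model-b", score=76.0, cost_usd=8.0),
]

if __name__ == "__main__":
    for r in rank_by_value(RESULTS):
        print(f"{r.name}: {score_per_dollar(r):.2f} points per dollar")
```

With these placeholder inputs, the slightly lower-scoring but much cheaper model ranks first, which is the shape of the argument being made here.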
Opus 4.6 is still my preferred model for work, so this is great to hear.
Man, there is exactly zero information on your site about how your benchmarks work. Why should one trust your numbers when there is no way to verify them?
Scroll to the bottom for the methodology (sorry, this should be linkable)
Why is DeepSeek V4 Flash better than Pro in your benchmarks?
It's 100% due to tool use -- Flash adapts much better to our custom harness with tool names that are not identical to what models were likely trained on. DeepSeek V4 Pro performs much worse in that aspect than almost all other recent releases, for whatever reason.
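A minimal sketch of what a tool-name mismatch can look like in practice, assuming a harness that exposes tools under its own names. The tool names (`fs_fetch`, `proc_run`, `read_file`) are invented for this example and are not taken from any real harness.

```python
from typing import Callable

# Hypothetical harness tools, deliberately named differently from
# the names a model may have seen most often in training.
TOOLS: dict[str, Callable[[str], str]] = {
    "fs_fetch": lambda path: f"<contents of {path}>",
    "proc_run": lambda cmd: f"<output of {cmd}>",
}


def dispatch(tool_name: str, argument: str) -> str:
    """Run a tool call, or return an error the model has to recover from."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        known = ", ".join(sorted(TOOLS))
        return f"error: unknown tool '{tool_name}'. Available tools: {known}"
    return tool(argument)


if __name__ == "__main__":
    # A model that falls back on a familiar name gets an error instead of data.
    print(dispatch("read_file", "notes.txt"))
    # A model that reads the tool list and adapts gets the result.
    print(dispatch("fs_fetch", "notes.txt"))
```

A model that keeps calling the familiar name burns turns on errors, while one that adapts to the listed names proceeds, which is one plausible way a smaller model could come out ahead in a custom harness.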
I have also found DeepSeek Flash beating Pro in some of my own internal evals for tasklet.ai. It's really surprising and I don't understand it.
Same. It's rare, but I have observed it twice to date.
A blog post I read a few weeks back said that DSV4Flash at xHigh effort beats even the Pro model at xHigh effort.
The rumour is that it's trained on Opus, but who knows
Oh, of course all the DeepSeek and GLM models are. Multiple people have seen GLM self-report that it is Claude, which makes it super obvious.
I think the surprising thing is that I expected Flash to be a pure distillation and strictly worse in quality, but clearly it's more nuanced than that.
Claude claims to be DeepSeek under some circumstances:
https://www.reddit.com/r/DeepSeek/comments/1rd5jw7/claude_so...
After having used GLM 5.2 and Opus 4.8 for enough time, I'm very unconvinced of the benchmark maxxing claims - if anything, GLM 5.2's rather lackluster performance on benchmarks compared to Opus 4.8 paints the opposite picture when compared to the subjective experience.
When I first used Opus 4.8, I threw several different workloads I had at it - I have Claude doing a lot of misc projects whose primary purpose is pretty much just studying what AI agents can do for my own curiosity and no other reason. Opus 4.8 was one of the first models I ever snuck in there that basically ran out of control. No previous Opus or Sonnet model I had used ever did this. Within hours, every agent I had running was writing nonsense tool calls that echoed pretend commands that didn't exist, like 10 in a row, and talking about the "tool channel" being dirty. I switched back to Opus 4.7 and assumed Opus 4.8 was legitimately just broken.
I did come back to Opus 4.8 and found that it was, indeed, pretty powerful. But that initial experience has stuck with me on just how narrow a perspective any given test or benchmark is guaranteed to have. LLMs are too broad: it really doesn't matter what you try to do in your benchmark, you will necessarily get a limited view of what the model is capable of and its shortcomings. This will remain true for at least as long as models are susceptible to massive swings in performance based on randomness and minor differences in prompts and other environmental factors.
I'm not saying benchmarks are useless or that your benchmarks are not possibly closer to the truth either. All evidence at least points to the idea that Chinese models perform very well in coding but often have more mixed results on other tasks. I'm just saying that at this point, benchmarks feel like they have limited connection to my actual real experiences. GLM 5.2 actually scored kinda meh on a lot of benchmarks (compared to closed frontier models) but my actual experience using it does not match this.
And I'm definitely not saying GLM 5.2 is better than the frontier LLMs here, just that the race is close. I still prefer GPT 5.5 right now for code review, I think, and Opus clearly has some advantages depending on the task. It's just no longer a given that Opus 4.8 will perform better than GLM 5.2 on any given task, so to me the calculus behind "using the best model available" is getting complex and you might need to get a feel for what models have what strengths to really figure it out.
I do feel like the "use the best model available" mentality is not going to die any time soon, but if it does die, it will be gradual and start soon for programming. Modern LLMs are still not a full superset of what human programmers can do, but still, larger models are definitely starting to hit diminishing returns for tasks at the lower end of complexity, and that is a big deal. It's a weird world where, for some tasks, you can feel kinda confident just throwing Gemma 4 at them and not sweating whether you should use a better model; I've certainly done it for some quick Python scripts or getting an overview of some code I'm unfamiliar with.
Notice the website URL is the same name as the commenter.
Notice he's using "trust me bro" benchmarks.
Can we just remove all the motivated speech on HN? This is just not trustworthy information at all and obviously is incentivized.
Everyone is grinding and marketing; nobody is actually discussing anything for real.
I'm really curious about this. Why pay API pricing? I burn 1000s of dollars a month in API-equivalent usage according to Claude's usage stats, but only pay for the $100 subscription.
My increasing frustration with these plans is the harness lock-in.
Anthropic won't even let you run "claude -p [prompt]" any more... They bill it at API rates.
So if you're trying to automate the AI (and seriously, that's the point) the subsidized plans are crippled.
They postponed that change. Here is the email they sent out:
> In May, we sent you an email announcing that starting today, the Claude Agent SDK, claude -p, and third-party apps built on the Agent SDK would stop drawing from subscription rate limits and move to a dedicated monthly credit. We're writing to let you know that we’re not making this change today. We’re working to update the plan to better support how users build with Claude subscriptions.
> What this means for you
> Nothing changes for now. Agent SDK, claude -p, and third-party app usage continues to work with your subscription exactly as it did before today, and there's no credit to claim. Your subscription limits are unchanged. When we have an update, we'll share it with advance notice before it takes effect.
They canned the move to make -p commands API-billable.
I'm using synthetic.new and Neuralwatt with pi, and it's good and also cheap.
I have had a bad experience with Neuralwatt's GLM 5.2. It seems like they may be using a quantized version of the model.
Hi, I'm the CTO of Neuralwatt, and I would love to hear your feedback on what your experience was. Feel free to email me at scott@neuralwatt.com. Also, for GLM5.2 we run the FP8 quantization at 1M context, which is a common deployment target.
Z.ai does not lock you in to any harness.
I think they rolled that back
There is a whole iceberg topic on subsidizing.
So your question is really “if they’re giving free usage, why not take advantage of it?”
I do, so I don’t know the reasons not to, other than to experiment.
If you're using Matrix, consider Hermes as a harness if you haven't already. Native gateway support. I've been primarily using mine through Element and it has largely been great.
Oh, interesting. I basically chose Matrix because setting anything up with WhatsApp or Signal was kind of painful and Telegram doesn't make it easy to use encryption with bots.
I kind of wanted to see if I could make a Matrix agent from scratch in Rust with GLM, and it was surprisingly easy. Just make something for myself how I want it. Maybe I'll take a look at Hermes later...
Very interesting—Element X solved a lot of the pains of Element (iOS), could be a good solution!
Are you sure Fireworks is unquantized? It's not listing precision on OpenRouter like everyone else.
> A typical session for me with GPT is usually over a hundred dollars.
I don't think a $100 session is "typical". I have used GPT for months. The $20/month Plus plan is enough for my daily work.
I use an observability tool with Claude Code [1] that shows me usage, including prompt and session cost. Even though I use a Max subscription, it's interesting to see what it would cost me if I were using the API directly.
My typical session ranges from $100-$400 - higher end when using workflows with lots of subagents. $100/session is expected when using the API without the subsidized subscription pricing. Most larger orgs have to use API pricing AFAIK.
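For anyone wondering how a session reaches that kind of figure on pay-per-token pricing, here is a back-of-the-envelope sketch. The token counts and per-million-token prices are placeholder assumptions, not any provider's actual rates, and the sketch ignores things like cached-token discounts.

```python
def session_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate the API cost of one session from token counts.

    Prices are in dollars per one million tokens.
    """
    input_cost = (input_tokens / 1_000_000) * input_price_per_mtok
    output_cost = (output_tokens / 1_000_000) * output_price_per_mtok
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical agentic session: lots of context re-read across many turns.
    cost = session_cost_usd(
        input_tokens=20_000_000,
        output_tokens=1_000_000,
        input_price_per_mtok=5.0,    # placeholder price
        output_price_per_mtok=25.0,  # placeholder price
    )
    print(f"Estimated session cost: ${cost:.2f}")
```

The point of the sketch is that input tokens dominate in long agentic sessions, because the same context gets sent again on every turn and by every subagent.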
I have the Claude Max plan and the VS Code Claude dashboard plugin has logged about $4k worth of tokens in the past 2 months. I upgraded because I was using my weekly basic plan tokens in like 5 hours.
Likewise, I don't understand how anyone survives on the basic plans. It's funny seeing these two camps not understanding what the other is doing :)
It's really interesting what "normal" is for folks. I use the $200/month Anthropic subscription and use it within a few percent of my limit every week.
I'd blow through the $20/month plan in hours.
$20 on API pricing or on subscription?
API, pay per token.
Why are you not using the subscription plan?
Which harness did you use?
OpenCode and Zed, about 40/60.
[flagged]
Please take comments like this back to Reddit.
It's an editor: https://zed.dev/
Just FYI, this question was a quote from Pulp Fiction. The other commenter (mdre) also replied with a quote, which was the answer to this question in the movie.
[flagged]
Twenty dollars?
How are you comfortable spending that much to write something as simple as a Matrix bot?
Are people doing this kind of thing just super rich or am I missing something?
It's pretty simple. There are things that I do because it's fun, like gamedev. I hand-code that and don't use LLM tools because I like learning and building. I do lots of utility coding for my wife's business, and most of that is stuff I could do in a few hours. It's worth $20 to not spend a few hours doing it. It's a cost-benefit tradeoff. I won't learn much fixing WordPress themes or adding a feature to her web page, or setting up an automation for her, so I don't see the point of doing that by hand.
Same thing for stuff at work. Oh, the tables/schema changed and my queries broke? I could dork around with Spark and Cypher for an hour, or I can tell Claude to update the queries for the new schema. At the rate I am paid, spending on Claude tokens is generally a better use of my resources.
Building a net new solution? Coding tools take a back seat until I get the core logic right, then I let automation handle web page and UI scaffolding.
$20 is really cheap for the amount of work saved, considering you're in the US.
Is spending $20 considered "super rich"?