by pimeys 5 hours ago

I have taken another look at these open models after the fiasco of Fable and GPT 5.6 this weekend and... GLM-5.2 truly is a good workhorse model for daily programming. I consider myself a heavy user of LLMs and a seasoned developer. A typical session for me with GPT is usually over a hundred dollars...

This weekend I programmed a Matrix bot with encryption and a Rust agent with some tools, because I need one and OpenClaw just felt... not what I wanted. Two days later and 20 dollars poorer, I have what I need: a multimodal agent written in Rust that has access to my homelab.

Nothing felt off with GLM. It did what I wanted, was fast, had a decent, not very annoying personality, and was much cheaper than Opus or GPT.

I used it unquantized through Fireworks, but there are multiple other providers too.

gertlabs 3 hours ago | [-12 more]

GLM 5.2 is a great model, but if you only want to use the best model available, it isn't there yet. Every lab releases models that memorize benchmark answers, both intentionally and unintentionally. But we consistently find that models from Chinese labs have a wider gap between public benchmarks and our evaluations, which we designed to be less vulnerable to benchmaxxing.

In multi-agent coding environments, GLM 5.2 is just shy of Opus 4.6 on average. Data at https://gertlabs.com/rankings

But when factoring in performance/cost, GLM 5.2 is the frontier model.
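One simple way to read "performance/cost" is benchmark score divided by price. A minimal sketch of that ranking, assuming made-up scores and costs (the model names and numbers below are placeholders, not figures from gertlabs.com, and the real methodology may weight things differently):

```python
def rank_by_value(models):
    """Sort (name, score, cost_usd) tuples by score per dollar, best first."""
    return sorted(models, key=lambda m: m[1] / m[2], reverse=True)


# Hypothetical numbers for illustration only.
models = [
    ("model_a", 90.0, 10.0),  # 9.0 points per dollar
    ("model_b", 85.0, 2.0),   # 42.5 points per dollar
    ("model_c", 30.0, 1.0),   # 30.0 points per dollar
]

for name, score, cost in rank_by_value(models):
    print(f"{name}: {score / cost:.1f} points per dollar")
```

Under this reading, a model that scores slightly lower but costs far less can still come out on top.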

ronsor 33 minutes ago | [-0 more]

Opus 4.6 is still my preferred model for work, so this is great to hear.

bjourne an hour ago | [-1 more]

Man, there is exactly zero information on your site about how your benchmarks work. Why should one trust your numbers when there is no way to verify them?

gertlabs an hour ago | [-0 more]

Scroll to the bottom for the methodology (sorry, this should be linkable)

skeptic_ai 2 hours ago | [-6 more]

Why is DeepSeek V4 Flash better than Pro in your benchmarks?

gertlabs an hour ago | [-0 more]

It's 100% due to tool use -- Flash adapts much better to our custom harness with tool names that are not identical to what models were likely trained on. DeepSeek V4 Pro performs much worse in that aspect than almost all other recent releases, for whatever reason.

rockwotj an hour ago | [-4 more]

I have also found DeepSeek Flash beating Pro in some of my own internal evals for tasklet.ai. It's really surprising and I don't understand it.

freakynit an hour ago | [-0 more]

Same. It's rare, but I have observed it twice to date.

A blog post I read a few weeks back said that DSV4Flash at xHigh effort beats even the Pro model at xHigh effort.

onoesworkacct an hour ago | [-2 more]

The rumour is that it's trained on Opus, but who knows

rockwotj an hour ago | [-1 more]

Oh, of course, all the DeepSeek and GLM models are. Multiple people have seen GLM self-report that it is Claude, which makes it super obvious.

I think the surprising thing is that I expected Flash to be a pure distillation and strictly worse in quality, but clearly it's more nuanced than that.

kennywinker 5 minutes ago | [-0 more]
jchw 2 hours ago | [-0 more]

After having used GLM 5.2 and Opus 4.8 for enough time, I'm very unconvinced of the benchmark maxxing claims - if anything, GLM 5.2's rather lackluster performance on benchmarks compared to Opus 4.8 paints the opposite picture when compared to the subjective experience.

When I first used Opus 4.8, I threw several different workloads I had at it - I have Claude doing a lot of misc projects whose primary purpose is pretty much just studying what AI agents can do for my own curiosity and no other reason. Opus 4.8 was one of the first models I ever snuck in there that basically ran out of control. No previous Opus or Sonnet model I had used ever did this. Within hours every agent I had running was writing nonsense tool calls that echoed pretend commands that didn't exist, like 10 in a row, and talking about the "tool channel" being dirty. I switched back to Opus 4.7 and assumed Opus 4.8 was legitimately just broken.

I did come back to Opus 4.8 and found that it was, indeed, pretty powerful. But that initial experience has stuck with me on just how narrow a perspective any given test or benchmark is guaranteed to have. LLMs are too broad; it really doesn't matter what you try to do in your benchmark, you will necessarily get a limited view of what the model is capable of and its shortcomings. This will remain true for at least as long as models are susceptible to massive swings in performance based on randomness and minor differences in prompts and other environmental factors.

I'm not saying benchmarks are useless or that your benchmarks are not possibly closer to the truth either. All evidence at least points to the idea that Chinese models perform very well in coding but often have more mixed results on other tasks. I'm just saying that at this point, benchmarks feel like they have limited connection to my actual real experiences. GLM 5.2 actually scored kinda meh on a lot of benchmarks (compared to closed frontier models) but my actual experience using it does not match this.

And I'm definitely not saying GLM 5.2 is better than the frontier LLMs here, just that the race is close. I still prefer GPT 5.5 right now for code review, I think, and Opus clearly has some advantages depending on the task. It's just no longer a given that Opus 4.8 will perform better than GLM 5.2 on any given task, so to me the calculus behind "using the best model available" is getting complex and you might need to get a feel for what models have what strengths to really figure it out.

I do feel like the "use the best model available" mentality is not going to die any time soon, but if it does die, it will be gradual and start soon for programming. Modern LLMs are still not a full superset of what human programmers can do, but still, larger models are definitely starting to hit diminishing returns for tasks at the lower end of complexity, and that is a big deal. It's a weird world where, for some tasks, you can feel kinda confident just throwing Gemma 4 at them and not sweating whether you should use a better model; I've certainly done it for some quick Python scripts or getting an overview of some code I'm unfamiliar with.

Madmallard an hour ago | [-0 more]

Notice the website URL is the same name as the commenter.

Notice he's using "trust me bro" benchmarks.

Can we just remove all the motivated speech on HN? This is just not trustworthy information at all and obviously is incentivized.

Everyone is grinding and marketing; nobody is actually discussing anything for real.

Aditya_Garg 3 hours ago | [-10 more]

I'm really curious about this. Why pay API pricing? According to Claude's usage numbers, I burn thousands of dollars a month at API rates, but I only pay for the $100 subscription.

horsawlarway 3 hours ago | [-7 more]

My increasing frustration with these plans is the harness lock-in.

Anthropic won't even let you run "claude -p [prompt]" any more... They bill it at API rates.

So if you're trying to automate the AI (and seriously, that's the point), the subsidized plans are crippled.
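For context, the kind of automation being described is calling the CLI non-interactively from a script. A minimal sketch, assuming the `claude` CLI is installed, on PATH, and already authenticated; the helper names and the timeout value are my own and not part of any official SDK:

```python
import subprocess


def build_command(prompt: str) -> list[str]:
    """Return the argv for a non-interactive `claude -p` invocation."""
    return ["claude", "-p", prompt]


def run_prompt(prompt: str, timeout_s: float = 300.0) -> str:
    """Run the prompt headlessly and return stdout.

    Raises CalledProcessError on a non-zero exit and TimeoutExpired
    if the call takes longer than timeout_s (an arbitrary default).
    """
    result = subprocess.run(
        build_command(prompt),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_s,
    )
    return result.stdout
```

Whether calls like this draw from subscription limits or API billing is exactly the policy question in this subthread.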

cortesoft 3 hours ago | [-0 more]

They postponed that change. Here is the email they sent out:

> In May, we sent you an email announcing that starting today, the Claude Agent SDK, claude -p, and third-party apps built on the Agent SDK would stop drawing from subscription rate limits and move to a dedicated monthly credit. We're writing to let you know that we’re not making this change today. We’re working to update the plan to better support how users build with Claude subscriptions.

> What this means for you

> Nothing changes for now. Agent SDK, claude -p, and third-party app usage continues to work with your subscription exactly as it did before today, and there's no credit to claim. Your subscription limits are unchanged. When we have an update, we'll share it with advance notice before it takes effect

smcleod 3 hours ago | [-0 more]

They canned the move to make -p commands API-billable.

sroerick 3 hours ago | [-2 more]

I'm using synthetic.new and Neuralwatt with pi, and it's good and also cheap.

computerex 3 hours ago | [-1 more]

I have had a bad experience with Neuralwatt's GLM 5.2. It seems like they may be using a quantized version of the model.

scottcha an hour ago | [-0 more]

Hi, I'm the CTO of Neuralwatt. I would love to hear your feedback on what your experience was. Feel free to email me at scott@neuralwatt.com. Also, for GLM 5.2 we run the FP8 quantization at 1M context, which is a common deployment target.

throwawayffffas 3 hours ago | [-0 more]

Z.ai does not lock you in to any harness.

weird-eye-issue 3 hours ago | [-0 more]

I think they rolled that back

SV_BubbleTime 3 hours ago | [-0 more]

There is a whole iceberg topic on subsidizing.

So your question is really “if they’re giving free usage, why not take advantage of it?”

I do, so I don’t know the reasons not to, other than to experiment.

AussieWog93 2 hours ago | [-0 more]

[dead]

shostack 5 hours ago | [-2 more]

If you're using Matrix, consider Hermes as a harness if you haven't already. Native gateway support. I've been primarily using mine through Element and it has largely been great.

pimeys 4 hours ago | [-0 more]

Oh, interesting. I basically chose Matrix because setting anything up with WhatsApp or Signal was kind of painful, and Telegram doesn't make it easy to use encryption with bots.

I kind of wanted to see if I could make a Matrix agent from scratch in Rust with GLM, and it was surprisingly easy. Just making something for myself, how I want it. Maybe I'll take a look at Hermes later...

Barbing 2 hours ago | [-0 more]

Very interesting—Element X solved a lot of the pains of Element (iOS), could be a good solution!

KaoruAoiShiho 4 hours ago | [-0 more]

Are you sure Fireworks is unquantized? It's not listing precision on OpenRouter like everyone else.

jklmnopqrstuvw an hour ago | [-3 more]

> A typical session for me with GPT is usually over a hundred dollars.

I don't think a $100 session is "typical". I have used GPT for months. The $20/month Plus plan is enough for my daily work.

simple10 12 minutes ago | [-0 more]

I use an observability tool with Claude Code [1] that shows me usage, including prompt and session cost. Even though I use a Max subscription, it's interesting to see what it would cost me if I were using the API directly.

My typical session ranges from $100-$400 - higher end when using workflows with lots of subagents. $100/session is expected when using the API without the subsidized subscription pricing. Most larger orgs have to use API pricing AFAIK.

[1] https://github.com/simple10/agents-observe
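To show how a session reaches three figures at API rates, here is a minimal sketch of the arithmetic. The per-million-token prices and the token counts are hypothetical placeholders, not any provider's actual rates, and the sketch ignores prompt caching discounts:

```python
def session_cost_usd(input_tokens, output_tokens,
                     input_price_per_mtok, output_price_per_mtok):
    """Estimate API cost in USD from token counts and per-million-token prices."""
    total = (input_tokens * input_price_per_mtok
             + output_tokens * output_price_per_mtok)
    return total / 1_000_000


# Hypothetical agentic session: 20M input tokens (context is re-sent on
# every turn) and 1M output tokens, at assumed prices of $5 and $25 per
# million tokens.
cost = session_cost_usd(20_000_000, 1_000_000, 5.0, 25.0)
print(f"${cost:.2f}")
```

Input tokens usually dominate in agent workflows, because each tool call re-sends the growing conversation.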

tjwebbnorfolk 18 minutes ago | [-0 more]

I have the Claude Max plan, and the VS Code Claude dashboard plugin has logged about $4k worth of tokens in the past 2 months. I upgraded because I was using my weekly basic plan tokens in like 5 hours.

Likewise, I don't understand how anyone survives on the basic plans. It's funny seeing these two camps not understanding what the other is doing :)

adamtaylor_13 33 minutes ago | [-0 more]

It's really interesting what "normal" is for folks. I use the $200/month Anthropic subscription and get within a few percent of my limit every week.

I'd blow through the $20/month plan in hours.

dist-epoch 5 hours ago | [-2 more]

$20 on API pricing or on subscription?

pimeys 5 hours ago | [-1 more]

API, pay per token.

Chrisoaks an hour ago | [-0 more]

Why are you not using the subscription plan?

HKCM852 5 hours ago | [-6 more]

Which harness did you use?

pimeys 5 hours ago | [-5 more]

OpenCode and Zed, about 40/60.

noncoml 4 hours ago | [-4 more]

[flagged]

term333 4 hours ago | [-0 more]

Please take comments like this back to Reddit.

sertsa 4 hours ago | [-1 more]
HAL3000 4 hours ago | [-0 more]

Just FYI, this question was a quote from Pulp Fiction. The other commenter (mdre) also replied with a quote, which was the answer to this question in the movie.

mdre 4 hours ago | [-0 more]

[flagged]

dom96 3 hours ago | [-3 more]

Twenty dollars?

How are you comfortable spending that much to write something as simple as a Matrix bot?

Are people doing this kind of thing just super rich, or am I missing something?

ygjb 2 hours ago | [-0 more]

It's pretty simple. There are things that I do because they're fun, like gamedev. I hand-code that and don't use LLM tools, because I like learning and building. I also do a lot of utility coding for my wife's business, most of which I could do in a few hours. It's worth $20 to not spend a few hours doing it. It's a cost-benefit tradeoff. I won't learn much fixing WordPress themes, adding a feature to her web page, or setting up an automation for her, so I don't see the point of doing that by hand.

Same thing for stuff at work. Oh, the tables/schema changed and my queries broke? I could dork around with Spark and Cypher for an hour, or I can tell Claude to update the queries for the new schema. At the rate I am paid, spending on Claude tokens is generally a better use of my resources.

Building a net new solution? Coding tools take a back seat until I get the core logic right, then I let automation handle web page and UI scaffolding.
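The tradeoff described above boils down to comparing the value of the time saved against the token spend. A minimal sketch, where the hourly value is an assumed number for illustration and only the $20 figure comes from this thread:

```python
def worth_delegating(hours_saved, hourly_value_usd, tool_cost_usd):
    """Return True when the time saved is worth more than the tool spend."""
    return hours_saved * hourly_value_usd > tool_cost_usd


# A few hours of chores at an assumed $50/hour versus $20 of tokens.
print(worth_delegating(hours_saved=3, hourly_value_usd=50, tool_cost_usd=20))

# A 15-minute task at the same assumed rate does not clear the bar.
print(worth_delegating(hours_saved=0.25, hourly_value_usd=50, tool_cost_usd=20))
```

This leaves out the part of the comment that is not about money: work you do for fun or to learn is worth doing by hand regardless of the result.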

copperx 29 minutes ago | [-0 more]

$20 is really cheap for the amount of work saved, considering you're in the US.

adamtaylor_13 33 minutes ago | [-0 more]

Is spending $20 considered "super rich"?

playorizaya 3 hours ago | [-0 more]

[flagged]