HN via remix.js for vilnius.js

by bArray 5 hours ago

Apparently GLM 5.2 is 753B parameters [1], what kind of hardware are people using to run this locally?

[1] https://huggingface.co/zai-org/GLM-5.2

I ran it on my laptop, which is a Lenovo Legion 5i (think 32 GB RAM, 4060 w/ 8 GB VRAM, you get the picture). It was a quantized model (otherwise it would not fit on my NVMe 1TB drive) at 4 bits per weight - UD_Q4_K_XL. It ran at about 12 seconds per token (not tokens per second). A fun project, but not worth it. I used 4096 tokens of context cache, and I ran it with llama.cpp - as it supports memory mapping. Because the whole thing could obviously not fit in RAM, I was curious how much it would need to stream from SSD. The answer? For a simple 4 sentence description of who it was, about 1.5 TiB was streamed from disk.
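A rough sizing sketch for the numbers in this comment. It assumes every one of the 753B parameters is stored at exactly the stated width; real quant formats such as UD_Q4_K_XL mix precisions per tensor, so actual file sizes will differ. The 40 GB figure is just the 32 GB RAM plus 8 GB VRAM mentioned above.

```python
PARAMS = 753e9          # parameter count quoted in the thread
LAPTOP_MEMORY_GB = 40   # 32 GB RAM + 8 GB VRAM, from the comment above

def weights_size_gb(params, bits_per_weight):
    """Size of the weights alone in decimal GB, ignoring metadata and KV cache."""
    return params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    size = weights_size_gb(PARAMS, bits)
    ratio = size / LAPTOP_MEMORY_GB
    print(f"{bits:>2} bits/weight: {size:7.1f} GB ({ratio:.1f}x laptop RAM+VRAM)")
```

Even at a flat 4 bits per weight the weights come out several times larger than the laptop's combined memory, which is why memory mapping ends up re-reading from the SSD.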

bArray 2 hours ago | [-1 more]

Thank you for sharing. 1.5TB of streamed data at 12 seconds per token on a high end consumer laptop is a pretty high requirement - I can only imagine how much that cost to train. I don't know how running this model could be cost effective for anybody.

Retro_Dev 2 hours ago | [-0 more]

Indeed - definitely not cost effective to run it on this laptop LOL. It makes me wonder how fast we could run the model if we could fit the weights entirely within CPU cache (assuming a whole ton of CPUs with low latency & high speed IO of course).

kccqzy 4 hours ago | [-0 more]

Run quantized versions. https://unsloth.ai/docs/models/glm-5.2

crocowhile 5 hours ago | [-2 more]

follow antirez - https://x.com/antirez/status/2071173841175363905?s=20

nozzlegear 4 hours ago | [-0 more]

https://xcancel.com/antirez/status/2071173841175363905

JamesSwift 5 hours ago | [-0 more]

That's quantized

scosman an hour ago | [-0 more]

short answer: they mostly aren't

A few people are running highly quantized models with limited context windows. It's still impressive, but not the benchmark level intelligence. Very few people could afford a rig for reasonable local performance at a reasonable quant, at full context size.

The antirez example is 2.6bit quant, 32k context, and few tokens per second... on a ~$7000 MacBook M5 (new RAM pricing).

dakolli 5 hours ago | [-44 more]

8 X RTX6000. It will run you around 80-100k to get started with a model at this size with decent tps.

Don't worry though, open source evangelists will tell you that these will be running on your phone in the next 3 years.

For $100k you could run this model 24/7 through OpenRouter with 10 concurrent sessions at 50tps for a decade and have money left over for a vacation. There's no point in investing this type of money in local models unless you have a business where you're already paying for many employees' individual token usage.
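A quick way to sanity-check the decade claim is to work out the break-even price. This sketch counts output tokens only and ignores input tokens, price changes, and electricity; those simplifications are assumptions, not something stated in the comment.

```python
BUDGET_USD = 100_000
SESSIONS = 10
TOKENS_PER_SECOND = 50      # per session, as stated in the comment
YEARS = 10
SECONDS_PER_YEAR = 365 * 24 * 3600

def total_tokens(sessions, tps, years):
    """Tokens generated if every session runs flat out for the whole period."""
    return sessions * tps * years * SECONDS_PER_YEAR

def break_even_usd_per_million(budget, tokens):
    """Highest price per million tokens at which the budget still covers the usage."""
    return budget / (tokens / 1e6)

tokens = total_tokens(SESSIONS, TOKENS_PER_SECOND, YEARS)
price = break_even_usd_per_million(BUDGET_USD, tokens)
print(f"{tokens:,} tokens, break-even at ${price:.3f} per million")
```

Under these assumptions the claim holds only if the hosted price stays below that break-even figure for the whole ten years.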

Aurornis 5 hours ago | [-6 more]

> 8 X RTX6000. It will run you around 80-100k to get started

8 x RTX6000 GPUs cost $100,000 alone. You then need to build a system that can support those GPUs with enough PCIe lanes through a PCIe switch.

It's going to be $120K to $150K to build or buy a system to run this.

cheschire 2 hours ago | [-1 more]

Not to mention the three separate dedicated 15A circuits you would need to have installed in order to run the 3x 2000W power supplies running ideally at no more than 1400W sustained load each. And definitely would need 200A service to the house if you have a family living there with you.

But hey you could save on heating?

InvertedRhodium an hour ago | [-0 more]

That’s a uniquely US issue - in NZ you can get a 100A single phase at 230V nominal without any issue. 23 kW, straight to your door.

A single circuit using 10mm TPS would technically be enough to run what you’re describing. Might be pricey though, I’d probably take the excuse to get 3 phase installed so I could get access to the stock of used 3 phase machinery.
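The power figures in this sub-thread can be checked with simple arithmetic. The sketch assumes 120 V nominal for the US circuits and the commonly cited 80% derating for continuous loads; neither number is stated in the comments.

```python
def circuit_watts(volts, amps):
    """Peak power available from a single-phase circuit."""
    return volts * amps

def continuous_watts(volts, amps, derating=0.8):
    """Sustained power, assuming an 80% continuous-load derating."""
    return circuit_watts(volts, amps) * derating

us_per_circuit = continuous_watts(120, 15)   # assumed 120 V nominal
us_three_circuits = 3 * us_per_circuit
nz_service = circuit_watts(230, 100)         # 100 A at 230 V, from the reply

print(f"US 15 A circuit, sustained: {us_per_circuit:.0f} W")
print(f"Three such circuits:        {us_three_circuits:.0f} W")
print(f"NZ 100 A service, peak:     {nz_service:.0f} W")
```

The per-circuit sustained figure lands close to the 1400 W per power supply quoted above, and the NZ service figure matches the 23 kW in the reply.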

knollimar 3 hours ago | [-0 more]

isn't throwing that into a [insert financial vehicle that gives 99.99999% safe returns] going to destroy that when you factor in electricity costs?

Or even just electricity costs vs token cost

CamperBob2 4 hours ago | [-2 more]

You can run the NVFP4 quant with 8x RTX6000 cards at 50-75 tps output, but not (practically speaking) the OEM FP8 version. You will learn more about PCIe than you ever wanted to know.

The real gangstas are running 16x RTX6000s. Too rich for my blood, and the NVFP4 quant doesn't seem to be that much worse.
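A back-of-the-envelope fit check for the claim above. It assumes 96 GB of VRAM per card and roughly 4.5 effective bits per weight for the 4-bit quant (4-bit values plus a shared 8-bit scale per 16-weight block); both figures are assumptions, not taken from the thread, and KV cache and activations are ignored.

```python
PARAMS = 753e9
VRAM_PER_CARD_GB = 96        # assumed per-card memory
FP8_BITS = 8
FOUR_BIT_EFFECTIVE = 4.5     # assumed: 4-bit values + per-block scale overhead

def weights_gb(params, bits_per_weight):
    """Weight storage in decimal GB."""
    return params * bits_per_weight / 8 / 1e9

def headroom_gb(cards, bits_per_weight):
    """VRAM left for KV cache and activations after loading the weights."""
    return cards * VRAM_PER_CARD_GB - weights_gb(PARAMS, bits_per_weight)

for cards in (8, 16):
    for name, bits in (("FP8", FP8_BITS), ("4-bit", FOUR_BIT_EFFECTIVE)):
        print(f"{cards:>2} cards, {name:>5}: {headroom_gb(cards, bits):8.1f} GB headroom")
```

With these assumptions, FP8 on 8 cards leaves almost nothing for context, which is consistent with "not (practically speaking)", while the 4-bit quant leaves plenty of room.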

Sanzig 3 hours ago | [-1 more]

Anyone done any benchmarks on the NVFP4 quant? Seriously considering pitching an 8 x RTX 6000 Pro box at work to run GLM-5.2 in an air gapped environment.

tiahura 3 hours ago | [-0 more]

Good luck. I’m in the legal field, and even there, selling airgapped is tough.

AussieWog93 an hour ago | [-1 more]

>Don't worry though, open source evangelists will tell you that these will be running on your phone in the next 3 years.

Not sure if you're being sarcastic, but I can run a quantised version of Gemma or Qwen on my 16GB M1 MacBook Pro that beats GPT-4 from 2023 hands-down.

I wouldn't be surprised if, in another 3 years, you'd be able to run something as powerful as Opus 4.5 or GLM-5.2 on standard consumer hardware - say a 32GB/64GB M7 Pro.

I also wouldn't be surprised if, 3 years after that, cheaper hardware and improved model efficiency mean that there's a much smaller gap between what you can run on a consumer CPU (which, with memory prices coming down, could look like a 256GB M9 or M10 Pro) and a $100k GPU cluster.

marcus_holmes an hour ago | [-0 more]

This is clearly where the industry is going, imho. Everyone who is playing with LLMs wants a laptop with enough grunt to run a decent model locally.

We've been sat with basically the same PC specs for ~20 years - our current specs are within an order of magnitude of the ones we could buy back in 2010. This is not really constrained by tech, as we could have much, much, larger machines. It's more because there's no mass demand for much, much, larger machines - if it's big enough to run Office apps or VSCode then you're good to go. The exponential growth we saw in the 90's was driven as much by software demand as it was by hardware development.

I can see the next 10 years produce the same kind of push for larger machines that the 90's did. And we should probably expect the same kind of standards churn as our existing technologies for storage, memory, etc, don't scale up enough and new technologies become worth developing because there's demand for them.

InvertedRhodium 4 hours ago | [-0 more]

Depends how much you value privacy and running uncensored models.

Personally, I’m waiting for hardware to hit the secondary market before I buy something to run unquantized models like GLM. But I have no doubt that I will, at some point.

Ldorigo 2 hours ago | [-1 more]

How do the economics of your statement work out? Clearly inference providers don't have a time to ROI of 10 years on their hardware costs; and that's without even taking ongoing energy costs into account. What's missing here?

ac29 an hour ago | [-0 more]

The inference providers are running batch sizes much larger than 10
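A minimal sketch of why batch size changes the payback time. The $1 per million tokens price is a hypothetical placeholder, and the model assumes full utilization and that per-session speed does not drop as the batch grows; real deployments will not be that clean.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def payback_years(hardware_usd, batch_size, tps_per_session, usd_per_million):
    """Years of full-utilization revenue needed to cover the hardware cost."""
    tokens_per_year = batch_size * tps_per_session * SECONDS_PER_YEAR
    revenue_per_year = tokens_per_year / 1e6 * usd_per_million
    return hardware_usd / revenue_per_year

HYPOTHETICAL_PRICE = 1.0   # USD per million tokens, illustrative only

for batch in (10, 50, 100):
    years = payback_years(100_000, batch, 50, HYPOTHETICAL_PRICE)
    print(f"batch {batch:>3}: {years:.2f} years to pay back")
```

In this simplified model, payback time scales inversely with batch size, so a provider serving ten times as many concurrent requests recovers the same hardware cost in a tenth of the time.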

krackers 5 hours ago | [-3 more]

Would you be better off pooling that money with some hackerspace group and then setting up shared inference infra, so that way you at least get better utilization?

KaoruAoiShiho 4 hours ago | [-1 more]

And before you know it, you invented some openrouter provider from first principles...

janalsncm 3 hours ago | [-0 more]

Right. For example you will need to figure out how to share it and who maintains it.

aetch 3 hours ago | [-0 more]

You can then rent spare capacity out to people on a subscription or token basis... wait

8note 5 hours ago | [-13 more]

you can however, have fun with it.

oil workers buy 100k trucks they don't do much with. why not 100k in a computer?

Ken_At_EM 5 hours ago | [-3 more]

I can't help but ask where this comment came from, you must have some exposure..

CamperBob2 4 hours ago | [-2 more]

It is so easy to spend $100K on a pickup truck these days, it's not even funny.

tiahura 3 hours ago | [-0 more]

A Honda minivan is > 50k.

SV_BubbleTime 3 hours ago | [-0 more]

Factory F350 Platinum is at least 90k sticker.

afavour 4 hours ago | [-1 more]

Because car loans can’t be used to buy computers

ElProlactin 3 hours ago | [-0 more]

And there's your idea. If you could find a way to get people to add another $500/month over 80+ months to an auto loan, dealers would eat that up like filet mignon.

jliptzin 3 hours ago | [-5 more]

Yeah, as far as hobbies go, I feel like this is on the low end. I know people who collect watches and Corvettes; that's way more expensive, and functionally you can't really do anything special with them.

theteapot 3 hours ago | [-4 more]

The difference is watches and corvettes typically appreciate in value, whereas computer hardware typically drops like a rock.

15155 2 hours ago | [-0 more]

> watches

Some, and the market fluctuates a ton.

> corvettes

Only the oldest, most unique model years: nobody is buying (C4-C5-realistically C6) mid-90s or early 2000s Corvettes for more than what they paid for them, and they never will.

randomNumber7 3 hours ago | [-0 more]

Also LLMs are mainly used for work, and if you can spend 6 digits on watches you're likely financially independent.

parineum 2 hours ago | [-0 more]

> The difference is watches and corvettes typically appreciate in value

Both of those things' value drops like a rock as soon as you buy them and, at least for cars, they don't all appreciate. Most don't. Even so, they appreciate at an incredibly slow rate.

I can't speak for watches but I'd be surprised if it wasn't the same situation.

At least the GPUs can create value after you buy them, before they are worthless.

cdelsolar 44 minutes ago | [-0 more]

hmm ok let's build a state of the art from 2021 homelab using 2x Epyc Milan chips + DDR4 RAM and lmk how much it costs...

dakolli 5 hours ago | [-0 more]

Sure, if you want to light money on fire for entertainment, more power to you. There are probably worse ways to light 100k on fire. If I have an extra 100k lying around it's going to my family though.

KetoManx64 4 hours ago | [-9 more]

As an individual I do not need the whole model. I don't need the model to have knowledge of the rain history of Algeria nor how many colors are in the Russian flag. Once they start trimming down the excess and making them field focused they will run just fine on people's individual devices.

JumpCrisscross 4 hours ago | [-8 more]

> I do not need the whole model. I don't need the model to have knowledge of the rain history of Algeria nor how many colors are in the Russian flag

Isn’t the performance gap between quantized and full models indicative that even if you aren’t using it directly, the model knowing the colors in the Russian flag does have something to do with the intelligence you demand?

KetoManx64 4 hours ago | [-2 more]

Do quantized models specifically prune out specific knowledge? I think they just compress things down but they're still in there. You'd most likely need to do that when you're doing the initial model training, but I'm no expert.

JumpCrisscross an hour ago | [-1 more]

> they just compress things down but they're still in there

The compression is almost certainly in part specific knowledge getting fuzzed.

DennisP 22 minutes ago | [-0 more]

Yeah, but it's everything getting fuzzed, including the parts you care about.
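The "everything gets fuzzed" point can be shown with a toy example. This is plain round-to-nearest quantization with a single shared scale over random stand-in weights, which is an assumption made for illustration; real schemes use per-block scales and mixed precision, so the numbers are not representative of any actual quant.

```python
import random

def quantize_dequantize(weights, bits):
    """Round every weight to the nearest of a few evenly spaced levels."""
    levels = 2 ** (bits - 1) - 1                      # e.g. 7 positive levels for 4 bits
    scale = max(abs(w) for w in weights) / levels     # one shared step size
    return [round(w / scale) * scale for w in weights], scale

def error_stats(weights, bits):
    """Return (fraction of weights changed, worst absolute error, step size)."""
    restored, scale = quantize_dequantize(weights, bits)
    errors = [abs(w - r) for w, r in zip(weights, restored)]
    changed = sum(1 for e in errors if e > 0)
    return changed / len(weights), max(errors), scale

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for model weights

for bits in (8, 4, 2):
    changed, worst, scale = error_stats(weights, bits)
    print(f"{bits}-bit: {changed:.0%} changed, worst error {worst:.4f}, step {scale:.4f}")
```

The rounding has no notion of which weights encode which facts: nearly every value moves a little, and the error bound grows as the bit width shrinks.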

kibwen 4 hours ago | [-4 more]

Quantizing is one thing. But in general it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability, otherwise you'd have AGI just from reinforcing your model on memorizing the first 10^50 digits of pi.

Likewise, LLMs do not violate the laws of information theory, and therefore the only way to encode X amount of information in Y amount of bits where X > Y is by performing what is effectively lossy compression, and as X grows larger relative to Y the compression ratio must change to lose ever more information.

Yes, for the sake of making chatbots that are "conversational" in that they can interpret natural language as input and produce code as output you can easily benefit in incidental and unintuitive ways by training it on more natural language text. But for a given fixed parameter size, it's possible to produce a better model for a specific task by selectively not muddying its training set in the first place with things that are likely irrelevant to the task.

JumpCrisscross an hour ago | [-0 more]

> it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability

We don’t understand AI or natural intelligence well enough to make such statements. As for self evidence, cross-domain competence in humans and the rise of generalist models over domain-specific ones (on competence, not cost) seems to pretty directly tank your hypothesis.

coldtea 2 hours ago | [-1 more]

>But in general it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability, otherwise you'd have AGI just from reinforcing your model on memorizing the first 10^50 digits of pi.

It's hardly self-evident, and your counter-example is hardly applicable.

The first 10^50 digits of pi are not the same as having BREADTH of information in the training data, which is the whole point, not just any random "information that is irrelevant to your use case".

Not to mention that the first 10^50 digits of pi compress to quite a small formula, so there's not much information there to begin with from a Shannon/Kolmogorov perspective.

kibwen 2 hours ago | [-0 more]

It is self-evident. Bringing up Kolmogorov complexity is irrelevant, we're talking about rote memorization, but if you can't ignore the given example then replace "digits of pi" with "bits of output from a true random number generator". There's an infinite amount of information that we could shove into a model, and a finite amount of bits with which to store any of that information such that it can be usefully recalled or form useful logical associations.
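The Kolmogorov point in this exchange can be made concrete. The sketch below uses a well-known unbounded spigot algorithm to produce digits of pi from a dozen lines of code, then feeds those digits and an equal number of random digits to zlib; using zlib as a stand-in for "a generic compressor" is an assumption made for illustration.

```python
import secrets
import zlib
from itertools import islice

def pi_digits():
    """Yield decimal digits of pi one at a time (unbounded spigot)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k) + 2 + r * l) // (t * l),
                l + 2,
            )

N = 1000
pi_text = "".join(str(d) for d in islice(pi_digits(), N)).encode()
random_text = "".join(str(secrets.randbelow(10)) for _ in range(N)).encode()

print("pi digits, zlib bytes:    ", len(zlib.compress(pi_text, 9)))
print("random digits, zlib bytes:", len(zlib.compress(random_text, 9)))
```

A generic compressor should find little structure in either stream, yet the pi digits are fully described by the short generator, whereas no comparably short program is expected to reproduce the random digits. That is the distinction between the two examples being argued over.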

tiahura 3 hours ago | [-0 more]

Apparently irrelevant data can help because model weights are entangled.

wonnage 4 hours ago | [-0 more]

Yeah, the neoclouds and hyperscalers are taking massive losses right now, self hosting is basically signing yourself up to do the same. There are philosophical reasons to do so but it’s a terrible economic decision

rekttrader 5 hours ago | [-1 more]

Or you have data covered by HIPAA or GDPR, or PII, or you have to care about others training on your data.

dakolli 5 hours ago | [-0 more]

That too.

dist-epoch 5 hours ago | [-0 more]

> 50tps for a decade

assuming demand doesn't keep on increasing. even Google has trouble having enough capacity apparently.
