8 X RTX6000. It will run you around 80-100k to get started with a model of this size at decent tps.
Don't worry though, open source evangelists will tell you that these will be running on your phone in the next 3 years.
For $100k you could run this model 24/7 through OpenRouter with 10 concurrent sessions at 50tps for a decade and have money left over for a vacation. There's no point in investing this kind of money in local models unless you have a business where you're already paying for many employees' individual token usage.
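A rough way to sanity-check that claim is to work out the price the $100k budget implies. The sketch below uses only the figures from the comment (10 sessions, 50 tps, 10 years, $100k) and assumes every session streams output non-stop; the function names are made up for illustration, and input-token billing is ignored.

```python
# Back-of-envelope check of the "$100k for a decade" claim.
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignores leap days


def total_tokens(sessions: int, tps: int, years: int) -> int:
    """Output tokens generated if every session streams non-stop."""
    return sessions * tps * years * SECONDS_PER_YEAR


def breakeven_usd_per_million(budget_usd: float, sessions: int, tps: int, years: int) -> float:
    """Blended price per million tokens at which the budget is exactly used up."""
    return budget_usd / (total_tokens(sessions, tps, years) / 1_000_000)


tokens = total_tokens(sessions=10, tps=50, years=10)
price = breakeven_usd_per_million(100_000, sessions=10, tps=50, years=10)
print(f"{tokens:,} tokens, break-even ~${price:.2f} per million")
```

That comes to roughly 158 billion tokens, so the claim holds only if the hosted price stays below about $0.63 per million output tokens over the whole period.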
> 8 X RTX6000. It will run you around 80-100k to get started
8 x RTX6000 GPUs cost $100,000 alone. You then need to build a system that can support those GPUs with enough PCIe lanes through a PCIe switch.
It's going to be $120K to $150K to build or buy a system to run this.
Not to mention the three separate dedicated 15A circuits you would need to have installed in order to run the 3x 2000W power supplies, ideally at no more than 1400W sustained load each. And you would definitely need 200A service to the house if you have a family living there with you.
But hey you could save on heating?
That’s a uniquely US issue - in NZ you can get a 100A single-phase supply at 230V nominal without any issue. 23 kW, straight to your door.
A single circuit using 10mm TPS would technically be enough to run what you’re describing. Might be pricey though, I’d probably take the excuse to get 3 phase installed so I could get access to the stock of used 3 phase machinery.
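The power figures in this sub-thread can be checked with a few multiplications. This is only a sketch: it assumes 120 V nominal for the US circuits and the common 80% derating for continuous loads, neither of which is stated in the comments, and the helper name is hypothetical.

```python
def continuous_watts(volts: float, amps: float, derate: float = 0.8) -> float:
    """Usable sustained power on a circuit after a continuous-load derating."""
    return volts * amps * derate


# US case: one 15 A circuit, assumed 120 V nominal, assumed 80% derating.
us_per_circuit = continuous_watts(120, 15)

# NZ case from the thread: 100 A single-phase supply at 230 V, no derating applied.
nz_supply = continuous_watts(230, 100, derate=1.0)

# Sustained rig load from the thread: 3 power supplies at 1400 W each.
rig_load = 3 * 1400

print(f"US 15 A circuit, sustained: {us_per_circuit:.0f} W")
print(f"NZ 100 A supply: {nz_supply:.0f} W")
print(f"Rig load: {rig_load} W ({rig_load / nz_supply:.0%} of the NZ supply)")
```

Under those assumptions a 15 A circuit sustains about 1440 W, which lines up with the 1400 W per supply quoted above, and the whole rig uses under a fifth of the 23 kW NZ supply.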
isn't throwing that into a [insert financial vehicle that gives 99.99999% safe returns] going to destroy that when you factor in electricity costs?
Or even just electricity costs vs token cost
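For the electricity side, here is a minimal sketch using the 3 x 1400 W sustained load mentioned earlier in the thread. The $0.15/kWh rate is purely an assumption for illustration (plug in your own tariff), and cooling overhead is ignored.

```python
HOURS_PER_YEAR = 24 * 365  # ignores leap days


def energy_kwh(watts: float, years: float) -> float:
    """Energy used by a constant load running 24/7."""
    return watts / 1000 * HOURS_PER_YEAR * years


def electricity_cost_usd(watts: float, years: float, usd_per_kwh: float) -> float:
    """Energy cost for that load at a flat tariff."""
    return energy_kwh(watts, years) * usd_per_kwh


RIG_WATTS = 3 * 1400        # sustained load quoted in the thread
ASSUMED_USD_PER_KWH = 0.15  # hypothetical tariff, not from the thread

kwh = energy_kwh(RIG_WATTS, years=10)
cost = electricity_cost_usd(RIG_WATTS, years=10, usd_per_kwh=ASSUMED_USD_PER_KWH)
print(f"{kwh:,.0f} kWh over 10 years, about ${cost:,.0f} at the assumed rate")
```

At that assumed tariff, a decade of 24/7 operation is roughly 368,000 kWh, or about $55k on top of the hardware.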
You can run the NVFP4 quant with 8x RTX6000 cards at 50-75 tps output, but not (practically speaking) the OEM FP8 version. You will learn more about PCIe than you ever wanted to know.
The real gangstas are running 16x RTX6000s. Too rich for my blood, and the NVFP4 quant doesn't seem to be that much worse.
Has anyone done any benchmarks on the NVFP4 quant? Seriously considering pitching an 8 x RTX 6000 Pro box at work to run GLM-5.2 in an air-gapped environment.
Good luck. I’m in the legal field, and even there, selling airgapped is tough.
>Don't worry though, open source evangelists will tell you that these will be running on your phone in the next 3 years.
Not sure if you're being sarcastic, but I can run a quantised version of Gemma or Qwen on my 16GB M1 Macbook Pro that beats GPT-4 from 2023 hands-down.
I wouldn't be surprised if, in another 3 years, you'd be able to run something as powerful as Opus 4.5 or GLM-5.2 on standard consumer hardware - say a 32GB/64GB M7 Pro.
I also wouldn't be surprised if, 3 years after that, cheaper hardware and improved model efficiency mean that there's a much smaller gap between what you can run on a consumer CPU (which, with memory prices coming down, could look like a 256GB M9 or M10 Pro) and a $100k GPU cluster.
This is clearly where the industry is going, imho. Everyone who is playing with LLMs wants a laptop with enough grunt to run a decent model locally.
We've been sat with basically the same PC specs for ~20 years - our current specs are within an order of magnitude of the ones we could buy back in 2010. This is not really constrained by tech, as we could have much, much, larger machines. It's more because there's no mass demand for much, much, larger machines - if it's big enough to run Office apps or VSCode then you're good to go. The exponential growth we saw in the 90's was driven as much by software demand as it was by hardware development.
I can see the next 10 years producing the same kind of push for larger machines that the 90's did. And we should probably expect the same kind of standards churn, as our existing technologies for storage, memory, etc. don't scale up enough and new technologies become worth developing because there's demand for them.
Depends how much you value privacy and running uncensored models.
Personally, I’m waiting for hardware to hit the secondary market before I buy something to run unquantized models like GLM. But I have no doubt that I will, at some point.
How do the economics of your statement work out? Clearly inference providers don't have a time to ROI of 10 years on their hardware costs; and that's without even taking ongoing energy costs into account. What's missing here?
The inference providers are running batch sizes much larger than 10
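To make the batch-size point concrete, here is a hedged sketch of how payback time shrinks with concurrency. It assumes the per-session speed stays at 50 tps as the batch grows (optimistic, since real per-session throughput usually drops) and ignores energy and staffing costs. The price is not a real quote: it is simply set so that 10 sessions pay back $100k in 10 years, matching the earlier comment.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignores leap days


def payback_years(hardware_usd: float, usd_per_million_tokens: float,
                  sessions: int, tps_per_session: float) -> float:
    """Years of non-stop serving needed for token revenue to equal hardware cost."""
    tokens_per_year = sessions * tps_per_session * SECONDS_PER_YEAR
    revenue_per_year = tokens_per_year / 1_000_000 * usd_per_million_tokens
    return hardware_usd / revenue_per_year


# Illustrative price: chosen so 10 sessions at 50 tps pay back $100k in 10 years.
PRICE = 100_000 / (10 * 50 * 10 * SECONDS_PER_YEAR / 1_000_000)

for batch in (10, 50, 200):
    years = payback_years(100_000, PRICE, sessions=batch, tps_per_session=50)
    print(f"batch {batch:>3}: payback in {years:.1f} years")
```

Under these assumptions the payback time is inversely proportional to the batch size: 10 years at 10 sessions, 2 years at 50, and half a year at 200.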
Would you be better off pooling that money with some hackerspace group and then setting up shared inference infra, so that way you at least get better utilization?
And before you know it, you've invented an OpenRouter provider from first principles...
Right. For example you will need to figure out how to share it and who maintains it.
You can then rent spare capacity out to people on a subscription or token basis ….wait
you can however, have fun with it.
Oil workers buy $100k trucks they don't do much with. Why not $100k on a computer?
I can't help but ask where this comment came from, you must have some exposure.
It is so easy to spend $100K on a pickup truck these days, it's not even funny.
A Honda minivan is > 50k.
Factory F350 Platinum is at least 90k sticker.
Because car loans can’t be used to buy computers
And there's your idea. If you could find a way to get people to add another $500/month over 80+ months to an auto loan, dealers would eat that up like filet mignon.
Yeah, as far as hobbies go, I feel like this is on the low end. I know people who collect watches and corvettes; that's way more expensive, and functionally you can't really do anything special with them.
The difference is watches and corvettes typically appreciate in value, whereas computer hardware typically drops like a rock.
> watches
Some, and the market fluctuates a ton.
> corvettes
Only the oldest, most unique model years: nobody is buying (C4-C5-realistically C6) mid-90s or early 2000s Corvettes for more than what they paid for them, and they never will.
Also, LLMs are mainly used for work, and if you can spend 6 digits on watches you're likely financially independent.
> The difference is watches and corvettes typically appreciate in value
Both of those things' value drops like a rock as soon as you buy them and, at least for cars, they don't all appreciate. Most don't. Even when they do, they appreciate at an incredibly slow rate.
I can't speak for watches but I'd be surprised if it wasn't the same situation.
At least the GPUs can create value after you buy them, before they are worthless.
hmm ok let's build a state of the art from 2021 homelab using 2x Epyc Milan chips + DDR4 RAM and lmk how much it costs...
Sure, if you want to light money on fire for entertainment, more power to you. There are probably worse ways to light 100k on fire. If I have an extra 100k lying around, it's going to my family though.
As an individual I do not need the whole model. I don't need the model to have knowledge of the rain history of Algeria nor how many colors are in the Russian flag. Once they start trimming down the excess and making them field focused they will run just fine on people's individual devices.
> I do not need the whole model. I don't need the model to have knowledge of the rain history of Algeria nor how many colors are in the Russian flag
Isn’t the performance gap between quantized and full models indicative that even if you aren’t using it directly, the model knowing the colors in the Russian flag does have something to do with the intelligence you demand?
Do quantized models specifically prune out specific knowledge? I think they just compress things down but they're still in there. You'd most likely need to do that pruning during the initial model training, but I'm no expert.
> they just compress things down but they're still in there
The compression is almost certainly in part specific knowledge getting fuzzed.
Yeah, but it's everything getting fuzzed, including the parts you care about.
Quantizing is one thing. But in general it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability, otherwise you'd have AGI just from reinforcing your model on memorizing the first 10^50 digits of pi.
Likewise, LLMs do not violate the laws of information theory, and therefore the only way to encode X amount of information in Y amount of bits where X > Y is by performing what is effectively lossy compression, and as X grows larger relative to Y the compression ratio must change to lose ever more information.
Yes, for the sake of making chatbots that are "conversational" in that they can interpret natural language as input and produce code as output you can easily benefit in incidental and unintuitive ways by training it on more natural language text. But for a given fixed parameter size, it's possible to produce a better model for a specific task by selectively not muddying its training set in the first place with things that are likely irrelevant to the task.
> it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability
We don’t understand AI or natural intelligence well enough to make such statements. As for self-evidence, cross-domain competence in humans and the rise of generalist models over domain-specific ones (on competence, not cost) seem to pretty directly tank your hypothesis.
>But in general it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability, otherwise you'd have AGI just from reinforcing your model on memorizing the first 10^50 digits of pi.
It's hardly self-evident, and your counter-example is hardly applicable.
The first 10^50 digits of pi are not the same as having BREADTH of information in the training data, which is the whole point, not just any random "information that is irrelevant to your use case".
Not to mention that the first 10^50 digits of pi compress to quite a small formula, so there's not much information there to begin with from a Shannon/Kolmogorov perspective.
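To illustrate the compressibility point: a few lines of code can emit the digits of pi one after another, so the digits carry very little information in the Kolmogorov sense. The sketch below uses Gibbons' unbounded spigot algorithm, which is my choice of example and not something named in the thread.

```python
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi one at a time (Gibbons' unbounded spigot)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough precision yet: fold in the next term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )


print(list(islice(pi_digits(), 10)))  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```

The program is a fixed, tiny size no matter how many digits you ask for, which is the sense in which the digits "compress" to a formula. Running it out to 10^50 digits is another matter, of course.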
It is self-evident. Bringing up Kolmogorov complexity is irrelevant, we're talking about rote memorization, but if you can't ignore the given example then replace "digits of pi" with "bits of output from a true random number generator". There's an infinite amount of information that we could shove into a model, and a finite amount of bits with which to store any of that information such that it can be usefully recalled or form useful logical associations.
Apparently irrelevant data can help because model weights are entangled.
Yeah, the neoclouds and hyperscalers are taking massive losses right now; self-hosting is basically signing yourself up to do the same. There are philosophical reasons to do so, but it’s a terrible economic decision.
Or you have data covered by HIPAA or GDPR, or PII, or you have to worry about others training on your data.
That too.
> 50tps for a decade
Assuming demand doesn't keep on increasing. Even Google apparently has trouble getting enough capacity.