> And yes, if you want the absolute best, Opus 4.8 exists. It also costs more per 20 minutes of heavy use than I paid for this entire GPU and adapter setup combined. But the gap is shockingly small.
I don't think this is a fair characterization of the situation. I use frontier models via API pre-paid tokens every single day, and I can barely rack up $100 per month. The fact that we figured out how to burn double this in 20 minutes is impressive, but I don't think it reflects the reality that many are experiencing right now. There are some exceptionally gluttonous approaches to harnessing LLMs that I think are serving as convenient straw men in these discussions.
Paying for the API will almost always be more economical than self-hosting equivalent infrastructure. I am not against self-hosting, but the article suggests a primarily economic motivation for this effort. If you are consuming fewer than 10^9 tokens per month, I really don't think it's worth your time to try and compete with the hyperscalars. Most of the money is to be found in the integration of this technology with existing businesses.
show comments
Teknomadix
Tesla V100 SXM2 16GB is NOT DGX class as the author writes. It's HGX class. The V100 comes in two classes, SXM2 and SXM4, the latter coming with a Max of 80gb on board memory. Typically these are installed 8×A100 80GB SXM4 on an HGX riser, and what that gives you is NVSwitch fabric and 640GB of pooled HBM2e (on package stacked memory /w ~2 TB/s of memory bandwidth). 2u standard rack footprint too.
mickeyp
Impressive work. But the problem is not the 30 tok/s which is fine for agentic coding and chat.
If you have 100,000 tokens at ~150tok/s per the OP, you're looking at:
You have: 100000 / (150/s)
You want: hms
11 min + 6.6666667 sec
Which is quite a wait indeed.
show comments
mondainx
Great write-up, I've often considered these DC cards for a project and now you've convinced me to pick one up; you describe the price of the unit against what one spends on tokens and that does it for me.
abejfehr
Based on the title I was really hoping to see how this was used for gaming, but they just ran an LLM on it
show comments
omarqureshi
Could probably avoid the crazy fan with a waterblock - I've seen a whole kit, v100 + PCIE adapter + block for £235. Yes, you'll have to pay for pump, radiators and radiator fans, but that should really quieten it down
show comments
matja
The AMD MI250X GPUs are also interesting - 128GB of HBM2E at 3TB/s, sometimes you see them second-hand for under $1k, the catch obviously is that it needs an OAM socket. Never seen an easy way to hook them up to a regular mainboard.
show comments
lucamark
Congrats! Most people won’t want to debug drivers, kernels, ACPI, adapters, and fan headers. But for those who do, the capability-per-pound is absurd.
KnuthIsGod
AI written posts will kill HN.
ewy1
despite gaming being used in the title, it is not mentioned in the article, but i'm curious how this performs.
i've ran some multi vendor frankenstein setups before and sometimes it even works, so i'm curious to hear your experience with it.
whoamii
The real question: did your local LLM write this post?
show comments
viseyth
Volta (and Pascal, which I'm using) should still be supported with driver 580 as long as you don't use the open modules, and you can use up to cuda 12.9 and cudnn 9.10.2. No need to limit yourself to an old kernel.
pogue
But could you game with the GPU? Or is that purely a drivers issue?
jmyeet
Some context:
- In 2017, the v100 was a ~$10,000 GPU. I believe there was a PCI-e version but this is probably so cheap because SXM2 is going to be harder to use;
- A 5090 has 1800GB/s of internal memory bandwidth (compared to 900GB/s in the 9 year old GPU). Of course a 5090 is substantially more expensive;
- A 5090 has ~21k CUDA cores vs ~5k;
- The current $10k NVidia GPU is the RTX 6000 Pro w/ 96GB of VRAM. It has slightly more CUDA cores but it otherwise pretty much just a 5090. This is unsurprising. NVidia uses VRAM for market segmentation.
Consider this: in 5-10 years, the trillions spent on AI data centers will likewise be sold for scrap most likely. That's how short the runway is for OpenAI and Anthropic to recover that investment.
Anyway, I'm kind of impressed the author managed to get this all to work. I don't think it even would've occurred to me that someone had made an SXM2 adapter, particularly because it's not even used anymore. Like props to whoever did that.
show comments
wg0
Wait a few years, everyone will be able to put one at half the price.
gtirloni
> The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.
sigh
axpy906
Wow. V100. That brings back memories. Way to go.
casey2
Some resell group is going to have to make this easier. The shear amount of these cards otherwise heading towards the landfill is staggering. That is if Big Tech don't destroy them to prevent model weights from leaking.
show comments
recursivegirth
> The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.
Had to stop there. Annoying. I can't stand AI use for writing. It makes any otherwise great article feel so disingenuous.
show comments
lelanthran
> The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.
> And yes, if you want the absolute best, Opus 4.8 exists. It also costs more per 20 minutes of heavy use than I paid for this entire GPU and adapter setup combined. But the gap is shockingly small.
I don't think this is a fair characterization of the situation. I use frontier models via API pre-paid tokens every single day, and I can barely rack up $100 per month. The fact that we figured out how to burn double this in 20 minutes is impressive, but I don't think it reflects the reality that many are experiencing right now. There are some exceptionally gluttonous approaches to harnessing LLMs that I think are serving as convenient straw men in these discussions.
Paying for the API will almost always be more economical than self-hosting equivalent infrastructure. I am not against self-hosting, but the article suggests a primarily economic motivation for this effort. If you are consuming fewer than 10^9 tokens per month, I really don't think it's worth your time to try and compete with the hyperscalars. Most of the money is to be found in the integration of this technology with existing businesses.
Tesla V100 SXM2 16GB is NOT DGX class as the author writes. It's HGX class. The V100 comes in two classes, SXM2 and SXM4, the latter coming with a Max of 80gb on board memory. Typically these are installed 8×A100 80GB SXM4 on an HGX riser, and what that gives you is NVSwitch fabric and 640GB of pooled HBM2e (on package stacked memory /w ~2 TB/s of memory bandwidth). 2u standard rack footprint too.
Impressive work. But the problem is not the 30 tok/s which is fine for agentic coding and chat.
It's prefill; slow prefill kills agentic workloads dead.
If you have 100,000 tokens at ~150tok/s per the OP, you're looking at:
Which is quite a wait indeed.Great write-up, I've often considered these DC cards for a project and now you've convinced me to pick one up; you describe the price of the unit against what one spends on tokens and that does it for me.
Based on the title I was really hoping to see how this was used for gaming, but they just ran an LLM on it
Could probably avoid the crazy fan with a waterblock - I've seen a whole kit, v100 + PCIE adapter + block for £235. Yes, you'll have to pay for pump, radiators and radiator fans, but that should really quieten it down
The AMD MI250X GPUs are also interesting - 128GB of HBM2E at 3TB/s, sometimes you see them second-hand for under $1k, the catch obviously is that it needs an OAM socket. Never seen an easy way to hook them up to a regular mainboard.
Congrats! Most people won’t want to debug drivers, kernels, ACPI, adapters, and fan headers. But for those who do, the capability-per-pound is absurd.
AI written posts will kill HN.
despite gaming being used in the title, it is not mentioned in the article, but i'm curious how this performs.
i've ran some multi vendor frankenstein setups before and sometimes it even works, so i'm curious to hear your experience with it.
The real question: did your local LLM write this post?
Volta (and Pascal, which I'm using) should still be supported with driver 580 as long as you don't use the open modules, and you can use up to cuda 12.9 and cudnn 9.10.2. No need to limit yourself to an old kernel.
But could you game with the GPU? Or is that purely a drivers issue?
Some context:
- In 2017, the v100 was a ~$10,000 GPU. I believe there was a PCI-e version but this is probably so cheap because SXM2 is going to be harder to use;
- A 5090 has 1800GB/s of internal memory bandwidth (compared to 900GB/s in the 9 year old GPU). Of course a 5090 is substantially more expensive;
- A 5090 has ~21k CUDA cores vs ~5k;
- The current $10k NVidia GPU is the RTX 6000 Pro w/ 96GB of VRAM. It has slightly more CUDA cores but it otherwise pretty much just a 5090. This is unsurprising. NVidia uses VRAM for market segmentation.
Consider this: in 5-10 years, the trillions spent on AI data centers will likewise be sold for scrap most likely. That's how short the runway is for OpenAI and Anthropic to recover that investment.
Anyway, I'm kind of impressed the author managed to get this all to work. I don't think it even would've occurred to me that someone had made an SXM2 adapter, particularly because it's not even used anymore. Like props to whoever did that.
Wait a few years, everyone will be able to put one at half the price.
> The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.
sigh
Wow. V100. That brings back memories. Way to go.
Some resell group is going to have to make this easier. The shear amount of these cards otherwise heading towards the landfill is staggering. That is if Big Tech don't destroy them to prevent model weights from leaking.
> The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.
Had to stop there. Annoying. I can't stand AI use for writing. It makes any otherwise great article feel so disingenuous.
> The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.
Because humans write exactly like this /s
A little bit of local copium but neat read.
Isn't a rasbpi with 16gb of RAM $300 now?