Not that this really takes away from the substance of the article, but the first two paragraphs are giving heavy Claude smell. Semicolons, em dashes, "That sequencing matters"... I guess I'm just a little surprised that anyone could be arsed to take on a hardware project like this but can't be arsed to write their own introduction.
arjie
Pass the generated text through some kind of quality prompt. It’s got too much filler right now.
The story is interesting but it’s hard to read because it’s hard to tell which parts are meaningful and which parts are filler.
E.g. “we pulled the card cold - straight from the rig to the workbench”. Okay, but why would going straight from the rig to the workbench make it cold? If anything it would be warm. But it turns out the temperature is meaningful in your story.
cogman10
Ok, how are people powering these things? 2.4 kW is well beyond a standard circuit in the US. Are people having 240V/30A circuits installed? Are they hijacking the dryer plugs? EV charger plugs? Hot tub circuits?
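A rough check of the arithmetic behind that question, as a minimal sketch. The circuit ratings and the 80% continuous-load derating are assumptions for illustration (nominal US residential values and a common rule of thumb), and the 2.4 kW figure is taken as the GPU load alone, with no allowance for CPU, pumps, or PSU losses.

```python
# Rough circuit headroom check. All circuit figures below are assumed
# nominal values for illustration, not measurements of any real install.

CONTINUOUS_DERATE = 0.8  # assumed rule of thumb: continuous load <= 80% of breaker rating


def circuit_capacity_watts(volts: float, amps: float,
                           derate: float = CONTINUOUS_DERATE) -> float:
    """Usable continuous power for a circuit, in watts."""
    return volts * amps * derate


def fits(load_watts: float, volts: float, amps: float) -> bool:
    """True if the load stays within the derated capacity of the circuit."""
    return load_watts <= circuit_capacity_watts(volts, amps)


GPU_LOAD_W = 2400.0  # the 2.4 kW figure from the comment, assumed to be GPUs only

# Hypothetical circuit list: (volts, amps)
CIRCUITS = {
    "120 V / 15 A": (120.0, 15.0),
    "120 V / 20 A": (120.0, 20.0),
    "240 V / 30 A": (240.0, 30.0),
}

if __name__ == "__main__":
    for name, (volts, amps) in CIRCUITS.items():
        capacity = circuit_capacity_watts(volts, amps)
        verdict = "ok" if fits(GPU_LOAD_W, volts, amps) else "too small"
        print(f"{name}: {capacity:.0f} W continuous, {verdict}")
```

Under those assumptions, neither 120 V circuit covers 2.4 kW of continuous load (about 1440 W and 1920 W usable), while a 240 V / 30 A circuit (about 5760 W usable) leaves room for the rest of the system.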
tomaytotomato
What a time to be alive. I remember 10 years ago, as a poor student, waiting to buy an ATI Radeon X1600 Pro with 256 MB, yes 256 MB, of RAM.
It cost about £190 in 2006.
Now we have GPUs that cost tens of thousands of pounds with insane performance, but what would their price be without the AI and Datacentre squeeze?
amluto
I wonder whether those cards ran the model that wrote the nonsense about the forces involved.
Hint: when you have a piece of metal stuck with thermal goop to a lot of components, the force doesn’t “concentrate” on one of them. You need to detach it from each one with however much force is needed to detach it from that component.
dwroberts
> Don’t RMA it, and don’t solder it yourself. A local phone-repair chain with a microsolder tech can put a 3 mm SMD part back on a GPU PCB in twenty minutes for the price of dinner. The skill is in your city. You just have to look.
The trouble with this though is, what if that is not the only issue with the card? That’s normally my thought process on reaching for RMA. The unit could be an all-round lemon that should not have passed QA etc. (and as noted in the post itself, working for a week on various tasks is not enough to prove it good)
NwtnsMthd
It's difficult to speculate as to the exact failure from blurry pictures but the solder on that choke (inductor) looks terrible.
Something went wrong in manufacturing. The solder should have wicked to cover the entire pad, not just a small square, and there should be no (brown) discoloration.
stryakr
Why does this post sound like it's an AI story based on the inputs from the engineer?
The phrasing is very Claude-like:
"That cracked joint is the whole story. The card had passed initial bring-up and ran fine at light loads for a week."
"That sequencing matters — it’s why we have a story to tell. The pilot card failed, taught us a lesson, and the lesson is the reason the other three went on without incident."
"Driver swaps, CUDA reinstalls, and inference-engine theories were dead ends I spent hours on. The failure pattern itself told the story — listen to it earlier."
josephg
Cool post. FYI you might be better off getting one big fan for your "radiator" instead of lots of little fans. Big fans don't need to spin as fast as small fans to push the same amount of air. So they run a lot quieter.
Those are SM120 so no tmem/tcgen05 and lack of support in main libraries (it's like everybody is focusing on B300/SM100).
For that money I'd buy a single B300: similar total AI TOPS, similar aggregate GPU bandwidth, only 25% less total memory (probably offset by less implementation complexity), and half the energy consumption...
Also by having all SMs local they have the special L1-level interconnect. SMs can collaborate on the same GEMM. And a bunch of other nice features.
Or, you know, rent it.
voidUpdate
Is that little computer training LLMs from scratch all by itself? That must take years to get any kind of progress, given the scale of training other providers do. Where do you get the training data from?
robin_reala
Complete side note, but I can’t work out how the author managed to mistype “at” as “Δt”.
Edit: reading fail on my part, nothing to see here.
iagooar
Out of curiosity, what are you training with these cards?
sabareesh
Converting four RTX PRO 6000 Blackwell cards to waterblocks, finding a VRM choke loose on the workbench, and getting back to 41k tok/s.
sandworm101
Ditch the tiny DC fans. Build a shroud and switch to a single AC-powered industrial blower / duct fan.
atemerev
If you want ready, well engineered, water-cooled multi-GPU research workstations, my colleagues at https://comino.com build and sell them. Or you can purchase fitted waterblocks from them for many GPUs, and build your own.
lightedman
You can tell this is AI slop by the horrible soldering descriptions - anyone with experience can look at that VRM and go "Oh, the solder is still in its stencil-applied state and has not flowed across the contacts at all, on either component or board. This is a reflow-in-oven issue from the manufacturer." This wasn't a cracking joint; this was a poorly-done joint.
> 4× RTX PRO 6000 Blackwell Workstation (GB202, 96 GB GDDR7, 600 W)
Signed, IPC-610 certified tech.
AI slop post.