Can you folks add performance per watt as a metric to these comparisons? I honestly want to understand where AMD fits in the stack in terms of actual performance per dollar. I have had talks with companies that want to build data centers outside the US and find it hard to source anything Nvidia in sufficient capacity and scale.
The question is whether AMD is competitive on performance per watt and roughly reliable in terms of software support. Performance per watt is what most folks outside the US prioritize above all else, since outside of China and the US electricity tends to be at a relative premium.
Maybe if they make smaller data centers viable at the right price, AMD could be part of the stack outside the US wherever Nvidia is more limited in supply. Though I have genuinely no idea what sourcing an AMD GPU looks like.
I have never seen a company use AMD outside of Wafer and a couple of others, mostly in the US.
Genuinely intriguing, or maybe not really (could be this stuff is common knowledge) and I am just stuck in my Nvidia bubble here.
hassaanr
While cool, quantization to FP4 is practically never lossless in actual use. A lot of providers are advertising high TPS on Kimi and GLM, but the models are functionally lobotomized and no longer close to frontier quality. Would love to see this not be true.
nxtfari
I think we should make it illegal to not specify the quantization in the headline for these types of posts.
mchusma
I was hoping they would discuss some path to improving things faster and cheaper. But in this post it looks like they offer a quantized version for the same price as the full version, and a fast version at a much higher cost.
sometimelurker
I like the metric of tok/joule a lot. it really brings to mind a lot of really nice ideas about energy and work and ideas and thought and efficiency
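For readers who want to put numbers on tok/joule, here is a minimal sketch. A watt is one joule per second, so tokens per joule is just tokens per second divided by power draw. The figures used (2000 tok/s, 1000 W, $0.10 per kWh) are made-up placeholders, not measurements from the post, and the function names are hypothetical.

```python
def tokens_per_joule(tokens_per_second: float, power_watts: float) -> float:
    """Tokens per joule. A watt is one joule per second, so the seconds cancel."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return tokens_per_second / power_watts


def energy_cost_per_million_tokens(
    tokens_per_second: float, power_watts: float, price_per_kwh: float
) -> float:
    """Electricity cost of generating one million tokens, ignoring everything but power."""
    joules_per_million = 1_000_000 / tokens_per_joule(tokens_per_second, power_watts)
    kwh_per_million = joules_per_million / 3_600_000  # 1 kWh = 3.6e6 J
    return kwh_per_million * price_per_kwh


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to make the arithmetic easy to follow.
    tps, watts, price = 2000.0, 1000.0, 0.10
    print(f"{tokens_per_joule(tps, watts):.2f} tok/J")
    print(f"${energy_cost_per_million_tokens(tps, watts, price):.4f} per 1M tokens (energy only)")
```

This only covers electricity at the accelerator. Cooling, hardware amortization, and utilization are left out, so it is a lower bound on cost rather than a price.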
gcanyon
Isn't this pretty much a given? Performance per dollar has to be a ratcheting function because how would something more expensive replace something less expensive?
p1esk
There was noticeable accuracy degradation when they switched from fp8 to mxfp4.
ilaksh
The compute-in-memory and neuromorphic paradigms are likely to push this much, much farther over the next decade as more radical improvements make it out of the lab. Sooner or later it will involve new materials and new nano devices and providing multiple orders of magnitude better efficiency. And just scaling up existing things like MRAM.
tim333
Not a new phenomenon - performance per dollar has been growing exponentially, fairly steadily, since 1900 or so
I'm not surprised to see competition with Blackwell. Rubin is 5x faster than Blackwell at inference - Blackwell is the last generation Nvidia didn't optimize specifically for inference.
If I'm missing something, please let me know!
AussieWog93
The 2600 tok/s is an "aggregate", not the actual throughput.
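To make the aggregate versus per-stream distinction concrete, here is a rough sketch. The 2600 tok/s figure comes from the comment above. The concurrency of 32 streams is purely an assumption for illustration (the thread does not state it), and the even split across streams is a simplification.

```python
def per_stream_throughput(aggregate_tok_s: float, concurrent_streams: int) -> float:
    """Tokens per second seen by one request, assuming an even split across streams."""
    if concurrent_streams < 1:
        raise ValueError("concurrent_streams must be at least 1")
    return aggregate_tok_s / concurrent_streams


def seconds_to_generate(n_tokens: int, aggregate_tok_s: float, concurrent_streams: int) -> float:
    """Wall-clock decode time for a single response of n_tokens."""
    return n_tokens / per_stream_throughput(aggregate_tok_s, concurrent_streams)


if __name__ == "__main__":
    aggregate = 2600.0  # figure quoted in the thread
    streams = 32        # hypothetical concurrency, not from the post
    print(f"{per_stream_throughput(aggregate, streams):.2f} tok/s per stream")
    print(f"{seconds_to_generate(1000, aggregate, streams):.1f} s for a 1000-token reply")
```

The same aggregate number can describe very different user experiences depending on how many streams share it, which is why the per-stream figure is worth asking for.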
conorcleary
*especially as many currencies weaken
johanvts
That sounds literally impossible.
oDot
Do these providers have 80+% gross margins or is something eating into them? Maybe utilization?
adammarples
Slight criticism of the headline there, you can't get cheaper per dollar.
hahahaa
What is a knee, in performance talk?
alienbaby
I'm interested if anyone knows how much legwork the assumed 60% cache hit, plus running a quantised model is doing? Esp. compared to what the headline half implies is a full fat GLM5.2
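One way to gauge how much the cache-hit assumption matters is to blend cached and uncached input prices by hit rate. This is only a sketch: the 60% hit rate is the figure quoted above, while the two per-million-token prices are invented placeholders and the function name is hypothetical.

```python
def blended_input_price(hit_rate: float, cached_price: float, uncached_price: float) -> float:
    """Average input price per 1M tokens when a fraction hit_rate is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_price + (1.0 - hit_rate) * uncached_price


if __name__ == "__main__":
    cached, uncached = 0.10, 1.00  # hypothetical $/1M input tokens, not from the post
    for hit_rate in (0.0, 0.3, 0.6, 0.9):
        price = blended_input_price(hit_rate, cached, uncached)
        print(f"hit rate {hit_rate:.0%}: ${price:.2f} per 1M input tokens")
```

With these placeholder prices, the gap between a 0% and a 60% hit rate is more than a factor of two, so a headline number that bakes in a high hit rate can look much better than a cold-cache workload would. The quantization effect is separate and is not modeled here.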
ilaksh
Can you actually rent an MI355X per hour anywhere right now?
killingtime74
No word on what this actually means for a consumer. What's the price? Is it lower than NVIDIA serving?
BurningFrog
So... the headline is about performance per dollar per dollar?
beffjezos
This is very interesting and yet not at the same time. This looks to be optimized for single-stream LLM traffic, which is not viable to serve in a production setting. It's only interesting to hobbyists who want to run the model locally.
It's genuinely neat that AI can find the right optimization pathways in an AMD inference server to unlock this, but by the same token (pun intended) this is a classic case of benchmark hacking that doesn't stand up to real-world application.
gowthamsaiyadav
The world is not limited to Nvidia; AMD can be used.
calin2k
then why is token per dollar getting more expensive?
yieldcrv
Agentic coding drivers for different architectures is a massive unlock for the world
So much compute is underutilized, waiting for a savant or company to prioritize an architecture, and now all the other engineers can tackle this at any time if they get inspired by the right prompts
zuzululu
yeah but we are still far far away from being able to run the frontier model equivalents locally without significant quantization
even having something like opus 4.8 locally would completely change the landscape
villgax
They fail to mention non-speculative numbers and whether the baseline was nvfp4 as well. So much for erosion against an older gen
bitwize
(in a high-pitched, pathetic regency-era British orphan voice) Please sir, may I have some compute as well?
shevy-java
But RAM prices skyrocketed!
The AI companies owe us money. As does e.g. NVIDIA for becoming a cartel.
Not a new phenomenon - performance per dollar has been growing exponentially, fairly steadily, since 1900 or so
1900 - 2010 https://www.thekurzweillibrary.com/exponential-growth-of-com...
1939 - 2023 https://medium.com/@timventura/kurzweils-law-for-the-ai-age-...