This throughput assumes 100% utilization. A bunch of things raise the cost at scale:
- There are no on-demand GPUs at this scale; you have to rent them on multi-year contracts. So you have to lock in enough GPUs for your maximum throughput (or some sufficiently high percentile), not your average throughput. Peak throughput during West Coast business hours is probably 2-3x the throughput at tail hours (East Coast morning, West Coast evening).
- GPUs are often regionally locked due to data-residency and latency constraints. That makes it hard to keep them busy overnight: Asia doesn't want its data sent to the US, and the US doesn't want its data sent to Asia.
These two factors mean that GPU utilization comes in at 10-20%. Now, if you're a massive company that spends a lot of money on training new models, you could conceivably slot in RL inference or model training to happen in these off-peak hours, maximizing utilization.
But for those companies purely specializing in inference, I would _not_ assume that these 90% margins are real. I would guess that even when it seems "10x cheaper", you're only seeing margins of 50%.
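A rough back-of-envelope of the effect this comment describes, with illustrative numbers (the peak-to-trough ratio, idle fraction, and pricing are assumptions for the sketch, not measurements from the thread):

```python
# Back-of-envelope: how peak provisioning drags down utilization and margin.
# All numbers are illustrative assumptions, not measurements.

peak_to_avg = 2.5            # peak demand ~2-3x average (business hours vs tail)
regional_idle = 0.5          # fraction of the day the region's GPUs sit mostly idle
gpu_cost_per_hr = 31.40 / 8  # assumed per-GPU rate (8xH100 node at ~$31.40/hr)

# You must provision for peak, so average utilization is roughly
# (average demand / peak capacity), further cut by regionally idle hours.
utilization = (1 / peak_to_avg) * (1 - regional_idle)
print(f"effective utilization ~ {utilization:.0%}")

# Effective cost per GPU-hour of *useful* work scales inversely with utilization:
effective_cost = gpu_cost_per_hr / utilization
print(f"cost per useful GPU-hour ~ ${effective_cost:.2f}")

# If pricing was set assuming 100% utilization with a 90% margin,
# the real margin at this utilization is much lower:
price = gpu_cost_per_hr / (1 - 0.90)  # price implied by 90% margin at full load
real_margin = 1 - effective_cost / price
print(f"real margin ~ {real_margin:.0%}")
```

With these assumed inputs, utilization lands at 20% and the "90% margin" collapses to roughly 50%, matching the comment's guess.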
caminanteblanco
There was some tangentially related discussion in this post: https://news.ycombinator.com/item?id=45050415, but this cost analysis answers so many questions and gives me a better idea of how huge a margin a lot of these providers could be taking on inference. Plus I'm sure that Google or OpenAI can get more favorable data center rates than the average Joe Schmoe.
A node of 8 H100s will run you $31.40/hr on AWS, so for all 96 you're looking at $376.80/hr. With 188 million input tokens/hr and 80 million output tokens/hr, that comes out to around $2/million input tokens, and $4.70/million output tokens.
This is actually a lot more than DeepSeek R1's rates of $0.10-$0.60/million input and $2/million output, but I'm sure major providers are not paying AWS p5 on-demand pricing.
Edit: those figures were per node, so the actual input and output prices should be divided by 12: $0.17/million input tokens and $0.39/million output.
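The arithmetic in this comment can be reproduced directly (the AWS p5 on-demand rate and per-node throughput figures are taken from the comment itself):

```python
# Reproduce the cost-per-token arithmetic from the comment above.
node_cost_hr = 31.40                   # 8xH100 p5 node, AWS on-demand ($/hr)
nodes = 12                             # 96 GPUs / 8 GPUs per node
cluster_cost_hr = node_cost_hr * nodes # $376.80/hr for the whole cluster

# The throughput figures quoted were per *node*, so scale by node count.
input_tok_hr = 188e6 * nodes           # input tokens/hr across the cluster
output_tok_hr = 80e6 * nodes           # output tokens/hr across the cluster

cost_per_m_input = cluster_cost_hr / (input_tok_hr / 1e6)
cost_per_m_output = cluster_cost_hr / (output_tok_hr / 1e6)
print(f"${cost_per_m_input:.2f}/M input, ${cost_per_m_output:.2f}/M output")
# ~ $0.17/M input, ~$0.39/M output
```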
Super helpful to see actual examples of what it (roughly) can look like to deploy production inference workloads, and also the latest optimization efforts.
I consult in this space and clients still don't fully understand how complex it can get to just "run your own LLM".
34679
"By deploying this implementation locally, it translates to a cost of $0.20/1M output tokens"
Is that just the cost of electricity, or does it include the cost of the GPUs spread out over their predicted lifetime?
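A hedged sketch of why this question matters: with plausible assumed figures (power draw, GPU price, lifetime, and electricity rate below are all assumptions, not numbers from the article), hardware amortization dominates electricity by a wide margin.

```python
# Sketch: electricity-only vs fully amortized cost per million output tokens.
# Every number here is an illustrative assumption.

gpus = 96
gpu_price = 30_000           # assumed $/GPU (H100 prices vary widely)
lifetime_hr = 5 * 365 * 24   # assumed 5-year useful life
power_kw_per_gpu = 0.7       # ~700W TDP per H100
elec_per_kwh = 0.10          # assumed $/kWh
output_tok_hr = 80e6 * 12    # per-node output figure x 12 nodes (from the thread)

elec_hr = gpus * power_kw_per_gpu * elec_per_kwh        # electricity $/hr
amort_hr = gpus * gpu_price / lifetime_hr               # amortized hardware $/hr

elec_per_m = elec_hr / (output_tok_hr / 1e6)
total_per_m = (elec_hr + amort_hr) / (output_tok_hr / 1e6)
print(f"electricity only:    ${elec_per_m:.3f}/M output tokens")
print(f"with amortized GPUs: ${total_per_m:.3f}/M output tokens")
```

Under these assumptions the amortized hardware cost is roughly 10x the electricity cost, so an "electricity only" figure would drastically understate the true cost per token.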
s46dxc5r7tv8
Separating the prefill and decode stages with SGLang is quite nifty! Normally 8xH100 would barely be able to hold the 4-bit quantization of the model without even considering the KV cache. One prefill node for three decode nodes is also fascinating; nice writeup.
Just in case you have $3-4M lying around somewhere for some high quality inference. :)
SGLang quotes a 2.5-3.4x speedup as compared to the H100s. They also note that more optimizations are coming, but they haven't yet published a part 2 on the blog post.
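A quick memory back-of-envelope for the fitting question raised above (the ~671B parameter count is DeepSeek R1's published size; the rest is rough arithmetic, ignoring activation and framework overhead):

```python
# Rough memory footprint of a ~671B-parameter model on one 8xH100 node.
params = 671e9
bytes_per_param_4bit = 0.5   # 4-bit quantized weights
bytes_per_param_fp8 = 1.0    # FP8 weights (the model's native precision)

hbm_total_gb = 8 * 80        # 8x H100 with 80 GB HBM each

weights_4bit_gb = params * bytes_per_param_4bit / 1e9
weights_fp8_gb = params * bytes_per_param_fp8 / 1e9
print(f"4-bit weights: ~{weights_4bit_gb:.0f} GB of {hbm_total_gb} GB HBM")
print(f"FP8 weights:   ~{weights_fp8_gb:.0f} GB (exceeds a single node)")
# Whatever HBM remains after weights must hold the KV cache and activations,
# which is one reason multi-node, disaggregated serving helps.
```

So a 4-bit quantization fits on one node with room left for KV cache, while the native FP8 weights alone exceed a single node's HBM, pushing deployments toward multi-node setups like the one in the writeup.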
guerrilla
Now if only it would stop prefacing all its output with "Of course!" ;)
This is why I use DS though. I think it's the only ethical option due to its efficiency, and that outweighs all other considerations at this point.
abdellah123
Wow, please edit the title to include Open-source!
Interestingly, this is 10x cheaper than the cheapest provider on OpenRouter: https://openrouter.ai/deepseek/deepseek-r1?sort=price
Inference is more profitable than I thought.
The SGLang Team has a follow-up blog post that talks about DeepSeek inference performance on GB200 NVL72: https://lmsys.org/blog/2025-06-16-gb200-part-1/