I like that the output rendering is closer to typical UIs -- syntax highlighting in code mode, tool calls, dim-italic reasoning.
One feature mine has that the author, or anyone else who vibe codes their own version after seeing this, might like to steal is modeling the distribution of output latencies. My implementation is hacky (log-normal roughly estimated from p50, p90, and p99 values), but still, when you set those to realistic values, it recreates the "jitter" you see in many LLM UIs.
antirez is right that generation tok/s isn't flat as a function of context length, which is a weakness of both simulators.
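For anyone who wants to try the latency-distribution idea, here is a minimal sketch of one way to do it. It is not the code behind either simulator: the function names, the averaging of the two sigma estimates, and the example percentile values are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def fit_lognormal(p50, p90, p99):
    """Rough log-normal fit from three latency percentiles (same units)."""
    mu = math.log(p50)  # the median of a log-normal is exp(mu)
    z90 = NormalDist().inv_cdf(0.90)
    z99 = NormalDist().inv_cdf(0.99)
    # Each upper percentile yields its own sigma estimate; average the two.
    sigma_from_p90 = (math.log(p90) - mu) / z90
    sigma_from_p99 = (math.log(p99) - mu) / z99
    return mu, (sigma_from_p90 + sigma_from_p99) / 2


def sample_delays(mu, sigma, n, seed=None):
    """Draw n inter-token delays from the fitted distribution."""
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


# Made-up inter-token latencies in milliseconds, purely illustrative.
mu, sigma = fit_lognormal(p50=20.0, p90=45.0, p99=120.0)
delays_ms = sample_delays(mu, sigma, n=1000, seed=0)
```

Sleeping for each sampled delay before emitting the next token, instead of a fixed 1/rate interval, is what would produce the uneven "jitter".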
SXX
I think your demo needs more realistic thinking logs, because thinking usually burns at least 2x to 3x the tokens of the code, and much more for harder tasks.
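To put a rough number on that: if every code token costs some multiple of itself in reasoning tokens, the rate at which code actually shows up is the decode rate divided by one plus that multiple. A small sketch, where the function name and the 30 tok/s figure are illustrative assumptions:

```python
def visible_code_rate(decode_tps, thinking_ratio):
    """Average code tokens/s when each code token costs `thinking_ratio` reasoning tokens."""
    return decode_tps / (1 + thinking_ratio)


for ratio in (2, 3):
    rate = visible_code_rate(30, ratio)
    print(f"30 tok/s with {ratio}x thinking -> {rate:.1f} code tok/s")
```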
ricardobeat
It's interesting how even 5 tok/s is still much faster than you'd typically type, but feels glacially slow for an agent.
On the other hand, I've been using Mimo and Minimax a lot recently. They routinely reach 100-150 tokens per second and that feels too fast, to the point where it's hard to keep up with what it's actually doing. Great for subagents though.
jerf
I'm flashing back to using a 1200 baud modem when the world was on 28.8k. Modems are much more regular-looking, though, since each character is a character. Unless you count color changes and such, which you only really notice at 1200...
aurareturn
We truly are in the dial up era of GenAI.
Aurornis
Cool visualization, but most of the token generation in my sessions doesn't go to output code or even the text I see. Reasoning tokens make up most of the output. That can only occur after processing the input files and context.
For non-trivial work I go through hundreds of thousands of tokens (combined prefill + tg of course) before even getting to some useful text output.
I mostly use LLMs for exploration and studies, rarely code generation. Prefill matters heavily for this. Even with a prefill rate in the high hundreds or low thousands, I spend a lot of time waiting on the LLM (doing other things, not twiddling my thumbs).
unglaublich
30tok/s looks fine when you're just streaming code, but the issue is that there's a lot of background noise like tool-calling conventions, metadata, "thinking", etc.
antirez
Token/sec only makes sense once you tell me four things:
1. decoding t/s, that is, when the model is generating text in the autoregressive fashion.
2. prefill t/s, that is, prompt processing speed.
3. What is the slope of those two numbers as the context size increases? An implementation that decodes at 50 t/s with 2k context but at 7 t/s with 100k context is going to be a lot less useful than it seems at first glance for a large number of real-world use cases.
4. What's your use case? Reading a huge text and then producing a small output like fraud probability=12%? Or reading a small question and generating a lot of text? This substantially changes whether a model is usable, given its prefill/decoding speed.
For instance my DS4F inference on the DGX Spark does prefill at 350 t/s and at 200 t/s on already large contexts. But decodes at 13 t/s.
On the Mac Ultra the prefill is like 400 t/s and decoding 35 t/s.
The two systems can perform dramatically differently or almost the same based on the use case. In general, for local inference to be acceptable, even if slow, you want at least 100 t/s prefill and at least 10 t/s generation. To be ok-ish, 200 to 400 t/s prefill and 15-25 t/s generation. To be a wonderful experience, thousands of t/s prefill and 100 t/s generation.
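A back-of-the-envelope sketch of why the use case matters, using the prefill and decode figures quoted above. The two workloads are made-up examples, and the model assumes flat rates, so it ignores the slope from point 3 (for instance the drop to 200 t/s prefill on large contexts).

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    prefill_tps: float  # prompt processing speed, tokens/s
    decode_tps: float   # generation speed, tokens/s


def response_seconds(m, prompt_tokens, output_tokens):
    """Time to read the prompt plus time to generate the answer, at flat rates."""
    return prompt_tokens / m.prefill_tps + output_tokens / m.decode_tps


spark = Machine("DGX Spark", prefill_tps=350, decode_tps=13)
mac = Machine("Mac Ultra", prefill_tps=400, decode_tps=35)

# Hypothetical workloads: (prompt tokens, output tokens).
workloads = {
    "long prompt, short answer": (100_000, 20),
    "short prompt, long answer": (200, 2_000),
}

for label, (prompt, output) in workloads.items():
    t_spark = response_seconds(spark, prompt, output)
    t_mac = response_seconds(mac, prompt, output)
    print(f"{label}: {spark.name} {t_spark:.0f}s, {mac.name} {t_mac:.0f}s, "
          f"ratio {t_spark / t_mac:.2f}x")
```

With these numbers the prefill-heavy workload lands within roughly 15% on the two machines, while the decode-heavy one differs by more than 2.5x.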
bjelkeman-again
Interesting. It seems to me that with that speed (20-30) on local hardware the real issue is quality of output, not tokens per sec.
adampzakaria
This is awesome!! I use Cursor and I've been trending towards medium thinking models as much as possible - I don't like the dev cadence with something like opus 4.7 (thinking: very high) (great for some tasks, like complex plans). Eventually I'd like to make my way to open models and open harness, and this tool or something like it could help me understand what performance I'd need for productive work - bookmarked!
dinkleberg
This reminds me of when I signed up for cerebras to try it out and dumped $20 in and hooked it into opencode and the speed was truly insane. But my one session burnt through like $15 of that in seemingly a matter of minutes. I've since used those really high tok/s options for specific application use cases, but would not advise as a coding agent. Much harder to catch issues when it is moving a million miles an hour and then it is too late and it has already spent a ton of tokens.
flockonus
Curious about the other way around: how many tokens per second does a productive developer code, averaged over a day?
emehrkay
I just looked up what my computer is capable of (m2 MacBook Air) and it says 15-35 tokens per second. I could live with that writing code with a local model.
ohadron
This is great.
Agentic coding at 600+ tokens/sec is going to be a radically different beast.
Coming soon-ish?
johng
Neat website, the visualization is great. I had a hard time wrapping my head around the tokens/s thing but this made it easy.
aurareturn
One thing I noticed is that prior to AI coding agents, I used to be able to tolerate 10 tokens/s. With AI Agents, I think 60 is minimum.
raverbashing
On avg 1 token = 4 chars
So 75 tokens/s is ~ 300 chars per second which is the speed you'd get with a 2400 baud modem
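The arithmetic, spelled out as a sketch. The 4 chars/token average comes from the comment above; the 8 bits per character figure is an assumption that ignores serial framing, and with 8N1 framing (one start bit and one stop bit per character) the same throughput needs 10 bits per character.

```python
CHARS_PER_TOKEN = 4  # rough average quoted in the comment above


def chars_per_second(tokens_per_second):
    return tokens_per_second * CHARS_PER_TOKEN


def line_rate_bps(cps, bits_per_char=8):
    """Bits per second needed to carry `cps` characters per second."""
    return cps * bits_per_char


cps = chars_per_second(75)
print(cps, "chars/s")
print(line_rate_bps(cps), "bit/s at 8 bits per char")
print(line_rate_bps(cps, bits_per_char=10), "bit/s with 8N1 framing")
```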
casey2
Not very far til we reach 1MTk/s per LLM. Computing is going to look very different in the future.
niek_pas
RIP my browser history, I guess
kirugan
Nice, I always thought 15 tok/sec is too "slow"
bob1029
The non-linear scaling on the slider is an excellent UX.
dfollent
Neat visual. 5 tok/s is still faster than me!
dario-dentes
Thank you for this great utility. I love the "gut feel" calibration utilities like this one!
tantalor
> Now switch between c and t at the same rate. The difference is striking — and intentional.
Very cool!
> Unless you've actually watched tokens stream at those rates, the numbers are hard to internalize. This is the rendering.
I built something similar recently, for the same reason: https://modal.com/llm-almanac/token-timing-simulator.
> Now switch between c and t at the same rate. The difference is striking — and intentional.
I don't see a big difference.
This is cool, thanks for making it.
super cool, thanks
This is great.