Tried this out with Cline using my own API key (Cerebras is also available as a provider for Qwen3 Coder via openrouter here: https://openrouter.ai/qwen/qwen3-coder) and realized that without caching, this becomes very expensive very quickly. Specifically, after each new tool call, you're sending the entire previous message history as input tokens, which are priced at $2/1M via the API, just like output tokens.
The quality is also not quite what Claude Code gave me, but the speed is definitely way faster. If Cerebras supported caching & reduced token pricing for using the cache I think I would run this more, but right now it's too expensive per agent run.
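To make the "resend the whole history on every tool call" cost concrete, here is a rough back-of-the-envelope sketch. Only the $2 per 1M input token price comes from the comment above; the starting context size, the tokens added per call, and the number of tool calls are made-up illustrative values, and the function names are hypothetical.

```python
# Rough cost model for an agent run without prompt caching.
# Only the $2 / 1M token price is taken from the comment; every other
# number below is an assumption chosen for illustration.
PRICE_PER_MILLION_USD = 2.00


def agent_run_input_tokens(initial_context: int,
                           tokens_added_per_call: int,
                           tool_calls: int) -> int:
    """Total input tokens billed when each request resends the full history."""
    total = 0
    history = initial_context
    for _ in range(tool_calls):
        total += history                   # whole history is sent as input again
        history += tokens_added_per_call   # tool result + reply grow the history
    return total


def cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Convert a token count to dollars at a flat per-million price."""
    return tokens / 1_000_000 * price_per_million


tokens = agent_run_input_tokens(initial_context=10_000,
                                tokens_added_per_call=2_000,
                                tool_calls=50)
print(tokens, round(cost_usd(tokens), 2))  # 2950000 5.9
```

Under these assumed numbers the history itself never exceeds about 110k tokens, yet the run is billed for roughly 2.95M input tokens, because the total grows quadratically with the number of tool calls.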
thanhhaimai
> running at speeds of up to 2,000 tokens per second, with a 131k-token context window, no proprietary IDE lock-in, and no weekly limits!
I was excited, then I read this:
> Send up to 1,000 messages per day—enough for 3–4 hours of uninterrupted vibe coding.
I don't mind paying for services I use. But it's hard to take this seriously when the claim in the first paragraph contradicts the fine print.
crawshaw
If you would like to try this in a coding agent (we find the qwen3-coder model works really well in agents!), we have been experimenting with Cerebras Code in Sketch. We just pushed support, so you can run it with the latest version, 0.0.33:
Our experience is it seems overloaded right now, to the point where we have better results with our usual hosted version:
sketch --model=qwen
unraveller
Some users who signed up for pro ($50 p.m.) are reporting limitations beyond those advertised.
> While they advertise a 1,000-request limit, the actual daily constraint is a 7.5 million-token limit. [1]
That assumes an average of 7.5k tokens/request, whereas their marketing videos show API requests ballooning by ~24k per request. Still lower than the API price.
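The arithmetic behind that comparison, as a quick sketch. The 7.5M daily cap, the 1,000-message figure, and the ~24k number are from the comments; treating ~24k as the average tokens consumed per request is one possible reading and is an assumption here, as are the function names.

```python
# Daily budget arithmetic for the reported $50 plan limits.
DAILY_TOKEN_LIMIT = 7_500_000   # reported daily token cap
ADVERTISED_MESSAGES = 1_000     # advertised messages per day


def implied_tokens_per_request(limit: int, messages: int) -> float:
    """Average request size at which both limits run out together."""
    return limit / messages


def requests_before_cap(limit: int, tokens_per_request: int) -> int:
    """Whole requests that fit under the daily token cap."""
    return limit // tokens_per_request


print(implied_tokens_per_request(DAILY_TOKEN_LIMIT, ADVERTISED_MESSAGES))  # 7500.0
print(requests_before_cap(DAILY_TOKEN_LIMIT, 24_000))                      # 312
```

So if requests really average around 24k tokens, the token cap would bite after roughly 300 requests, well before the advertised 1,000 messages.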
2k tokens/second is insane. While I'm very much against vibe coding, such performance essentially means you can get near-GitHub Copilot-level speed with drastically better quality.
For in-editor use that's game-changing.
rbitar
This token throughput is incredible and is going to set a new bar in the industry. The main issue with the Cerebras Code plan is that the number of requests per minute is throttled, and with agentic coding systems each tool call is treated as a new "message", so you can easily hit the API limits (10 messages/minute).
One workaround that seems to work for us is to use Claude for all tasks but delegate specific tools to the cerebras/qwen-3-coder-480b model to generate files or handle other token-heavy tasks, which avoids spiking the total number of requests. This has cost and latency consequences (and adds complexity to the code), but until those throttle limits are lifted it seems to be a good combo. I also find that Claude has better quality with tool selection when the number of tools required is > 15, which is the case in our current setup.
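For anyone hitting the per-minute throttle from their own agent loop, a client-side limiter is one generic way to stay under it. This is a minimal sketch, not the setup described above: the class and method names are hypothetical, and the only number taken from the comment is the 10 messages/minute limit.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side throttle: at most max_calls requests per window_seconds."""

    def __init__(self, max_calls: int = 10, window_seconds: float = 60.0,
                 clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock      # injectable so the limiter can be tested
        self.sent = deque()     # timestamps of requests still inside the window

    def wait_time(self) -> float:
        """Seconds to wait before the next request is allowed (0.0 if none)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window_seconds:
            self.sent.popleft()
        if len(self.sent) < self.max_calls:
            return 0.0
        return self.window_seconds - (now - self.sent[0])

    def record(self) -> None:
        """Note that a request was just sent."""
        self.sent.append(self.clock())

    def acquire(self) -> None:
        """Block until a request is allowed, then record it."""
        delay = self.wait_time()
        if delay > 0:
            time.sleep(delay)
        self.record()
```

Calling `limiter.acquire()` before each model request (including each tool-call round trip) keeps the agent below the limit, at the cost of stalling once the tenth message in a minute has gone out.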
namanyayg
I was waiting for more subscription-based services to pop up to compete with the inference providers on a commodity level.
I think a lot more companies will follow suit and the competition will make pricing much better for the end user.
Congrats on the launch, Cerebras team!
scosman
Anyone get this working in Cursor? I can connect openrouter just fine, but Cerebras just errors out instantly. Same url/key works via curl, so some sort of Cerebras/Cursor compatibility issue.
ktsakas
Does it work with claude-code-router? I was getting API errors this week trying to use qwen3 Cerebras through OpenRouter with Claude code router.
sneilan1
I'm so excited to see a real competitor to Claude Code! Gemini CLI, while decent, does not have a $200/month pricing model and they charge per API access; Codex is the same. I'm trying to get into https://cloud.cerebras.ai/ to try the $50/month plan but I can't even get in.
lvl155
Their hardware is incredible. Why aren’t more investors lining up for this in this environment?
hereme888
So for <$1.7/day I can hire a programmer at a sort-of Claude Sonnet 4 level? I know it's got its quirks, limits, and needs supervision, but it's like 20x cheaper than an average programmer.
Ok it's fast, but rate limits seem to kick in extremely quickly and the results are less good than Claude Code and it ends up more expensive?
Who is the intended audience for Cerebras?
clbrmbr
At $200/month the comparable should be Opus 4 not Sonnet 4.
sophia01
My understanding is that the coding agents people use can be modified to plug into any LLM provider's API?
The difference here seems to be that Cerebras does not appear to have Qwen3-Coder through their API! So now there is a crazy fast (and apparently good too?) model that they only provide if you pay the crazy monthly sub?
ixel
The usage limit on Cerebras Code is rather low: the $50 plan apparently gives you 7.5 million tokens per day, which doesn't last long. This also isn't clearly advertised on the plans prior to purchasing.
attentive
Attn: Cerebras
Any attempt to deal with "<think>" in the code gets it replaced with "<tool_call>".
Both in inference.cerebras.ai chat and API.
Same model on chat.qwen.ai doesn't do it.
another_twist
How does context buildup generally work for the code-generating machines?
Do the programs just use human notes + current code directly? Are there some specific ranking steps that need to be done?
JackYoustra
I've been waiting on this for a LONG time. Integration with Cursor when Cerebras released their earlier models was patchy at best, even through openrouter. It's nice to finally see official support, although I'm a bit worried that, long-term, the time for bash MCP calls will end up dominating.
Still, definitely the right direction!
EDIT: doesn't seem like anything but a first-party api with a monthly plan.
deevus
I'm finding myself switching between subscriptions to ChatGPT, T3 Chat, DeepSeek, Claude Code etc. Their subscription models aren't compatible with making it easy to take your data with you. I wish I could try this out and import all my data.
jedisct1
I'm a little bit confused.
I subscribed to the $50 plan. It's super fast for sure, but rate limits kick in after just a couple of requests, completely defeating the point of fast responses.
Did I miss something?
unshavedyak
Super curious to see some comparisons to claude code. Especially Opus, since they're primarily comparing it to Sonnet in that graph.
atkailash
I use regular Cerebras for the plan stage in Cline, so I’m very excited to try this out.
lxe
Is this available as cline/roo-code integration? I think it might be on openrouter too.
dpkirchner
For those that have tried this, what kind of time-to-first-token latency are you seeing?
scosman
Groq also probably has this in the works. Fun times.
cellis
What are the token prices?
knicholes
It says it works with your favorite IDE-- How do you (the reader) plan to use this? I use Cursor, but I'm not sure if this replaces my need to pay for Cursor, or if I need to pay for Cursor AND this, and add in the LLM?
Or is VS code pretty good at this point? Or is there something better? These are the only two ways I'd know how to actually consume this with any success.
esafak
They should just host all the latest open source models FTW.
HardCodedBias
This has to be a monstrous money loser.
If they can maintain this pricing level, and if Qwen3‑Coder is as good as people say, then they will have an enormous hit on their hands. A massive money-losing hit, but a hit.
Very interesting!
PS: Did they reduce the context window? It looks like it.
romanovcode
> and no weekly limits!
No weekly limits so far. Just wait: if you get the same or more traction as Claude, you are going to follow the same playbook they did.
[1] https://old.reddit.com/r/LocalLLaMA/comments/1mfeazc/cerebra...
Windsurf also has Cerebras/Qwen3-Coder. 1000 user messages per month for $15
https://x.com/windsurf/status/1951340259192742063
FYI, you are probably going to use up your tokens because there's a total limit of tokens per day, so in about 300 requests it's feasible to use it all up. See https://www.reddit.com/r/LocalLLaMA/comments/1mfeazc/cerebra...
How is this even possible?