Ok I find it funny that people compare models and are like, opus 4.7 is SOTA and is much better etc, but I have used glm 5.1 (I assume this comes form them training on both opus and codex) for things opus couldn't do and have seen it make better code, haven't tried the qwen max series but I have seen the local 122b model do smarter more correct things based on docs than opus so yes benchmarks are one thing but reality is what the modes actually do and you should learn and have the knowledge of the real strengths that models posses. It is a tool in the end you shouldn't be saying a hammer is better then a wrench even tho both would be able to drive a nail in a piece of wood.
show comments
seanw265
Kimi K2.6 also released today. I think it's fair to compare the two models.
Qwen appears to be much more expensive:
- Qwen: $1.3 in / $7.8 out
- Kimi: $0.95 in / $4 out
--
The announcement posts only share two overlapping benchmark results. Qwen appears to score slightly lower on SWE-Bench Pro and Terminal-Bench 2.0.
Qwen:
- Teminal-Bench 2.0: 65.4
- SWE-Bench Pro: 57.3
Kimi:
- Terminal-Bench 2.0: 66.8
- SWE-Bench Pro: 58.6
--
Different models have different strong suits, and benchmarks don't cover everything. But from a numbers perspective, Kimi looks much more appealing.
show comments
ninjahawk1
The way to develop in this space seems to be to give away free stuff, get your name out there, then make everything proprietary. I hope they still continue releasing open weights. The day no one releases open weights is a sad day for humanity. Normal people won’t own their own compute if that ever happens.
show comments
0xbadcafebee
Everybody's out here chasing SOTA, meanwhile I'm getting all my coding done with MiniMax M2.5 in multiple parallel sessions for $10/month and never running into limits.
show comments
jjice
With them comparing to Opus 4.5, I find it hard to take some of these in good faith. Opus 4.7 is new, so I don't expect that, but Opus 4.6 has been out for quite some time.
While Qwen advertises large context windows, in practice the effectiveness of long-context usage seems to depend heavily on its context caching behavior. According to the official documentation, Qwen provides both implicit and explicit context caching, but these come with constraints such as short TTL (around a few minutes), prefix-based matching, and minimum token thresholds.
Because of these constraints, especially in workflows like coding agents where context grows over time, cache reuse may not scale as effectively as expected. As a result, even though the per-token price looks low, the effective cost in long sessions can feel higher due to reduced cache hit rates and repeated computation.
That said, in certain areas such as security-related tasks, I’ve personally had cases where Qwen performed better than Opus.
In my personal experience, Qwen tends to perform much better than Opus on shorter units like individual methods or functions. However, when looking at the overall coding experience, I found it works better as a function-level generator rather than as an autonomous, end-to-end coding assistant like Claude.
show comments
greyskull
I've been using Claude Code regularly at work for several months, and I successfully used it for a small personal project (a website) not long ago. Last weekend, I explored self-hosting for the first time.
Does anyone have a similar experience of having thoroughly used CC/Codex/whatever and also have an analogous self-hosted setup that they're somewhat happy with? I'm struggling a bit.
I have 32GB of DDR5 (seems inadequate nowadays), an AMD 7800X3D, and an RTX 4090. I'm using Windows but I have WSL enabled.
I tried a few combinations of ollama, docker desktop model runner, pi-coding-agent and opencode; and for models, I think I tried a few variants each of Gemma 4, Qwen, GLM-5.1. My "baseline" RAM usage was so high from the handful of regular applications that IIRC it wasn't enough to use the best models; e.g., I couldn't run Gemma4-31B.
Things work okay in a Windows-only setup, though the agent struggled to get file paths correct. I did have some success running pi/opencode in WSL and running ollama and the model via docker desktop.
In terms of actual performance, it was painfully slow compared to the throughput I'm used to from CC, and the tooling didn't feel as good as the CC harness. Admittedly I didn't spend long enough actually using it after fiddling with setup for so long, it was at least a fun experiment.
show comments
fr3on
The irony of this announcement is in the name: Max-Preview is proprietary, cloud-only. The Qwen models that actually matter — the ones running on real hardware people own — are the open weights series. I run the 32B and 72B variants locally on dual A4000s. The gap between those and the hosted Max is real, but it's shrinking with every release. The interesting question isn't how Max compares to Opus. It's how long until the open-weight tier makes the cloud tier irrelevant for most workloads.
trvz
The fun thing is, you can be aware of the entire range of Qwen models that are available for local running, but not at all about their cloud models.
I knew of all the 3.5’s and the one 3.6, but only now heard about the Plus.
show comments
wg0
Notice the pattern that Chinese providers are now:
1. Keeping models closed source.
2. Jacking up pricing. A lot. Sometimes up to 100% increase.
show comments
atilimcetin
Nowadays, I'm working on a realtime path tracer where you need proper understanding of microfacet reflection models, PDFs, (multiple) importance sampling, ReSTIR, etc.. Saying that mine is a somewhat specific use case.
And I use Claude, Gemini, GLM, Qwen to double check my math, my code and to get practical information to make my path tracer more efficient. Claude and Gemini failed me more than a couple of times with wrong, misleading and unnecessary information but on the other hand Qwen always gave me proper, practical and correct information. I’ve almost stopped using Claude and Gemini to not to waste my time anymore.
Claude code may shine developing web applications, backends and simple games but it's definitely not for me. And this is the story of my specific use case.
show comments
johnnyApplePRNG
Nowhere near the power of ChatGPT 5.4 Pro imho... thought for maybe 15 seconds on a problem that pro would have spend 15 minutes on... and the results really show :/
djyde
I've been using glm5.1 for pretty much all my coding work, but Claude is too expensive for me. Haven't tried qwen yet though. China's coding models are now very cost-effective.
show comments
Oras
I find it odd that none of OpenAI models was used in comparison, but used Z GLM 5.1. Is Z (GLM 5.1) really that good? It is crushing Opus 4.5 in these benchmarks, if that is true, I would have expected to read many articles on HN on how people flocked CC and Codex to use it.
show comments
XCSme
A bit weird to be comparing it to Opus-4.5 when 4.7 was released...
chatmasta
Is this going to be an open weights model or not? The post doesn’t make it clear. It seems the weights are not available today, but maybe that’s because it’s in preview?
show comments
digimantis
i dont get why people defend $200/month models against open source model that cost 1/10 of the price, like literally
marsulta
I think the benchmarks and numbers need to be easier to read. Those benchmarks are useless to the regular consumer.
o10449366
I have the M3 Max MBP with 128 GB of memory and the 40 core GPU. What's the best local model I can run today for coding?
show comments
fragmede
This thing on Celebras is going to be ridiculous.
Aeroi
why do people continue to benchmark their sota models against older models.
xmly
Very impressive!
DeathArrow
I am trying since one week to subscribe Alibaba Coding Plan (to use Qwen 3.6 Plus) but it's always out of stock.
They brag about Qwen but don't let people use it.
dakolli
ToKeN PrIcEs ArE gOiNg tO PluMmEt, InTelLigEnCe WiLl Be AfForDaBlE FoR EvErYOnE
Ok I find it funny that people compare models and are like, opus 4.7 is SOTA and is much better etc, but I have used glm 5.1 (I assume this comes form them training on both opus and codex) for things opus couldn't do and have seen it make better code, haven't tried the qwen max series but I have seen the local 122b model do smarter more correct things based on docs than opus so yes benchmarks are one thing but reality is what the modes actually do and you should learn and have the knowledge of the real strengths that models posses. It is a tool in the end you shouldn't be saying a hammer is better then a wrench even tho both would be able to drive a nail in a piece of wood.
Kimi K2.6 also released today. I think it's fair to compare the two models.
Qwen appears to be much more expensive:
- Qwen: $1.3 in / $7.8 out
- Kimi: $0.95 in / $4 out
--
The announcement posts only share two overlapping benchmark results. Qwen appears to score slightly lower on SWE-Bench Pro and Terminal-Bench 2.0.
Qwen:
- Teminal-Bench 2.0: 65.4
- SWE-Bench Pro: 57.3
Kimi:
- Terminal-Bench 2.0: 66.8
- SWE-Bench Pro: 58.6
--
Different models have different strong suits, and benchmarks don't cover everything. But from a numbers perspective, Kimi looks much more appealing.
The way to develop in this space seems to be to give away free stuff, get your name out there, then make everything proprietary. I hope they still continue releasing open weights. The day no one releases open weights is a sad day for humanity. Normal people won’t own their own compute if that ever happens.
Everybody's out here chasing SOTA, meanwhile I'm getting all my coding done with MiniMax M2.5 in multiple parallel sessions for $10/month and never running into limits.
With them comparing to Opus 4.5, I find it hard to take some of these in good faith. Opus 4.7 is new, so I don't expect that, but Opus 4.6 has been out for quite some time.
https://www.alibabacloud.com/help/en/model-studio/context-ca... I’ve also been testing models like Opus, Codex, and Qwen, and Qwen is strong in many coding tasks. However, my main concern is how it behaves in long-running sessions.
While Qwen advertises large context windows, in practice the effectiveness of long-context usage seems to depend heavily on its context caching behavior. According to the official documentation, Qwen provides both implicit and explicit context caching, but these come with constraints such as short TTL (around a few minutes), prefix-based matching, and minimum token thresholds.
Because of these constraints, especially in workflows like coding agents where context grows over time, cache reuse may not scale as effectively as expected. As a result, even though the per-token price looks low, the effective cost in long sessions can feel higher due to reduced cache hit rates and repeated computation.
That said, in certain areas such as security-related tasks, I’ve personally had cases where Qwen performed better than Opus.
In my personal experience, Qwen tends to perform much better than Opus on shorter units like individual methods or functions. However, when looking at the overall coding experience, I found it works better as a function-level generator rather than as an autonomous, end-to-end coding assistant like Claude.
I've been using Claude Code regularly at work for several months, and I successfully used it for a small personal project (a website) not long ago. Last weekend, I explored self-hosting for the first time.
Does anyone have a similar experience of having thoroughly used CC/Codex/whatever and also have an analogous self-hosted setup that they're somewhat happy with? I'm struggling a bit.
I have 32GB of DDR5 (seems inadequate nowadays), an AMD 7800X3D, and an RTX 4090. I'm using Windows but I have WSL enabled.
I tried a few combinations of ollama, docker desktop model runner, pi-coding-agent and opencode; and for models, I think I tried a few variants each of Gemma 4, Qwen, GLM-5.1. My "baseline" RAM usage was so high from the handful of regular applications that IIRC it wasn't enough to use the best models; e.g., I couldn't run Gemma4-31B.
Things work okay in a Windows-only setup, though the agent struggled to get file paths correct. I did have some success running pi/opencode in WSL and running ollama and the model via docker desktop.
In terms of actual performance, it was painfully slow compared to the throughput I'm used to from CC, and the tooling didn't feel as good as the CC harness. Admittedly I didn't spend long enough actually using it after fiddling with setup for so long, it was at least a fun experiment.
The irony of this announcement is in the name: Max-Preview is proprietary, cloud-only. The Qwen models that actually matter — the ones running on real hardware people own — are the open weights series. I run the 32B and 72B variants locally on dual A4000s. The gap between those and the hosted Max is real, but it's shrinking with every release. The interesting question isn't how Max compares to Opus. It's how long until the open-weight tier makes the cloud tier irrelevant for most workloads.
The fun thing is, you can be aware of the entire range of Qwen models that are available for local running, but not at all about their cloud models.
I knew of all the 3.5’s and the one 3.6, but only now heard about the Plus.
Notice the pattern that Chinese providers are now:
1. Keeping models closed source.
2. Jacking up pricing. A lot. Sometimes up to 100% increase.
Nowadays, I'm working on a realtime path tracer where you need proper understanding of microfacet reflection models, PDFs, (multiple) importance sampling, ReSTIR, etc.. Saying that mine is a somewhat specific use case.
And I use Claude, Gemini, GLM, Qwen to double check my math, my code and to get practical information to make my path tracer more efficient. Claude and Gemini failed me more than a couple of times with wrong, misleading and unnecessary information but on the other hand Qwen always gave me proper, practical and correct information. I’ve almost stopped using Claude and Gemini to not to waste my time anymore.
Claude code may shine developing web applications, backends and simple games but it's definitely not for me. And this is the story of my specific use case.
Nowhere near the power of ChatGPT 5.4 Pro imho... thought for maybe 15 seconds on a problem that pro would have spend 15 minutes on... and the results really show :/
I've been using glm5.1 for pretty much all my coding work, but Claude is too expensive for me. Haven't tried qwen yet though. China's coding models are now very cost-effective.
I find it odd that none of OpenAI models was used in comparison, but used Z GLM 5.1. Is Z (GLM 5.1) really that good? It is crushing Opus 4.5 in these benchmarks, if that is true, I would have expected to read many articles on HN on how people flocked CC and Codex to use it.
A bit weird to be comparing it to Opus-4.5 when 4.7 was released...
Is this going to be an open weights model or not? The post doesn’t make it clear. It seems the weights are not available today, but maybe that’s because it’s in preview?
i dont get why people defend $200/month models against open source model that cost 1/10 of the price, like literally
I think the benchmarks and numbers need to be easier to read. Those benchmarks are useless to the regular consumer.
I have the M3 Max MBP with 128 GB of memory and the 40 core GPU. What's the best local model I can run today for coding?
This thing on Celebras is going to be ridiculous.
why do people continue to benchmark their sota models against older models.
Very impressive!
I am trying since one week to subscribe Alibaba Coding Plan (to use Qwen 3.6 Plus) but it's always out of stock.
They brag about Qwen but don't let people use it.
ToKeN PrIcEs ArE gOiNg tO PluMmEt, InTelLigEnCe WiLl Be AfForDaBlE FoR EvErYOnE