While GLM-5 seems impressive, this release also included lots of new cool stuff!
> GLM-5 can turn text or source materials directly into .docx, .pdf, and .xlsx files—PRDs, lesson plans, exams, spreadsheets, financial reports, run sheets, menus, and more.
A new type of model has joined the series, GLM-5-Coder.
GLM-5 was trained on Huawei Ascend. The last time DeepSeek tried to use these chips it flopped and they went back to Nvidia; this time it looks like a success.
Looks like they also released their own agentic IDE, https://zcode.z.ai
I don't know if anyone else knows this, but Z.ai also released new tools besides the Chat! There's Zread (https://zread.ai), OCR (seems new? https://ocr.z.ai), GLM-Image gen (https://image.z.ai) and voice cloning (https://audio.z.ai)
If you go to chat.z.ai, there is a new toggle in the prompt field that lets you switch between chat and agentic mode. It is only visible when you switch to GLM-5.
Very fascinating stuff!
Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07...
Solid bird, not a great bicycle frame.
I've been using GLM 4.7 with opencode.
It is for sure not as good, but the generous limits mean that for a price I can afford I can use it all day, and that is a game changer for me.
I can't use this model yet as they are slowly rolling it out but I'm excited to try it.
The benchmarks are impressive, but they compare against last-generation models (Opus 4.5 and GPT-5.2). The newer competitor models are recent, but there would easily have been enough time to re-run the benchmarks and update the press release by now.
Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.
It's live on OpenRouter now.
On my personal benchmark it's bad. So far that benchmark has been a really good indicator of instruction following and agentic behaviour in general.
For those who are curious, the benchmark is just the model's ability to follow a custom tool-calling format. I ask it to do coding tasks using chat.md [1] plus MCPs, and so far it's just not able to follow the format at all.
[1] https://github.com/rusiaaman/chat.md
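For a sense of what that kind of check looks like, here is a minimal sketch in Python. The wrapper syntax and the follows_format helper are invented for illustration; they are not the actual chat.md format, just the general idea of grading a reply on whether it uses a required custom tool-call wrapper.

```python
import re

# Hypothetical wrapper syntax, invented for illustration (not the real chat.md
# format): tool calls must appear inside <tool_call> ... </tool_call> tags.
TOOL_CALL = re.compile(r"<tool_call>\s*(\w+)\((.*?)\)\s*</tool_call>", re.DOTALL)

def follows_format(reply: str) -> bool:
    """True if the reply contains at least one well-formed tool-call block."""
    return TOOL_CALL.search(reply) is not None

print(follows_format("<tool_call>read_file(path='main.py')</tool_call>"))  # True
print(follows_format("I'll just call read_file(main.py) inline."))         # False
```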
Been using GLM-4.7 for a couple of weeks now. Anecdotally, it's comparable to Sonnet, but requires a little more instruction and clarity to get things right. For bigger, complex changes I still use Anthropic's family, but for very concise and well-defined smaller tasks the price of GLM-4.7 is hard to beat.
Been playing with it in opencode for a bit and I'm pretty impressed so far. Certainly more of an incremental improvement than a big-bang change, but it does seem a good bit better than 4.7, which in turn was a modest but real improvement over 4.6.
Certainly seems to remember things better and is more stable on long running tasks.
I got fed up with GLM-4.7 after using it for a few weeks; it was slow through z.ai and not as good as the benchmarks led me to believe (especially with regard to instruction following), but I'm willing to give it another try.
So that was Pony Alpha (1). Now what's Aurora Alpha?
(1) https://openrouter.ai/openrouter/pony-alpha
What is truly amazing here is that they trained this entirely on Huawei Ascend chips, per reporting [1]. Hence we can conclude the semiconductor-to-model Chinese tech stack is only about 3 months behind the US, considering Opus 4.5 was released in November (excluding the lithography equipment here, as SMIC still uses older ASML DUV machines). This is huge, especially since just a few months ago it was reported that DeepSeek was not using Huawei chips due to technical issues [2].
US attempts to contain Chinese AI tech have totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese government called the American bluff and now actively disallows imports of Nvidia chips as a direct result of past sanctions [3], at a time when the Trump admin is trying to do whatever it can to reduce the US trade imbalance with China.
[1] https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r...
[2] https://www.techradar.com/pro/chaos-at-deepseek-as-r2-launch...
[3] https://www.reuters.com/world/china/chinas-customs-agents-to...
I am using it with Claude Code and so far so good. Can't tell if it's as good as Opus 4.6 or not yet.
GLM-4.7-Flash was the first local coding model that I felt was intelligent enough to be useful. It feels something like Claude 4.5 Haiku at a parameter size where other coding models are still getting into loops and making bewilderingly stupid tool calls. It also has very clear reasoning traces that feel like Claude, which does result in the ability to inspect its reasoning to figure out why it made certain decisions.
So far I haven't managed to get comparably good results out of any other local model including Devstral 2 Small and the more recent Qwen-Coder-Next.
If you're tired of cross-referencing the cherry-picked benchmarks, here's the geometric mean of SWE-bench Verified & HLE-tools:
Claude Opus 4.6: 65.5%
GLM-5: 62.6%
GPT-5.2: 60.3%
Gemini 3 Pro: 59.1%
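For anyone wanting to reproduce or extend this, the combined figure is just the square root of the product of the two benchmark scores. The sub-scores in the sketch below are hypothetical placeholders, not the actual published SWE-bench Verified / HLE-tools numbers; only the calculation is the point.

```python
from math import prod

def geometric_mean(scores):
    """Geometric mean of benchmark scores expressed as fractions in [0, 1]."""
    return prod(scores) ** (1 / len(scores))

# Hypothetical sub-scores (placeholders, not the real published numbers):
# e.g. 70% on SWE-bench Verified and 56% on HLE-tools combine to ~62.6%.
print(f"{geometric_mean([0.70, 0.56]):.1%}")
```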
It might be impressive on benchmarks, but there's just no way for them to break through the noise from the frontier models. At these prices they're just hemorrhaging money. I can't see a path forward for the smaller companies in this space.
What I haven't seen discussed anywhere so far is how big a lead Anthropic seems to have in intelligence per output token, e.g. if you look at [1].
We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors.
I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging and releasing models with artificially capped reasoning to be at a similar level to their competitors?
How do you read this?
[1] https://imgur.com/a/EwW9H6q
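As a rough illustration of what "scales with the log of tokens" means in practice, here is a tiny sketch with made-up numbers; the token budgets and scores below are invented, not taken from the chart in [1].

```python
import numpy as np

# Made-up (reasoning-token budget, benchmark score) pairs, purely illustrative.
tokens = np.array([1_000, 2_000, 4_000, 8_000, 16_000])
scores = np.array([40.0, 46.0, 52.5, 58.0, 64.5])

# Fit score ~= a + b * ln(tokens); np.polyfit returns (slope, intercept).
b, a = np.polyfit(np.log(tokens), scores, 1)
print(f"score ~= {a:.1f} + {b:.1f} * ln(tokens)")
print(f"gain per doubling of reasoning budget ~= {b * np.log(2):.1f} points")
```

Under a fit like this, getting the same score with fewer output tokens (the point about Anthropic) shows up as a higher intercept rather than a bigger reasoning budget.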
Here is the pricing per M tokens. https://docs.z.ai/guides/overview/pricing
Why is GLM 5 more expensive than GLM 4.7 even when using sparse attention?
There is also a GLM-5-Coder model.
I'd say that they're super confident about the GLM-5 release, since they're directly comparing it with Opus 4.5 and don't mention Sonnet 4.5 at all.
I am still waiting to see if they'll launch a GLM-5 Air series, which would run on consumer hardware.
Interesting timing: GLM-4.7 was already impressive for local use on 24GB+ setups. Curious to see when the distilled/quantized versions of GLM-5 drop. The gap between what you can run via API vs. locally keeps shrinking. I've been tracking which models actually run well at each RAM tier, and the Chinese models (Qwen, DeepSeek, GLM) are dominating the local inference space right now.
Really impressive benchmarks. It was commonly stated that open source models were lagging 6 months behind state of the art, but they are likely even closer now.
Probably related: https://news.ycombinator.com/item?id=46974853
The benchmarks and pricing made me realize how good Kimi 2.5 is. I'm an Opus 4.6 person, but wow, it's almost 5x cheaper.
I kind of feel this benchmarking thing with Chinese models is like university Olympiads: they specifically study for those, but when it comes time for real-world work they seriously lag behind.
GLM 5 beats Kimi on SWE bench and Terminal bench. If it's anywhere near Kimi in price, this looks great.
Edit: Input tokens are twice as expensive. That might be a deal breaker.
They increased their prices substantially.
It will be tough to run on our 4x H200 node… I wish they had stayed around the 350B range. MLA will reduce KV cache usage, but I don't think the reduction will be significant enough.
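For a rough sanity check, here is a back-of-envelope sketch. Every number in it (parameter counts, layer count, MLA latent width, context length) is an assumption for illustration, not a published GLM-5 spec; the point is just that at this scale the weights, not the KV cache, are what blow the 4x H200 budget.

```python
# Back-of-envelope memory check for a 4x H200 node (4 * 141 GB of HBM).
# All model dimensions below are assumptions for illustration only.
HBM_GB = 4 * 141

def weights_gb(params_billions, bytes_per_param=1):
    """Weight memory in GB (defaults to FP8, i.e. 1 byte per parameter)."""
    return params_billions * bytes_per_param

def kv_cache_gb(ctx_tokens, layers, cached_dim_per_layer, bytes_per_elem=2):
    """KV cache for one sequence; with MLA only a small latent is cached per layer."""
    return ctx_tokens * layers * cached_dim_per_layer * bytes_per_elem / 1e9

kv = kv_cache_gb(ctx_tokens=128_000, layers=80, cached_dim_per_layer=576)
for params in (355, 700):  # hypothetical total parameter counts, in billions
    total = weights_gb(params) + kv
    verdict = "fits" if total < HBM_GB else "does not fit"
    print(f"{params}B params: ~{total:.0f} GB needed -> {verdict} in {HBM_GB} GB")
```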
Why are we not comparing to Opus 4.6 and GPT-5.3 Codex...
Honestly, these companies are so hard to take seriously with these release details. If it's an open-source model and you're only comparing against open source, cool.
If you're not top in your segment, maybe show how your token cost and output speed more than make up for that.
Purposely showing prior-gen models in your release comparison immediately discredits you in my eyes.
I predict a new speculative market will emerge where adherents buy and sell misween coded companies.
Betting on whether they can actually perform their sold behaviors.
Passing around code repositories for years without ever trying to run them, factory sealed.
The number of times in the past year that a competitor's benchmarks claimed something was close to Claude and it turned out to be remotely close in practice: 0.
Why don't they publish ARC-AGI results? Too expensive?
I wish China would start copying Demis' biotech models soon as well.
Is this a lot cheaper to run (on their service or rented GPUs) than Claude or ChatGPT?
We're seeing so many LLM releases that they can't even keep their benchmark comparisons updated.
Just tried it; it's practically the same as GLM-4.7. It isn't as "wide" as Claude or Codex, so even on a simple prompt it misses one important detail: instead of investigating fully before starting a project, it ploughs ahead with the next best thing it thinks you asked for.
Earlier: https://news.ycombinator.com/item?id=46974853
Whoa, I think GPT-5.3-Codex was a disappointment, but GLM-5 is definitely the future!