I find it quite funny that this blog post has a big "Ask ChatGPT" box at the bottom. You might think you could ask a question about the contents of the post, so you type "summarise this blog post". It opens a new chat window with a link to the post followed by "summarise this blog post", only to be told: "I can't access external URLs directly, but if you can paste the relevant text or describe the content you're interested in from the page, I can help you summarize it. Feel free to share!"
That's hilarious. Does OpenAI even know this doesn't work?
__jl__
What a model mess!
OpenAI now has three price points: GPT 5.1, GPT 5.2, and now GPT 5.4. Their version numbers jump across different model lines, with Codex at 5.3 and what they now call Instant also at 5.3.
Anthropic is really the only one that has managed to get this under control: three models, priced at three different levels. New models are immediately available everywhere.
Google essentially only has Preview models! The last GA release is 2.5. As a developer, I can either use an outdated model or have no assurance that the model won't be discontinued within weeks.
minimaxir
The marquee feature is obviously the 1M context window, compared to the ~200k that other models support, sometimes with an extra cost for generations beyond 200k tokens. Per the pricing page, there is no additional cost for tokens beyond 200k: https://openai.com/api/pricing/
Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output) and Opus has a penalty for its beta >200k context window.
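Plugging the rates quoted above into a quick back-of-the-envelope comparison (the 300k-in/10k-out request size is a made-up example, and Opus's >200k surcharge, whose exact rate isn't quoted here, is ignored):

```python
def request_cost(input_tokens, output_tokens, in_price, out_price):
    # Prices are quoted per million tokens.
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical 300k-input / 10k-output request at the quoted rates
gpt54 = request_cost(300_000, 10_000, 2.50, 15)   # GPT-5.4
opus46 = request_cost(300_000, 10_000, 5.00, 25)  # Opus 4.6 (base rate)
print(f"GPT-5.4: ${gpt54:.2f}, Opus 4.6: ${opus46:.2f}")
# prints: GPT-5.4: $0.90, Opus 4.6: $1.75
```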
I am skeptical whether the 1M context window will provide material gains, as current Codex/Opus show weaknesses once their context windows are mostly full, but we'll see.
I've only used 5.4 for one prompt so far (edit: three at high now), with reasoning set to extra high (it took really long); the task was to analyse my codebase and write an evaluation on a topic. I found its writing and analysis thoughtful, precise, and surprisingly clearly written, unlike 5.3-Codex. It feels very lucid and uses human phrasing.
It might be my AGENTS.md requiring clearer, simpler language, but at least 5.4 is doing a good job of following the guidelines. 5.3-Codex wasn't so great at simple, clear writing.
kgeist
>Today, we’re releasing <..> GPT‑5.3 Instant
>Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking),
>Note that there is not a model named GPT‑5.3 Thinking
They held out for eight months without a confusing numbering scheme :)
gavinray
The "RPG Game" example on the blog post is one of the most impressive demos of autonomous engineering I've seen.
It's very similar to "Battle Brothers", and the fact that RPG games require art assets, AI for enemy moves, and a host of other logical systems makes it all the more impressive.
mattas
"GPT‑5.4 interprets screenshots of a browser interface and interacts with UI elements through coordinate-based clicking to send emails and schedule a calendar event."
They show an example of 5.4 clicking around in Gmail to send an email.
I still think this is the wrong interface for interacting with the internet. Why not use the Gmail APIs? No need for any screenshot interpretation or coordinate-based clicking.
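For what it's worth, the API route suggested here is essentially one call once OAuth is set up. A minimal sketch (the authorized `service` object from google-api-python-client is assumed and not shown; addresses are placeholders):

```python
import base64
from email.message import EmailMessage

def build_raw_message(sender, to, subject, body):
    # Gmail's messages.send endpoint takes the full RFC 2822 message,
    # base64url-encoded, in the "raw" field: no screenshots or
    # coordinate-based clicking involved.
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)
    return base64.urlsafe_b64encode(msg.as_bytes()).decode()

raw = build_raw_message(
    "agent@example.com", "user@example.com", "Status update", "Done."
)
# With an authorized google-api-python-client service object:
#   service.users().messages().send(userId="me", body={"raw": raw}).execute()
```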
Chance-Device
I’m sure the military and security services will enjoy it.
Alifatisk
So let me get this straight: OpenAI previously had an issue with lots of different models and versions being available. They solved this by introducing GPT-5, which was more like a router that put all these models under the hood, so you only had to prompt GPT-5 and it would route to the most suitable model. This worked great, I assume, and made the UI comprehensible for the user. But now they are starting to introduce more different models again?
We got:
- GPT-5.1
- GPT-5.2 Thinking
- GPT-5.3 (codex)
- GPT-5.3 Instant
- GPT-5.4 Thinking
- GPT-5.4 Pro
Who’s to blame for this ridiculous path they are taking? I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.
The good news here is the support for a 1M context window; it has finally caught up to Gemini.
smoody07
Surprised to see every chart limited to comparisons against other OpenAI models. What does the industry comparison look like?
zone411
Results from my Extended NYT Connections benchmark:
GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).
GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).
GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).
I no longer want to support OpenAI at all, regardless of benchmarks or real-world performance.
nickysielicki
Can anyone compare the $200/mo Codex usage limits with the $200/mo Claude usage limits? It's extremely difficult to get a feel for whether switching between the two will result in hitting limits more or less often, and it's difficult to find discussion online about this.
In practice, if I buy the $200/mo Codex plan, can I basically run 3 Codex instances simultaneously in tmux, like I can with Claude Code Pro Max, all day every day, without hitting limits?
yanis_t
These releases are lacking something. Yes, they optimised for benchmarks, but it’s just not all that impressive anymore.
It is time for a product, not for a marginally improved model.
consumer451
I am very curious about this:
> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt, using Playwright Interactive for browser playtesting and image generation for the isometric asset set.
Is "Playwright Interactive" a skill that takes screenshots in a tight loop with code changes, or is there more to it?
> Steerability: Similarly to how Codex outlines its approach when it starts working, GPT‑5.4 Thinking in ChatGPT will now outline its work with a preamble for longer, more complex queries. You can also add instructions or adjust its direction mid-response.
This was definitely missing before, and a frustrating difference when switching between ChatGPT and Codex. Great addition.
jryio
1 million tokens is great until you notice the long context scores fall off a cliff past 256K and the rest is basically vibes and auto compacting.
motbus3
Sam Altman can keep his model to himself. Not doing business with mass murderers.
senko
Just tested it with my version of the pelican test: a minimal RTS game implementation (zero-shot in codex cli): https://gist.github.com/senko/596a657b4c0bfd5c8d08f44e4e5347... (you'll have to download and open the file, sadly GitHub refuses to serve it with the correct content type)
This is on the edge of what the frontier models can do. For 5.4, the result is better than 5.3-Codex and Opus 4.6. (Edit: nowhere near the RPG game from their blog post, which was presumably much more specced out and used better engineering setup).
I also tested it with a non-trivial task I had to do on an existing legacy codebase, and it breezed through a task that Claude Code with Opus 4.6 was struggling with.
I don't know when Anthropic will fire back with their own update, but until then I'll spend a bit more time with Codex CLI and GPT 5.4.
hmokiguess
They hired the dude from OpenClaw, and they've had Jony Ive for a while now. Give us something different!
daft_pink
I’ve officially got model fatigue. I don’t care anymore.
A bit concerning that we see significantly worse results in some cases when thinking is enabled, especially for math, but also in the browser agent benchmark.
Not sure if this is more concerning for the test-time compute paradigm or for the underlying model itself.
Maybe I'm misunderstanding something, though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.
Not the best pelican compared to Gemini 3.1 Pro, but I'm sure it does remarkably better at coding or Excel, given those are part of its measured benchmarks.
bazmattaz
Anyone else feel that it's exhausting keeping up with the pace of new model releases? I swear every other week there's a new release!
dandiep
Anyone know why OpenAI hasn't released a new model for fine-tuning since 4.1? Next month it will be a year since their last fine-tuning model update.
jcmontx
5.4 vs 5.3-Codex? Which one is better for coding?
atkrad
What is the main difference between this version and the previous one?
butILoveLife
Anyone else completely uninterested? Since GPT-5, it's been cost-cutting measure after cost-cutting measure.
I imagine they added a feature or two, and the router will continue to give people 70B-parameter-like responses when they don't ask math or coding questions.
paxys
"Here's a brand new state-of-the-art model. It costs 10x more than the previous one because it's just so good. But don't worry, if you don't want all this power you can continue to use the older one."
A couple months later:
"We are deprecating the older model."
Aldipower
So did they raise the ridiculously small "per tool call token limit" when working with MCP servers? This makes Chat useless... I don't care, but my users do.
Inline poll: what reasoning levels do you work with?
This is becoming increasingly unclear to me, because the more interesting work will be the agent going off for 30+ minutes on high or extra high (it's mostly one of the two), and that's a long time to wait and an unfeasible amount of code to A/B.
melbourne_mat
Quick: let's release something new that gives the appearance that we're still relevant
ltbarcly3
Not a single comparison between 5.4 and Gemini or Claude. OpenAI continues to fall further behind.
GPT is not even close to Claude in terms of responding to BS.
7777777phil
An 83% win rate over industry professionals across 44 occupations.
I'd believe it on those specific tasks. Near-universal adoption in software still hasn't moved DORA metrics. The model gets better every release. The output doesn't follow. I just had a closer look at those productivity metrics this week: https://philippdubach.com/posts/93-of-developers-use-ai-codi...
alpineman
No thanks. Already cancelled my sub.
OsrsNeedsf2P
Does anyone know what website is the "Isometric Park Builder" shown off here?
strongpigeon
It's interesting that they charge more for the >200k token window, but the benchmark score seems to drop significantly past that. That's judging from the Long Context benchmark score they posted, but perhaps I'm misunderstanding what it implies.
cj
I use ChatGPT primarily for health-related prompts: looking at bloodwork, playing doctor for diagnosing minor aches and pains from weightlifting, etc.
Interestingly, the "Health" category seems to report worse performance compared to 5.2.
iamronaldo
Notably, 75% on OSWorld, surpassing humans at 72% (OSWorld measures how well models use operating systems).
bob1029
I was just testing this with my Unity automation tool, and the performance uplift from 5.2 seems to be substantial.
motza
No doubt this was released early to ease the bad press.
swingboy
Even with the 1M context window, it looks like these models drop off significantly at about 256k. Hopefully improving that is a high priority for 2026.
nthypes
$30/M input and $180/M output tokens is nuts. Ridiculously expensive for not that great a bump in intelligence compared to other models.
brcmthrowaway
How much of LLM improvement comes from regular ChatGPT usage these days?
vicchenai
Honestly at this point I just want to know if it follows complex instructions better than 5.1. The benchmark numbers stopped meaning much to me a while ago - real usage always feels different.
beernet
Sam really fumbled the top position in a matter of months, and spectacularly so. Wow. It appears that people are much more excited by Anthropic and Google releases, and there are good reasons for that which were absolutely avoidable.
gigatexal
Is it any good at coding?
woeirua
Feels incremental. Looks like OpenAI is struggling.
world2vec
Benchmarks barely improved it seems
throwaway5752
Does this model autonomously kill people without human approval or perform domestic surveillance of US citizens?
thefounder
Is it just me, or is the price for 5.4 Pro just insane?
ilaksh
Remember when everyone was predicting that GPT-5 would take over the planet?
koakuma-chan
Anyone else getting artifacts when using this model in Cursor?
Now with more and improved domestic espionage capabilities
OutOfHere
What is with the absurdity of skipping "5.3 Thinking"?
lostmsu
What is Pro exactly and is it available in Codex CLI?
HardCodedBias
We'll have to wait a day or two, maybe a week or two, to determine if this is more capable in coding than 5.3, which seems to be the economically valuable capability at this time.
In terms of writing and research, even Gemini, with a good prompt, is close to usable. That's likely not a differentiator.
oytis
Everyone is mindblown in 3...2...1
wahnfrieden
No Codex model yet
tmpz22
Does this improve Tomahawk Missile accuracy?
ignorantguy
It shows a 404 as of now.
iamleppert
I wouldn't trust any of these benchmarks unless they are accompanied by some sort of proof other than "trust me bro". Also not including the parameters the models were run at (especially the other models) makes it hard to form fair comparisons. They need to publish, at minimum, the code and runner used to complete the benchmarks and logs.
Not including the Chinese models is also obviously done to make it appear like they aren't as cooked as they really are.
Per the updated docs (https://developers.openai.com/api/docs/guides/latest-model), it supersedes GPT-5.3-Codex, which is an interesting move.
The actual card is here: https://deploymentsafety.openai.com/gpt-5-4-thinking/introdu... (the link currently goes to the announcement).
If you don't want to click in, here's an easy comparison with the other two frontier models: https://x.com/OpenAI/status/2029620619743219811?s=20
Article: https://openai.com/index/introducing-gpt-5-4/
gpt-5.4
Input: $2.50 /M tokens
Cached: $0.25 /M tokens
Output: $15 /M tokens
---
gpt-5.4-pro
Input: $30 /M tokens
Output: $180 /M tokens
Wtf
I think the most exciting change announced here is the use of tool search to dynamically load tools as needed: https://developers.openai.com/api/docs/guides/tools-tool-sea...
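The general idea (sketched generically below, with illustrative names; this is not OpenAI's actual API, which is documented at the link) is that instead of sending every tool schema with each request, you keep a searchable index of tool descriptions and attach only the tools relevant to the current query:

```python
# Toy registry of tools; descriptions are what gets searched.
TOOLS = {
    "get_weather": "Look up the current weather forecast for a city",
    "send_email": "Send an email message to a recipient",
    "query_calendar": "Search calendar events and schedule meetings",
}

def search_tools(query, tools, top_k=1):
    # Naive relevance score: count of shared lowercase words between the
    # query and each tool description. Real systems would use embeddings.
    q = set(query.lower().split())
    scored = sorted(
        tools,
        key=lambda name: len(q & set(tools[name].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

# Only the matching tool schema would then be sent with the request.
print(search_tools("what is the weather forecast in Paris?", TOOLS))
# prints: ['get_weather']
```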
Beat Simon Willison ;)
https://www.svgviewer.dev/s/gAa69yQd
Seems to be quite similar to 5.3-codex, but somehow almost 2x more expensive: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...
I only want to see how it performs on the Bullshit-benchmark https://petergpt.github.io/bullshit-benchmark/viewer/index.v...
What is the point of GPT Codex?
More discussion here on the blog post announcement which has been confusingly penalized by Hacker News's algorithm: https://news.ycombinator.com/item?id=47265005
some sloppy improvements
Wow insane improvements in targeting systems for military targets over children