A Research Preview of Codex

465 points | 392 comments | a day ago
johnjwang

Some engineers on my team at Assembled and I have been part of the Codex alpha test, and I'll say it's been quite impressive.

We’ve long used local agents like Cursor and Claude Code, so we didn’t expect too much. But Codex shines in a few areas:

Parallel task execution: you can batch dozens of small edits (refactors, tests, boilerplate) and run them concurrently without context juggling. It's super nice to run a bunch of tasks at the same time, something that's really hard to do in Cursor, Cline, etc.

It kind of feels like a junior engineer on steroids: you just point it at a file or function, specify the change, and it scaffolds out most of a PR. You still need to do a lot of work to get it production-ready, but it's as if you now have an infinite number of junior engineers at your disposal, all working on different things.

Model quality is good, but it's hard to say it's that much better than other models. In side-by-side tests with Cursor + Gemini 2.5 Pro, naming, style, and logic are relatively indistinguishable, so quality meets our bar but doesn't yet exceed it.

nadis

In the preview video, I appreciated Katy Shi's comment: "I think this is a reflection of where engineering work has moved over the past [few years], where a lot of my time now is spent reviewing code rather than writing it."

Preview video from OpenAI: https://www.youtube.com/watch?v=hhdpnbfH6NU&t=878s

As I think about what "AI-native" software development, or just the future of building software, looks like, it's interesting to me that, right now, developers are still just reading code and tests rather than looking at simulations.

While a new(ish) concept for software development, simulations could provide a wider range of outcomes and, especially for the front end, are far easier to evaluate than code and tests alone. I'm biased because this is something I've been exploring, but it really hit me over the head looking at the Codex launch materials.

ofirpress

[I'm one of the co-creators of SWE-bench] The team managed to improve on the already very strong o3 results on SWE-bench, but it's interesting that we're just seeing an improvement of a few percentage points. I wonder if getting to 85% from 75% on Verified is going to take as long as it took to get from 20% to 75%.

solresol

I'm not sure what's wrong with me, but I just wasted several hours wrestling with Codex to make it behave.

Here's my workflow that keeps failing:
- it writes some code, and it looks good at first glance
- I push it to GitHub
- automated tests on GitHub show that there's a problem
- I go back to Codex and ask it to fix it
- it does stuff, and it looks good again

Now what do I do? If I ask it to push again to GitHub, it will often create a pull request that doesn't include the changes from the first pull request. But it isn't a pull request that stacks on top of the previous one; it's a pull request that stacks on top of main.

When asked to write something that called out to gpt-4.1-mini, it used openai.ChatCompletion.create (!?!!?)
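
For context, openai.ChatCompletion.create is the legacy pre-1.0 interface that the current OpenAI Python SDK no longer supports; a minimal sketch of the current client-based call (the prompt content here is made up purely for illustration) looks something like:

    # Legacy, removed in openai >= 1.0 (what Codex reportedly generated):
    # openai.ChatCompletion.create(model="gpt-4.1-mini", messages=[...])

    # Current client-based interface (openai >= 1.0):
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": "Say hello"}],
    )
    print(response.choices[0].message.content)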

I just found myself using Claude to fix Codex's mistakes.

ionwake

I'm sorry if I'm being silly, but I have paid for the Pro version ($200 a month), and every time I click on "Try Codex" it takes me to a pricing page with the "Team Plan": https://chatgpt.com/codex#pricing

Is this still rolling out? I don't need the Team plan too, do I?

I have been using OpenAI products for years now, and I am keen to try this, but I have no idea what I am doing wrong.

blixt

They mentioned "microVM" in the live stream. Notably, there's no browser or internet access. It makes sense: running specialized Firecracker/Unikraft/etc. microVMs is way faster and cheaper, so you can scale it up. But there will be a big jump in technical difficulty from this to "agents with their own computers". ChatGPT Operator already has a browser, so they definitely can do this, but I imagine the demand is orders of magnitude different.

There must be room for a Modal/Cloudflare/etc infrastructure company that focuses only on providing full-fledged computer environments specifically for AI with forking/snapshotting (pause/resume), screen access, human-in-the-loop support, and so forth, and it would be very lucrative. We have browser-use, etc, but they don't (yet) capture the whole flow.

ZeroCool2u

"23 SWE-Bench Verified samples that were not runnable on our internal infrastructure were excluded."

What does that mean? Surely this should have a bit more elaboration. If you're just excluding a double digit number of tasks in the benchmark as uncompleted, that should be reflected in the scores.
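
For a sense of scale: SWE-bench Verified has 500 tasks, so excluding 23 shrinks the denominator, and the same raw number of solved tasks reads as a higher percentage (the solved count below is made up purely to illustrate the effect):

    # Hypothetical solved count, just to show the denominator effect.
    solved = 360
    print(f"{solved / 500:.1%}")         # 72.0% over the full benchmark
    print(f"{solved / (500 - 23):.1%}")  # ~75.5% after excluding 23 tasks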

bearjaws

I spend maybe $20 a month on Claude Code; $200 a month is hard to swallow.

I don't understand why OAI puts their alpha-release products under a $200-a-month plan instead of just charging for tokens.

alvis

I used to work for a bank, and the legal team used to ping us to make tiny changes to the app for compliance-related issues. Now they can fix those themselves. I think they'd be very proud and happy.

zrg

I've been experimenting with providers offering similar functionality for the last year, and this Codex-like approach is a vastly superior experience compared to Cursor, Devin, etc.

CSMastermind

I really wish they'd add support for GitLab or, better, any arbitrary git repo.

bionhoward

What about privacy, training opt out?

What about using it for AI / developing models that compete with our new overlords?

Seems like using this is just asking to get rug-pulled for competing with them when they release something that competes with your thing. Am I just an old who's crowing about nothing? Is it OK for them to tell us we own outputs we can't use to compete with them?

kleiba

Just curious: is your company happy sharing their code-base with an AI provider? Or are you using a local installation?

haffi112

(watching live) I'm wondering how it performs on the METR benchmark (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...).

asdev

Is the point of this to actually assign tasks to an AI to complete end to end? Every task I do with AI requires at least some hand-holding, sometimes reprompting, etc., so I don't see why I would want to run tasks in parallel; I don't think it would increase throughput. Curious if others have had better experiences with this.

swisniewski

Has anyone else been able to get "secrets" to work?

They seem to be injected fine in the "environment setup" but don't seem to be injected when running tasks against the environment. This consistently repros even if I delete and re-create the environment and archive and resubmit the task.

yanis_t

So it looks like it only runs in the cloud, meaning it will push commits to my remote repo before I have a chance to see if it works?

When I'm using aider, after it makes a commit I immediately run git reset HEAD^ and then git diff (actually I use the GitHub Desktop client to see the diff) to evaluate exactly what it did and whether I like it. Then I usually make some adjustments, and only after that do I commit and push.

simianwords

I wonder if tools like these are best for semi-structured refactors like upgrading to Python 3, migrating to Postgres, etc.

SketchySeaBeast

Is this the same idea as when we switched to multicore machines? Has the rate of change in the capabilities of a single agent slowed enough that the only way for OpenAI to appear to be making decent progress is to have many?

bmcahren

So there's this thing called "Setup Scripts", but they don't explicitly say these are configured inside the Codex web interface (akin to AWS metadata), not via a setup.sh or a package.json preinstall declaration. I wasted several hours (and lots of compute, with Codex as confused as I was) trying to figure out how to convince Codex to pnpm install.

orliesaurus

Why hasn't GitHub released this? Why is it OpenAI releasing this?!

sudohalt

When it runs the code, I assume it does so via a Docker container; does anyone know how it is configured? Assuming the user hasn't specified an AGENTS.md file or a Dockerfile in the repo, does it generate one via an LLM based on the repo and what it thinks is needed? Does it use static analysis (package.json, requirements.txt, etc.)? Do they just have a super-generic Dockerfile that can handle most environments? A combination of different things?

tptacek

Maddening: "codex" is also the name of their open-source Claude-Code-alike, and was previously the name of an at-the-time frontier coding model. It's like they name things just to fuck with us.

asadm

Is there an open-source version of this? Something that essentially uses microVMs to git clone my repo, run codex-cli or an equivalent, and send me a PR.

I made one as a GitHub Action, but it's not as real-time and is 2 years old now: https://github.com/asadm/chota

hintymad

I remember HN had a recurring popular post on the most important data structures. They are all basic ones that a first-year college student can learn. The youngest one was the skip list, which was invented in 1990. When I was a student, my class literally read the original paper, implemented the data structure, and analyzed its complexity in our first data structures course.

This seems to imply that software engineering as a profession has been quite mature and saturated for a while, to the point that a model can predict most of the output. Yes, yes, I know there are thousands of advanced algorithms and amazing systems in production. It's just that the market does not need millions of engineers with such advanced skills.

Unless we get yet another new domain like the cloud or the internet, I'm afraid the core value of software engineers, trailblazing for new business scenarios, will continue to diminish and be marginalized by AI. As a result, we'll see far less demand for our jobs, and many of us will either take lower pay or be out of work for extended periods.

simianwords

Does anyone know how the quality drops with the size of the codebase?

fullstackchris

Reading these threads, it's clear to me that people are so cooked they no longer understand (or perhaps never did understand) how the simple process of sharing, building, and merging source code among multiple editors has ever worked.

tough

Can someone give me a test prompt to one-shot something in Go for testing?

(I'm trying something.)

What would be an impressive program that an agent should be able to one-shot in one go?

btbuildem

> To balance safety and utility, Codex was trained to identify and precisely refuse requests aimed at development of malicious software, while clearly distinguishing and supporting legitimate tasks.

I can't say I am a big fan of neutering these paradigm-shifting tools according to one culture's code of ethics / way of doing business / etc.

One man's revolutionary is another's enemy combatant and all that. What if we need top-notch malware to take down the robot dogs lobbing mortars at our madmaxian compound?!

alvis

Is it surprising? Hmm, perhaps not. But is it better than Cursor etc.? Hmm, perhaps that's the wrong question.

Feels like Codex is for product managers to fix bugs without touching any developer resources. Seen that way, it's insanely surprising!

colesantiago

I think the benchmark I'd like to see for these programming agents is an agent making a flawless PR or patch to the BSD or Linux kernel.

This should be possible today and surely Linus would also see this in the future.

scudsworth

Pleased to see a paragraph-long comment in the examples. Now that's good coding.

theappsecguy

I am so damn tired of all the AI garbage shoved down our throats every day. Can't wait for all of it to crash and burn.

adamTensor

Not buying Windsurf, then???

prhn

Is anyone using any of these tools to write non boilerplate code?

I'm very interested.

In my experience ChatGPT and Gemini are absolutely terrible at these types of things. They are constantly wrong. I know I'm not saying anything new, but I'm waiting to personally experience an LLM that does something useful with any of the code I give it.

These tools aren't useless. They're great as search engines and at pointing me in the right direction. They write dumb bash scripts that save me time here and there. That's it.

And it's hilarious to me how these people present these tools. It generates a bunch of code, and then you spend all your time auditing and fixing what is expected to be wrong.

That's not the type of code I'm putting in my company's code base, and I could probably write the damn code more correctly in less time than it takes to review it for the expected errors.

What am I missing?

skovati

I'm curious how many ICs are truly excited about these advancements in coding agents. It seems to me the general trend is that we become more like PMs, managing agents and reviewing PRs, all for the sake of productivity gains.

I imagine many engineers are like myself in that they got into programming because they liked tinkering and hacking and implementation details, all of which are likely to be abstracted over in this new era of prompting.

ilaksh

As someone who works on his own open source agent framework/UI (https://github.com/runvnc/mindroot), it's kind of interesting how announcements from vendors tend to mirror features that I am working on.

For example, in the last month or so, I added a job queue plugin. The ability to run multiple tasks that they demoed today is quite similar. The issue I ran into with users is that without Enterprise plans, complex tasks run into rate limits when trying to run concurrently.

So I am adding the ability to have multiple queues, each possibly using different models and/or providers, to get around rate limits.
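
A minimal sketch of that multi-queue idea, assuming asyncio with one queue and worker per provider (the provider names, run_task stub, and example tasks are hypothetical, not mindroot's actual implementation):

    import asyncio

    async def run_task(provider: str, task: str) -> None:
        # Stand-in for the real agent call against `provider`,
        # which would carry its own retry/backoff for rate limits.
        await asyncio.sleep(0.1)
        print(f"[{provider}] finished: {task}")

    async def worker(provider: str, queue: asyncio.Queue) -> None:
        # One worker per provider, so hitting a rate limit on one
        # provider doesn't stall tasks queued for another.
        while True:
            task = await queue.get()
            try:
                await run_task(provider, task)
            finally:
                queue.task_done()

    async def main() -> None:
        queues = {"openai": asyncio.Queue(), "anthropic": asyncio.Queue()}
        workers = [asyncio.create_task(worker(p, q)) for p, q in queues.items()]

        await queues["openai"].put("refactor module A")
        await queues["anthropic"].put("write tests for module B")

        await asyncio.gather(*(q.join() for q in queues.values()))
        for w in workers:
            w.cancel()

    asyncio.run(main())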

By the way, my system has features that are somewhat similar not only to this tool they are showing but also things like Manus. It is quite rough around the edges though because I am doing 100% of it myself.

But it is MIT Licensed and it would be great if any developer on the planet wanted to contribute anything.

RhysabOweyn

I believe that code from one of these things will eventually cause a disaster affecting the capital owners. Then all of a sudden you will need a PE license, ABET degree, 5 years working experience, etc. to call yourself a software engineer. It would not even be historically unique. Charlatans are the reason that lawyers, medical doctors, and civil engineers have to go through lots of education, exams, and vocational training to get into their profession. AI will probably force software engineering as a profession into that category as well.

On the other hand, if your job was writing code at certain companies whose profits were based on shoving ads in front of people then I would agree that no one will care if it is written by a machine or not. The days of those jobs making >$200k a year are numbered.

ianbutler

I'm super curious to see how this actually does at finding significant bugs. We've been working in this space on https://www.bismuth.sh for a while, and one of the things we're focused on is deep validation of the code being output.

There are so many of these "vibe coding" tools, and there has to be real engineering rigor at some point. I saw them demo "find the bug", but the bugs they found were pretty superficial, and that's something we've seen in our internal benchmark from both Devin and Cursor: a lot of noise, false positives, and superficial fixes.

tough

So I just upgraded to the Pro plan, yet https://chatgpt.com/codex doesn't work for me; it asks me to "try ChatGPT Pro" and shows me the upsell modal, even though I'm already on the higher tier.

sigh

energy123

Where can I read OpenAI's promise that it won't use the repos I upload for training?

DGAP

If you still don't think software engineering as a high-paying job is over, I don't know what to tell you.

show comments