The last six months in LLMs in five minutes

hollowturtle

> The coding agents got really good

Ever since November 2025, the so-called "inflection point", I've been wondering for whom coding agents became "really good".

All I observe is that they got better at tool calls and at answering questions about big codebases, especially if the question has a vague pattern to search for, and they're super useful for that! But for generating production code, even with a lot of steering and babysitting?

Absolutely not. Not quite there, not even close, in my experience.

But we should stop talking in 1s and 0s, especially with the marketing hype trains. There exists a gradient of capabilities that agents have, one that really depends on the intricacies of the codebase you're working on. I think everyone has yet to discover how to better apply these tools in their day-to-day work.

But that totally collides with the current narrative, which flattens our work into something that is always the same and can be automated easily in every case. It's not!

That's why the debate is so polarized imo: there isn't a shared experience.

jimbobthemighty

I asked Gemini for a video of 'pelican riding a unicycle in hyde park' - I was blown away by the output:

https://gemini.google.com/share/55e250c99693

wewewedxfgdf

Does this guy have a "publish to front page of HN" button on his blog editor?

Insanity

I wonder how much the 'inflection point' is a thing vs marketing. I'm sure the models got somewhat better, but even now when I'm trying to 'vibe code' a game with the latest models (combination of Codex w/ gpt5.5 and gpt5.3-codex), they really do struggle.

They definitely get something barebones up and running, but it's far from a fully fledged application.

eloisant

About Pelicans on bicycles:

> there’s zero chance any AI lab would train a model for such a ridiculous task

Well, I think this guy's tests have got enough visibility that I wouldn't be surprised if some AI models are trained on it specifically...

LZ_Khan

I'm curious how the 6 months have looked from a non-programmer's perspective. What kind of co-working tools and similar optimizations have people from other fields experienced?

zarzavat

Somewhere right now some human artist is being tasked with drawing illustrations of pelicans riding bicycles to be used as training data at a big AI lab.

shepherdjerred

> and there’s zero chance any AI lab would train a model for such a ridiculous task.

I'm not sure that's true anymore considering how popular Simon's blog is

pineapple_opus

All I see is mentions of how various models generate images of "pelican riding bicycle(s)".

tptacek

If you're a vulnerability researcher or a security person generally, there's a big inflection point from Spring of this year.

pr337h4m

Something that’s largely been ignored: DeepSeek has made context caching virtually free with V4-Flash.

throwaway2027

December 2025 was the breakthrough for me. January Claude was euphoric, ChatGPT was up there. February Gemini cooked for a second there. March amazing. April the big bad nerf. May GPT 5.5 is just pure bliss, although with 2x limits temporarily. Not sure about Claude: it's sort of okay, still not as good as it felt before, slowly increasing limits with more compute and rebuilding goodwill.

grey-area

Haven’t noticed much significant progress in LLMs myself in 6 months (significant as in new or vastly improved capabilities or understanding, not new releases, there are plenty of those).

I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.

Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.

So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.

https://github.com/openclaw/openclaw/pulse?period=daily

279 commits to main from 77 authors in the last 24 hours.

Why is there so much churn and how could you trust it with your data? These are changes in ONE day!

If these are useful changes, surely it’d be superhuman by now given months of this pace.

What are people using this for?

ionwake

Why is there no talk about how the world is already run by AI by proxy? I.e. bureaucrats using ChatGPT to make their speeches, decisions, shopfront designs, etc. I just don't seem to read about this; instead it's more about this nebulous specific date in the future.

ramon156

> Google released the Gemma 4 series of models, which are the most capable open weight models I’ve seen from a US company.

Implying another country has a better model? I'm being pokey here because I'm very curious! I know Gemma is efficient, but I also remember Qwen and Kiwi being referred to as optimized. The difference being that Gemma uses fewer tokens, but maybe Qwen/Kiwi's quality is higher? I don't know.

hansmayer

TL;DR:

"Coding agents got really good - here, a bunch of non-relevant slop pictures of pelicans riding bikes as a key benchmark AND a couple of hardly relevant edge-case demo projects of mine to prove it right!"

Come on man, where is the AI writing all the code in 6 months? We're close to June and Amodei's latest statement from January does not look like it will be fulfilled over the next few weeks, does it now?

LarsDu88

My goal post for "AI will definitely replace most SWEs" was to reproduce a particular 90s programming game one shot and then add multiplayer support with minimal prompting.

Opus 4.5 hit that point in November.

vishal_new

What are your thoughts on software engineer replacement? My team has already seen big reductions. The Q/A team is gone. Software engineers reduced by a third. Scared for the future.

rTX5CMRXIfFG

Am I crazy, or are these differences between the best models so marginal that you’d get roughly the same performance if you use the same high-quality harness (ie preloaded instructions from md files, including custom skills)?

bob1029

It definitely seems like the point of no return has been passed.

The size of the codebase doesn't matter anymore. In fact, I am finding that the larger the codebase the better the performance. Starting from scratch with vague ambition is not the same as solving a specific stack trace over a mountain of decade-old code. The latter performs better and is also more exciting for the business. It would seem more callers = more constraints to verify against.

For the last 3 months I've felt like I've been dropping gps guided bombs from orbit. No one can tell the difference between AI authored and my hand written code, other than via the implication of the radically increased daily work volume. There's definitely AI in there, but it's like a homogeneous cybernetic blend of my work and the computer's. I own all of it, can explain all of it immediately, but I only wrote maybe 10% of it by hand.

The development team should be mostly "solved" by now with regard to the AI transformation. If you are still at Home Depot picking out your proverbial hammer, it's time to start heading for the self checkout. The rest of the business is where the real money and headlines will be made at this point. AI writing code is ancient news now. Custom harnesses that business people can use to automate workflows will print a lot more money. Bringing some bacon to the rest of the business may also help to preserve your career path in these uncertain times.

Remember what Jobs said about the customer. A lot of times, people don’t know what they want until you show it to them. Most people wouldn't have believed the iPhone was even remotely possible until the moment it was publicly revealed and made available for purchase. I am finding the same effect in the business with AI. What it can actually do when well engineered and applied to the domain will usually outperform the expectations of its users by a wide margin. All these fears about alignment, hallucinations, cost, ethics, the environment, my ego/career, etc., seem to melt away like some kind of luxurious chocolate once the performance becomes clear to the executive staff. I was able to convince the board with an unsolicited, 5 minute demo I didn't even personally deliver. I've never seen these people sign contracts so quickly.

dnnddidiej

Also LinkedIn wars of people trying to claim throne as most AI-pilled, throwing down strawmen stories of luddites yelling at data centres who'll lose their job to a single person doing 100x work.

inglor_cz

"there’s zero chance any AI lab would train a model for such a ridiculous task"

Hmm, given how small the nerd community is and how often I have come across that task, either on Hacker News or on various Substacks, I am not so sure that the AI labs would ignore it completely.

bunzee

Spot on. Building our tool, we found AI is magic at scraping competitor data, but terrible at market validation. The 'why' is strictly human.

ex-aws-dude

Is the RLVR the key breakthrough for the uplift or is there more to it?

Does that suggest the uplift was only for things that are easily verifiable like code?

bradley13

As someone who uses AI daily (not in agent mode, just user-interactive), I have definitely noticed major quality improvements over the past few months. And that's surprising, because when you use something daily, you tend to overlook the big jumps.

I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.

DeathArrow

Apart from GLM 5.1 and Qwen 3.6, there are other Chinese models that are noteworthy: Kimi K2.6, Xiaomi MiMo V2.5 Pro, Deepseek v4 and MiniMax M2.7.

bluegatty

'Producing Images' or even 'Some Code that is Valid and Compiles' is in some ways one of the most misleading ways we assess quality of the AI.

It is getting very good at producing code that compiles - at the algorithmic level.

This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.

But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.

Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:

-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-

It just knows how to 'incant' the duck.

This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.

This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding' rather it's 'synthesizing' from patterns.

We already kind of knew that - but we have not yet built an intuition for that until now.

Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise.

This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.

In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.

LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Human Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.

It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.

We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.

I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.

But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.

This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.

Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.

The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.

gib444

Starting from zero today, how would someone quickly get up to speed with the latest and greatest AI tooling on an extremely limited budget?

Is the only choice to pay for the "max" plans?

Or just read so much about it that you bs your way through an interview and then use the company's resources?

Simon, I'm curious too how much you invest each month researching all the latest and greatest AI tech?

tayo42

The claw thing really came and went fast lol

DeathArrow

I think that there's a lot to be improved in harnesses and the way the models are interacting with harnesses. For example, the harness should be able to steer the model when thinking.

aizk

I'm so glad Simon is documenting this. The field is evolving so fast, so rapidly, so hungry for data and money, that few are willing to zoom out and document everything big picture so we can see the changes over time. I mean do you guys remember "Do anything now"? Just a distant memory, a funny party trick.

iekekke

It’s good to see dates being hard coded re: improvements in the models that should deliver material gains.

As time progresses one now has a yardstick to measure progress against. No more excuses - show me the money baby.

bb88

I met Simon for the first time this year at PyCon. Wow, what a great guy.

jrowen

There's something fitting about the mystical nature of LLMs and scrolling through a bunch of goofy pelicans on bicycles representing report cards for the bleeding edge of technology.

How are these even graded? Qwen3.6-35B-A3B gets high marks for a pelican with a gaping hole in its bill?

edit: Just noticed its feet are disconnected from its legs as well (but right on the pedals!). Pardon my French but that's Chinese af.

zkmon

What real world problem is closely linked to the skill of drawing a pelican riding a bicycle?