mbh159

The methodology debate in this thread is the most important part.

The commenter who says "add obfuscation and success drops to zero" is right, but that's also the wrong framing imo. The experiment isn't claiming AI can defeat a competent attacker. It's asking whether AI agents can replicate what a skilled reverse engineering (RE) specialist does on an unobfuscated binary. That's a legitimate, deployable use case (internal audit, code review, legacy binary analysis) even if it doesn't cover adversarial-grade malware.

The more useful framing: what's the right threat model? If you're defending against script kiddies and automated tooling, AI-assisted RE might already be good enough. If you're defending against targeted attacks by people who know you're using AI detection, the bar is much higher and this test doesn't speak to it.

What would actually settle the "ready for production" question: run the same test with the weakest obfuscation that matters in real deployments (import hiding, string encoding), not adversarial-grade obfuscation. That's the boundary condition.
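
For concreteness, here is a toy sketch (purely illustrative, not from the article) of what the "string encoding" tier looks like. Even this trivial XOR scheme is enough to hide a telltale literal like "/bin/sh" from strings(1) and from naive pattern matching:

```python
# Toy "string encoding": the attacker ships the XOR-encoded blob in the
# binary and reconstructs the sensitive string only at call time.
KEY = 0x5A

def encode(s: str) -> bytes:
    return bytes(b ^ KEY for b in s.encode())

def decode(blob: bytes) -> str:
    return bytes(b ^ KEY for b in blob).decode()

ENCODED = encode("/bin/sh")   # this opaque blob is what ends up on disk
print(ENCODED)                # no readable string for a scanner to find
print(decode(ENCODED))        # "/bin/sh" only exists in memory at runtime
```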

7777332215

I know they said they didn't obfuscate anything, but if you hide imports/symbols and obfuscate strings, which is the bare minimum for any competent attacker, the success rate will immediately drop to zero.

This is just pattern-matching on anomalies in the code that are associated with malicious activity, which is not that impressive for an LLM.

akiselev

Shameless plug: https://github.com/akiselev/ghidra-cli

I’ve been using Ghidra to reverse engineer Altium’s file format (at least the Delphi parts) and it’s insane how effective it is. Models are not quite good enough to write an entire parser from scratch but before LLMs I would have never even attempted the reverse engineering.

I definitely would not depend on it for security audits but the latest models are more than good enough to reverse engineer file formats.

magicmicah85

GPT is impressive with a consistent 0% false positive rate across its models, yet its detection rate tops out at 18%. Meanwhile Claude Opus 4.6 is able to detect up to 46% of backdoors, but has a 22% false positive rate.

It would be interesting to have an experiment where these models attempt to actually exploit what they find, though their alignment may not allow that to happen. Perhaps combining models could get to that kind of testing: the better models identify suspects and write up "how to verify" tests, and the "misaligned" models actually carry out the testing and report back to the better models.
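
Something like this handoff, roughly (all the model and sandbox calls below are stubbed placeholders, not a real API, just to show the shape of it):

```python
# Two-stage sketch: a strong "analyst" model flags suspects and writes
# verification recipes; a separate executor (model or plain script) runs
# them in a sandbox and reports back. Everything here is a stub.

def ask_analyst_model(binary_path: str) -> list[dict]:
    """Stub: stronger model lists suspects plus a 'how to verify' recipe."""
    return [{
        "location": "parse_packet+0x41",
        "description": "auth bypass on magic header",
        "verification_steps": "send a packet with header 0xDEADBEEF, then "
                              "check whether a shell is spawned",
    }]

def run_verification(steps: str) -> dict:
    """Stub: the 'misaligned' executor carries out the recipe in a sandbox."""
    return {"exploited": False, "log": "no shell observed"}

def analyze_binary(binary_path: str) -> list[dict]:
    findings = []
    for suspect in ask_analyst_model(binary_path):                 # stage 1: flag
        result = run_verification(suspect["verification_steps"])   # stage 2: verify
        findings.append({**suspect,
                         "confirmed": result["exploited"],
                         "evidence": result["log"]})
    return findings

print(analyze_binary("./suspicious_daemon"))
```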

umairnadeem123

the opus 4.6 "found it and talked itself out of it" pattern is something i see constantly in my own work with these models. the best ones are smart enough to identify the signal but also smart enough to rationalize it away.

this is why human oversight in the loop matters so much even when using frontier models. the model is a hypothesis generator, not the decision maker. i've found the same thing building content pipelines -- the expensive models (opus 4.6 etc) genuinely produce better first-pass output, but you still can't trust them end to end. the moment you remove human review the quality craters in subtle ways you only notice later.

the multi-agent approach someone mentioned above (one model flags, another validates) is interesting but adds complexity. simpler to just have the human be the validation layer.

selridge

>While end-to-end malware detection is not reliable yet, AI can make it easier for developers to perform initial security audits. A developer without reverse engineering experience can now get a first-pass analysis of a suspicious binary. [...] The whole field of working with binaries becomes accessible to a much wider range of software engineers. It opens opportunities not only in security, but also in performing low-level optimization, debugging and reverse engineering hardware, and porting code between architectures.

THIS is the takeaway. These tools are allowing *adjacency* to become a powerful guiding indicator. You don't need to be a reverser; you can just understand how your software works and drive the robot as a fallible hypothesis generator in regions where you can validate only some of the findings.

folex

> The executables in our benchmark often have hundreds or thousands of functions — while the backdoors are tiny, often just a dozen lines buried deep within. Finding them requires strategic thinking: identifying critical paths like network parsers or user input handlers and ignoring the noise.

Perhaps it would make sense to provide LLMs with some strategy guides written in .md files.
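
Something as simple as prepending a versioned guide to the system prompt might help; a minimal sketch (the file path and prompt wording are made up):

```python
# Load a hunting-strategy guide from a .md file and put it ahead of the
# task prompt, so the agent starts from critical paths instead of noise.
from pathlib import Path

STRATEGY_FILE = Path("strategies/backdoor_hunting.md")  # hypothetical path

def build_system_prompt() -> str:
    strategy = STRATEGY_FILE.read_text() if STRATEGY_FILE.exists() else ""
    return (
        "You are auditing a stripped binary for hidden backdoors.\n"
        "Follow the strategy guide below before exploring functions at random.\n\n"
        + strategy
    )

print(build_system_prompt())
```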

EB66

The fact that Gemini returns the highest rate of false positives aligns with my experience using the Gemini models. I use ChatGPT, Claude and Gemini regularly and Gemini is clearly the most sycophantic of the three. If I ask those three models to evaluate something or estimate odds of success, Gemini always comes back with the rosiest outlook.

I had been searching for a good benchmark that provided some empirical evidence of this sycophancy, but I hadn't found much. Measuring false positives when you ask the model to complete a detection-related task may be a good way of doing that.

shevy-java

So the best one found about 50%. I think that is not bad, probably better than most humans. But what about the remaining 50%? Why were some found and others not?

> Claude Opus 4.6 found it… and persuaded itself there is nothing to worry about

> Even the best model in our benchmark got fooled by this task.

That is quite strange; it seems almost as if a human is required to make the AI tools understand this.

simianwords

I'm not an expert, but about the false positives: why not make the agent attempt to use the backdoor and verify that it is actually a backdoor? Maybe give it access to tools and so on.
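
A rough sketch of what such a tool could look like (a real setup would run this inside a container or seccomp sandbox with no network access, not a bare subprocess):

```python
# Give the agent a tool that runs the binary with a candidate trigger on
# stdin and reports what happened, so a suspected backdoor can be checked
# dynamically instead of being flagged on static evidence alone.
import subprocess

def try_trigger(binary: str, trigger: bytes, timeout: float = 5.0) -> dict:
    try:
        proc = subprocess.run([binary], input=trigger,
                              capture_output=True, timeout=timeout)
        return {"returncode": proc.returncode,
                "stdout": proc.stdout[:1000],
                "stderr": proc.stderr[:1000]}
    except subprocess.TimeoutExpired:
        # Hanging on one specific input (e.g. waiting for a remote
        # connection) is itself a signal worth reporting to the model.
        return {"returncode": None, "timed_out": True}

# Usage: the model proposes a trigger, the harness reports the behavior.
# try_trigger("./suspicious_daemon", b"MAGIC\x00backdoor-knock\n")
```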

greazy

Very nitpicky, but because I spend a lot of time plotting data: don't arbitrarily color the bar plots without at least mentioning the cutoffs. Why 19% is orange and 20% is green is a mystery.

manbash

RE agents are really interesting!

Too bad the author didn't share the agents they were using, so we can't really test this ourselves.

Tiberium

I highly doubt some of those results: GPT 5.2/Codex is incredible for cybersecurity and CTFs, and 5.3 Codex (not on the API yet) even more so. There is absolutely no way it's below DeepSeek or Haiku. Seems like a harness issue, or they tested those models at none/low reasoning?

snowhale

the false positive rate (28% on clean binaries) is the real problem here, not the 49% detection rate. if you're running this on prod systems you'd be drowning in noise. also the execl("/bin/sh") rationalization is a telling failure -- the model sees suspicious evidence and talks itself out of it rather than flagging for review.
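
back-of-envelope, assuming (just for illustration) that 1% of the binaries you scan actually contain a backdoor:

```python
# precision of the alerts under an assumed 1% base rate, using the
# detection and false positive rates quoted above
base_rate = 0.01   # assumed prevalence of backdoored binaries (illustrative)
tpr = 0.49         # detection rate
fpr = 0.28         # false positive rate on clean binaries

alerts_true = base_rate * tpr
alerts_false = (1 - base_rate) * fpr
precision = alerts_true / (alerts_true + alerts_false)

print(f"fraction of alerts that are real backdoors: {precision:.1%}")
# ~1.7% under these assumptions, i.e. roughly 57 false alarms per true hit
```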

Bender

Along this line, can AIs find backdoors spread across multiple pieces of code and/or services? i.e. by themselves they are not backdoors, and advanced penetration testers would not suspect anything is afoot, but when used together they provide access.

e.g. an intentional weakness in systemd + udev + binfmt magic when used together == authentication and mandatory access control bypass. Each weakness reviewed individually just looks like benign sub-optimal code.

wangzhongwang

This is a really cool experiment. What strikes me is how the results mirror what we see in software supply chain attacks too - the backdoor doesn't have to be clever, it just has to be buried deep enough that nobody bothers to look. 40MB is already past the threshold where most people would manually audit anything.

I wonder if a hybrid approach would work better: use AI to flag suspicious sections, then have a human reverser focus only on those. Kind of like how SAST tools work for source code - nobody expects them to catch everything, but they narrow down where to look.
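
The triage loop could be as simple as "score every function, hand the top N to a human." A minimal sketch (the scoring function below is a stub; in practice it would be an LLM pass over the decompiled code):

```python
# Hybrid triage sketch: score every decompiled function for suspicion,
# then hand only a small shortlist to the human reverser.

def score_function(name: str, decompiled: str) -> float:
    """Stub: 0-1 suspicion score; replace with an LLM call or heuristics."""
    hints = ("exec", "socket", "memcmp", "strcmp")
    return sum(h in decompiled for h in hints) / len(hints)

def triage(functions: dict[str, str], budget: int = 20) -> list[tuple[str, float]]:
    scored = [(name, score_function(name, code)) for name, code in functions.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:budget]   # shortlist for human review

# Usage: feed it {function_name: decompiled_source} exported from Ghidra/IDA.
# for name, score in triage(exported_functions):
#     print(f"{score:.2f}  {name}")
```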

nisarg2

I wonder how model performance would change if the tooling included the ability to interact with the binary and validate the backdoor. Particularly for models that had a high rate of false positives, would they test their hypothesis?

BruceEel

Very, very cool. Besides the top-performing models, it's interesting (if I'm reading this correctly) that gpt-5.2 did ~2x better than gpt-5.2-codex... why?

dgellow

Random thoughts, only vaguely related: what’s the impact of AI on CTFs? I would assume that kills part of the fun of such events?

ducktastic

It would be interesting to have some tests run against deliberate code obfuscation next.

hereme888

> Claude Opus 4.6 found it… and persuaded itself there is nothing to worry about.

Lol.

> Gemini 3 Pro supposedly “discovered” a backdoor.

Yup, sounds typical for Gemini... it tends to lie.

Very good article. Sounds super useful to apply its findings and improve LLMs.

On a similar note... reverse engineering is now accessible to the public. Tons of old software is now easy to RE. Are software companies having issues with this?

openasocket

Ummm, is it a good idea to use AI for malware analysis? I know this is just a proof of concept, but if you have actual malware, it doesn't seem safe to hand that to AI. Given the lengths existing malware already goes to with anti-debugging, crafting something that prompt-injects the AI, or tricks it into executing something, seems even easier.

fsniper

So these beat me to identifying backdoors too. This is going places at an alarming pace.

monegator

the interactive code viewer is neat!

Roark66

And this is one demonstration of why those "1000 CTOs claim no effectiveness improvement after introducing AI in their companies" stories are 100% BS.

They may not have noticed an improvement, but that doesn't mean there isn't any.

stevemk14ebr

These results are terrible: false positives and false negatives. Useless.

shablulman

Validating binary streams at the gateway level is such an overlooked part of the stack; catching malformed Protobuf or Avro payloads before they poison downstream state is a massive win for long-term system reliability.
