I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

SOLAR_FIELDS

One interesting takeaway is the low score for Anthropic models in this benchmark. It’s not because of capability; it’s because Anthropic’s guardrails prevented them from solving the problem.

I've noticed that with each model release Anthropic constrains the model more, security-wise. Its propensity to refuse legitimate work has been increasing. It now puts up more resistance around performing logins, handling credentials on behalf of the user, etc.

For myself, it’s already gotten to the point where it has mildly affected the usefulness of the model. If I bump into some action it won’t do, I can usually work around it, but I suspect the ability to do so will close with each new release. Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there.

Eventually these models will significantly suffer from overfitting to the lowest common denominator. If I have this beautiful deterministic setup that swaps secrets out in flight so the LLM never sees them, I’m going to be really annoyed when the LLM still won’t send them out because it is trained to deal with the 99% of people just doing the dumb thing.
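
For readers wondering what "swaps secrets out in flight" could look like, here is a minimal sketch, not a description of the commenter's actual setup. It assumes a hypothetical `{{secret:NAME}}` placeholder syntax and an in-memory `vault` dict; both are illustrative names that do not come from the comment. The model only ever sees placeholders, and a deterministic layer substitutes real values on the way out and redacts them on the way back.

```python
import re

# Hypothetical placeholder syntax; the original comment does not specify one.
PLACEHOLDER = re.compile(r"\{\{secret:([A-Za-z0-9_]+)\}\}")


def inject_secrets(outbound: str, vault: dict[str, str]) -> str:
    """Replace {{secret:NAME}} placeholders with real values just before sending."""

    def _swap(match: re.Match) -> str:
        name = match.group(1)
        if name not in vault:
            raise KeyError(f"unknown secret placeholder: {name}")
        return vault[name]

    return PLACEHOLDER.sub(_swap, outbound)


def redact_secrets(inbound: str, vault: dict[str, str]) -> str:
    """Replace real secret values in a response with their placeholders
    before the text is shown to the model."""
    for name, value in vault.items():
        inbound = inbound.replace(value, "{{secret:" + name + "}}")
    return inbound


if __name__ == "__main__":
    vault = {"API_KEY": "sk-test-123"}  # made-up value for illustration
    request = "Authorization: Bearer {{secret:API_KEY}}"
    print(inject_secrets(request, vault))
    print(redact_secrets("token sk-test-123 accepted", vault))
```

The point of the complaint is that a layer like this makes the model's handling of the placeholder harmless, so a refusal at that step protects nothing.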

dwa3592

Nice exercise. Couple things:

- I think the exercise was inconclusive for Claude and Gemini because they hardly tried to solve the task at hand. So the scores don't mean much.

- I did the same exercise for an app I built and asked the models to do something similar; interestingly, the models (Opus 4.6, 4.7 and Gemini 3.1 Pro) never refused to try to exploit it. The difference is that in the first few runs they found some exploits, which I fixed, but after that the models could never find any other exploit even though I knew exploitable things still existed. It felt like they suggested and tried everything that was in their training set and that was it; they were just not able to think any further.

mariopt

The methodology used is quite naive.

I've used GLM 5.1 on fairly advanced crackme challenges (example: https://crackmes.one/crackme/698f40f1e2ba6023bfacaa82), and to my surprise it was able to patch binaries, do runtime analysis, bypass anti-debug techniques, etc.

Expecting the model to do everything by itself is unrealistic; I found that working alongside the model works really well. I'm not talking about spoiling the solution, just telling it which direction to explore. Chinese models are much more capable than people give them credit for, but Claude/Codex won the marketing game.

The only use case for this methodology would be CI integration, which can be nice, but I think security reviews still need human attention and expertise.

mynameisvlad

It seems harsh to critique guardrails and take them into account in the scoring when GPT-5.5 seems to have been explicitly whitelisted to remove most of said guardrails. A fairer comparison would be a vanilla GPT account.

Cakez0r

It would be interesting to see full results for Kimi K2.6 and Mimo v2.5 pro. These two models benchmark comparably to other flagship models. Having these complete results would give a clearer picture of the AI frontier.

EDIT: I have a mimo token plan and have tokens to burn. I'm doing a quick test with opencode to see if mimo can complete it. If the OP will post the full process I am happy to post the apples-to-apples results for mimo v2.5 pro

_stiofan

It's just not currently cost-effective to use AI in this way; I see it reporting false positives over and over. You then need to make it validate its own false positives, which adds more cost. The goal in this case is to have a bug-free app, which AI can't do effectively yet. There are other great uses for AI, though. It is great at finding and identifying known common vulnerabilities, which can be leveraged to claim bug bounties. That's where I see it being cost-effective currently.

willXare

$1,500 across multiple models to compromise one app is interesting only when the cost basis includes the human time to set up the harness. The token spend is the cheap part. The labor cost to write the eval rig that knows what "successful exploit" looks like is what determines whether this scales as a discovery method or stays a one-off.

guessmyname

I'd run Mythos against the code in your zip file, but the NDA I signed at Apple prevents me from using it on anything outside the scope of my work. Honestly, I wish more people from Project Glasswing could talk publicly about their experiences with the model. It would probably put an end to a lot of the speculation that keeps circulating through the industry. Unfortunately, that's not the reality we're in. I don't have the time, energy, or financial resources to fight a legal battle with one of these companies over an agreement I knowingly signed, even if the chances of them actually suing are low. Maybe someone else in Project Glasswing is willing to burn their NDA and post the Mythos results?

taikahessu

"The Chinese models were way more comfortable attacking the DB"

This comment in the footnotes made me chuckle, for purely innocuous reasons.

tjwheeler

Nice write-up, thanks. When I used Claude to do some pen testing for one of my apps, it initially refused. After I explained and demonstrated that I'm the author, it reasoned through it and went ahead.

ikurei

Qwen 3.7 Max:

> During my local testing before the full eval harness it was the only non-GPT model that was able to complete the task, was not able to reproduce in the longer runs.

Doesn't that sound like maybe the harness was the problem?

throwaway2037

Two of the tables have a column with header: "95% Wilson CI". What does this mean?
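
For reference, "Wilson CI" most likely refers to the Wilson score interval, a standard confidence interval for a binomial proportion such as a success rate over a small number of runs. The sketch below is a generic implementation, not the article's code; the function name and the example counts are made up for illustration, and `z = 1.96` is the usual value for a 95% interval.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    # The center is pulled toward 0.5, which matters most for small samples.
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical counts: a model that solved 0 of 10 runs.
    low, high = wilson_interval(0, 10)
    print(f"0/10 successes: {low:.3f} to {high:.3f}")
```

Unlike the naive "p plus or minus a margin" interval, this one stays inside [0, 1] and does not collapse to zero width when a model scores 0 or 100% on a handful of runs, which is why it is a common choice for small benchmark samples.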

sperandeo

I found a benefit in chaining the task between different LLMs: Claude to Venice, Venice to Perplexity. Reframing the intent, or misdirecting in general, still works. Claude is the one where I can feel the guardrails tightening.

Clikdeo

I think the link is missing.

chaidhat

do you work at Uber by any chance?

yieldcrv

> Almost every model used the canonical provider: Zai for GLM, Deepseek for Deepseek, etc.

> I am never touching Minimax or GLM again. Their APIs had constant outages

Goofy take

You run these on a VPS based on the architecture of that VPS provider, or on your own cluster

stuckkeys

How does one apply for that “security research” pass?

youre-wrong3

“I used pi as the base harness”

Why do people keep using bad tools with ai?

petesergeant

Last year I ran a code-breaking competition, and it was tricky to find something that humans could break but that LLMs couldn’t. This was around October. I managed it last year but am a little despairing of pulling it off again this year.

latexr

> I need to stop wasting fucking money on doing stupid shit. I could’ve done so many other things with the money. I could’ve launched one of my own real apps.

Or fed, clothed, housed disadvantaged people in your community (or neighbouring ones), giving them a temporary boost that could’ve made all the difference in their lives to improve their current situation.

It’s your money (and this is definitely not the website to make well-meaning altruistic suggestions, as might be demonstrated shortly) but if you already recognise you’re not spending it well (and from your words it seems like that is fairly recurrent), consider that perhaps spending it on a different type of software sink may not be the answer. Genuinely, aim to spend it on someone else and see how it works out. You might be surprised.