Testing distributed systems with AI agents

73 points | 10 comments | 9 hours ago
perkovsky

I like the “claim-driven” framing.

For stateful systems, tests named after setup details often get weakened over time. Tests named after the claim they are trying to falsify are harder to water down.

The part I’d be most interested in is how well this works for business invariants like idempotent posting, no lost acknowledgements, and recovery after partial failure.
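A minimal sketch of what a claim-named test for idempotent posting could look like. The `Ledger` class and its `post` method are hypothetical stand-ins for a real system under test, not anything from the article; the point is only that the test name states the claim to falsify rather than the setup.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical in-memory ledger standing in for the system under test."""
    balance: int = 0
    applied: set = field(default_factory=set)

    def post(self, request_id: str, amount: int) -> bool:
        """Apply a posting once per request_id; return True if it was applied."""
        if request_id in self.applied:
            return False  # duplicate delivery, e.g. a retry after a lost ack
        self.applied.add(request_id)
        self.balance += amount
        return True


def test_retried_posting_is_applied_at_most_once():
    # Named after the claim being falsified, not after the setup details.
    ledger = Ledger()
    results = [ledger.post("req-1", 100) for _ in range(3)]
    assert results == [True, False, False]
    assert ledger.balance == 100
```

A name like `test_three_posts_same_id` would describe the same setup, but it would not say what has to stay true, which is what makes it easy to weaken later.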

jumploops

Indirectly related, but has anyone else found repeatable success with pure markdown skills?

I’ve built a similar workflow (but for system design/execution) and it works surprisingly well with the frontier models.

The skill includes scripts to ensure the work was actually done/followed, but I’ve been testing it without the scripts and it does a decent job.

Yesterday, however, with GPT-5.5 xhigh[0], I noticed some hallucinations: the model stated it had created files when in fact it hadn’t.

A small hiccup like this is usually fine, as the model realizes the files don’t exist sometime later, but in this particular instance, it claimed the files were created and then just continued on.

tl;dr - I fell into the trap of trusting markdown-only workflows, only to be bitten by the models hallucinating steps.

[0] xhigh is on, but in this particular turn there was no reasoning presented, so it may have been a degradation of the LLM/harness.
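A minimal sketch of the kind of check script mentioned above, assuming the files the model claims to have created are available as a list of relative paths. The function name and that input format are hypothetical, not part of the skill described in this comment.

```python
from pathlib import Path


def missing_files(claimed, root="."):
    """Return the claimed paths that do not exist as regular files under root."""
    base = Path(root)
    return [p for p in claimed if not (base / p).is_file()]
```

Running something like this after each step, and failing the step when the returned list is non-empty, would catch the "claimed but never created" case instead of relying on the model to notice later.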

aphyr

Welp. Glad to see Li Shen's using the last fifteen years of my work to automate away my job. :-/

-- edit --

I've seen clients and some colleagues working on things like this, and I can't seem to put into words how disheartening it is. With the exception of some private analysis work, I've shared everything I've built, with everyone, for free. Papers like Elle took years to think through, implement, test, and write. That's free. High-quality checkers, Knossos, Jepsen itself, and the analyses I've put my life into: all public, all free. I put a lot of time into docs and support; essentially all unpaid. I teach classes and give conference talks to make these techniques broadly accessible because I want other engineers to be able to make high-quality systems.

At the same time, I've got a giant pile of debt from an old house that just won't quit throwing curveballs at me, and it's gonna be a few more decades before I can retire. The fact that my clients are willing to pay for this work is why I can invest so much time in R&D and give it all away. When I see someone roll in and just tell an LLM "Go use Jepsen and Elle and figure this out", it's like... well fuck. Is this even possible any more?

Thankfully, LLMs are still really bad at my job, but I don't know if, or how long, that will last. They also don't need to be good to be useful.

And if these LLM tools work, it's good, right? They find bugs, systems get safer. I want systems to be safer. On the other hand, I'm motivated to share what I do because I really want to help people. If it's just LLMs... it feels hollow. I think about this every time I've tried to work on open-source in the last few months. When I spend hours trying to figure out how to keep naming consistent, how to preserve compatibility over a decade, how to make complex code approachable through quality documentation... I have a person in mind. Someone I'll never meet, but they'll see that work, and their life will be a little easier, and maybe they'll smile. I've been talking with my therapist about it: how the work I used to do thinking about other human beings now feels purposeless. How the effort I put into making these tools and ideas accessible will inevitably cannibalize my own employment, because someone, somewhere, is going to tell an LLM "Hey, go do that", and I work in a very, very small niche. It feels like incipient depression.

Recently I've been thinking about taking Jepsen and its supporting libraries closed-source, and changing the way I write reports--instead of teaching people how to test and what to look for, just telling people the results. I don't want to do this. It's bad for everyone, but maybe it buys me a few years of runway. Enough to pay down some of the debt and figure out what I can do next with this body.

Fuck.
