Designing agentic loops

akashakya

For lightweight sandboxing on Linux you can use bubblewrap or firejail instead of Docker. They are faster and _simpler_. Here is a bwrap script I wrote an hour ago to run Claude in a minimal sandbox:

    #!/usr/bin/env bash
    # Assumes NODE_PATH is set to a directory containing the node binary.
    # --unshare-all drops every namespace; --share-net re-enables networking.
    exec bwrap \
      --ro-bind /usr /usr \
      --ro-bind /etc /etc \
      --ro-bind /run /run \
      --ro-bind "$NODE_PATH" /node \
      --proc /proc \
      --dev /dev \
      --symlink usr/lib64 /lib64 \
      --tmpfs /tmp \
      --unshare-all \
      --share-net \
      --die-with-parent \
      --new-session \
      --bind "$HOME/claude" /claude \
      --bind "$HOME/.claude.json" /claude/.claude.json \
      --bind "$HOME/.claude" /claude/.claude \
      --setenv HOME /claude \
      --setenv PATH "/node:/claude/bin:/usr/bin" \
      --bind "$(pwd)" /work \
      --chdir /work \
      /claude/bin/claude "$@"
mike_hearn

I recently built my own coding agent, due to dissatisfaction with the ones that are out there (though the Claude Code UI is very nice). It works as suggested in the article. It starts a custom Docker container and asks the model, GPT-5 in this case, to send shell scripts down the wire which are then run in the container. The container is augmented with some extra CLI tools to make the agent's life easier.
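
A minimal sketch of that loop, assuming the OpenAI Python client; the model id, container name, and completion signal are my own placeholders, and the real agent's prompts and tooling are more involved:

    # Sketch: ask the model for a shell script, run it in the container,
    # feed the output back, repeat. Error handling and extra tooling omitted.
    import subprocess
    from openai import OpenAI

    client = OpenAI()
    history = [
        {"role": "system", "content": "Reply with one bash script that advances "
                                      "the mission. Say MISSION COMPLETE when done."},
        {"role": "user", "content": "Mission: make the test suite pass."},
    ]

    for _ in range(50):  # hard cap on iterations
        reply = client.chat.completions.create(model="gpt-5", messages=history)
        script = reply.choices[0].message.content
        history.append({"role": "assistant", "content": script})
        if "MISSION COMPLETE" in script:
            break
        result = subprocess.run(
            ["docker", "exec", "agent-sandbox", "bash", "-lc", script],
            capture_output=True, text=True, timeout=600)
        history.append({"role": "user", "content": result.stdout + result.stderr})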

My agent has a few other tricks up its sleeve and it's very new, so I'm still experimenting with lots of different ideas, but there are a few things I noticed.

One is that GPT-5 is extremely willing to speculate. This is partly because of how I prompt it, but it's willing to write scripts that try five or six things at once, including things like reading files that might not exist. This level of speculative execution speeds things up dramatically, especially as GPT-5 is otherwise a very slow model that likes to think about things a lot.

Another is that you can give it very complex "missions" and it will drive things to completion using tactics that I've not seen from other agents. For example, if it needs to check something that's buried in a library dependency, it'll just clone the upstream repository into its home directory and explore that to find what it needs before going back to working on the user's project.

None of this triggers any user interaction, because everything runs in the container; in fact, no user interaction is possible. You set it going and do something else until it finishes. The workflow is very much to queue up "missions" that then run in parallel, and you merge them together at the end. The agent also has a mode where it takes the mission, writes a spec, reviews the spec, updates the spec given the review, codes, reviews the code, and so on.
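
That staged mode is roughly a fixed pipeline of prompts over a shared context. A toy sketch; the stage wording and the ask() round-trip helper are mine:

    # Toy spec/review/code pipeline: each stage is one model round-trip
    # over the accumulated context. ask() is a placeholder for the model call.
    STAGES = [
        "Write a spec for this mission.",
        "Review the spec and list problems.",
        "Update the spec to address the review.",
        "Implement the spec.",
        "Review the code against the spec and list fixes.",
    ]

    def run_pipeline(mission: str, ask) -> str:
        context = f"Mission: {mission}"
        for stage in STAGES:
            context += "\n\n" + ask(f"{context}\n\n{stage}")
        return context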

Even though it's early days, I've set this agent missions on which it spent 20 minutes of continuous, uninterrupted inference time and succeeded excellently. I think this UI paradigm is the way to go. You can't scale up AI-assisted coding if you're constantly needing to interact with the agent. Getting the most out of models requires maximally exploiting parallelism, so sandboxing is a must.

BoppreH

> If anything goes wrong it’s a Microsoft Azure machine somewhere that’s burning CPU and the worst that can happen is code you checked out into the environment might be exfiltrated by an attacker, or bad code might be pushed to the attached GitHub repository.

Isn't that risking getting banned from Azure? The compromised agent might not accomplish anything useful, but its attempts might get (correctly!) flagged by the cloud provider.

ademup

I'm surprised to see so many people using containers when setting up a KVM is so easy, gives the most robust environment possible, and to my knowledge has much better isolation. A vanilla build of Linux plus your IDE of choice and you're off to the races.

refset

For anyone else curious about what a practical loop implementation might look like, Steve Yegge YOLO-bootstrapped his 'Efrit' project using a few lines of Elisp: https://github.com/steveyegge/efrit/blob/4feb67574a330cc789f...

And for more context on Efrit this is a fun watch: "When Steve Gives Claude Full Access To 50 Years of Emacs Capabilities" https://www.youtube.com/watch?v=ZJUyVVFOXOc

tptacek

This is great. I feel like most of the oxygen is going to go to the sandboxing question (fair enough). But I'm kind of obsessed with what agent loops for engineering tasks that aren't coding look like, and also the tweaks you need for agent loops that handle large amounts of anything (source code lines, raw metrics or OTel span data, whatever).

There was an interval where the notion of "context engineering" came into fashion, and we quickly dunked all over it (I don't blame anybody for that; "prompt engineering" seemed pretty cringe-y to me), but there's definitely something to the engineering problems of managing a fixed-size context window while iterating indefinitely through a complex problem, and there are all sorts of tricks for handling it.
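
One such trick is compaction: once the transcript nears the window, fold the oldest turns into a summary and keep iterating. A rough sketch, with a crude chars/4 token estimate and a summarize() placeholder for a model call:

    # Rough context-compaction sketch for a fixed-size window.
    def estimate_tokens(messages):
        return sum(len(m["content"]) for m in messages) // 4  # crude estimate

    def compact(messages, summarize, budget=100_000, keep_recent=10):
        if estimate_tokens(messages) < budget:
            return messages
        old, recent = messages[:-keep_recent], messages[-keep_recent:]
        summary = summarize("Summarize this transcript, preserving file paths, "
                            "errors, and decisions:\n"
                            + "\n".join(m["content"] for m in old))
        return [{"role": "user", "content": "Summary so far: " + summary}] + recent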

tra3

On one hand, folks are complaining that even basic tasks are impossible with LLMs.

On the other hand we have the Cursed language [0], which was fully driven by AI and seems to be functional. Btw, how much did that cost?

I feel like I've been hugely successful with tools like Claude Code, aider, and opencode, especially when I can define custom tools. "You" have to be a part of the loop, in some capacity, to provide guidance and/or direction. I'm puzzled by the fact that people are surprised by this. When I'm working with other entities (people) who are not artificially intelligent, the majority of the time is spent clarifying requirements and aligning on goals. Why would it be different with LLMs?

0: https://ghuntley.com/cursed/

jujugoboom

Wrong takeaway, but Claude Code was released in February of this year??? I swear people have been glazing it for way longer... my memory isn't that bad, right?

mccoyb

I think this is a strictly worse name than "agentic harness", which is already a term used by open-source agentic IDEs (https://github.com/search?q=repo%3Aopenai%2Fcodex%20harness&... or https://github.com/openai/codex/discussions/1174)

Any reason why you want to rename it?

Edit: to say more about my opinions, "agentic loop" could mean a few things -- it could mean the thing you say, or it could mean calling multiple individual agents in a loop ... whereas "agentic harness" evokes a sort of interface between the LLM and the digital outside world which mediates how the LLM embodies itself in that world. That latter thing is exactly what you're describing, as far as I can tell.

CuriouslyC

One important issue with agentic loops is that agents are lazy, so you need some sort of retrigger mechanism. Claude Code supports hooks: you can wire your agent's stop hook to a local LLM, feed the context in, and ask the model to prompt Claude to continue if needed. It works pretty well; Claude can override retriggers if it's REALLY sure it's done.
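
A rough sketch of such a stop hook (a command registered under the Stop event in Claude Code's settings): it reads the hook payload from stdin and prints a blocking decision to make Claude continue. The looks_done() judge is a stand-in for the local-LLM call; check the Claude Code hooks docs for the exact payload schema:

    # Stop-hook sketch: exit silently to allow stopping, or emit
    # {"decision": "block"} to push Claude to keep working.
    import json, sys

    def looks_done(transcript: str) -> bool:
        # Stand-in for a local-LLM judgment call.
        return "all tests pass" in transcript.lower()

    payload = json.load(sys.stdin)
    if payload.get("stop_hook_active"):  # already continuing; avoid a retrigger loop
        sys.exit(0)
    transcript = open(payload["transcript_path"]).read()
    if not looks_done(transcript):
        print(json.dumps({"decision": "block",
                          "reason": "The task looks unfinished; keep going."}))
    sys.exit(0)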

Regarding sandboxing, VMs are the way. Prompt-injected agents WILL be able to escape containers 100%.

simonw

I updated this post to link to the Claude Code docs that suggest running YOLO mode using their Docker dev container: https://www.anthropic.com/engineering/claude-code-best-pract... - which locks down network access to just a small set of domains: https://github.com/anthropics/claude-code/blob/5062ed93fc67f...

saltyoldman

I wouldn't be surprised if agents start getting managed by a distributed agentic system. Think about it: right now you get Codex/Claude/etc., and its system prompt and various other internally managed prompts are locked down to the version you downloaded. What if a distributed system ran experimental prompts, monitored the success rate (what code makes it into a commit), and provided feedback to the agent manager? That could help automatically fine-tune its own prompts.
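
In its simplest form that's a bandit over prompt variants, scored by whether the resulting code got committed. A toy sketch; every name here is illustrative:

    # Toy epsilon-greedy selector over prompt variants.
    import random

    class PromptBandit:
        def __init__(self, variants, epsilon=0.1):
            self.stats = {v: [0, 0] for v in variants}  # variant -> [commits, trials]
            self.epsilon = epsilon

        def pick(self):
            if random.random() < self.epsilon:  # explore a random variant
                return random.choice(list(self.stats))
            return max(self.stats,              # exploit the best commit rate
                       key=lambda v: self.stats[v][0] / (self.stats[v][1] or 1))

        def record(self, variant, committed: bool):
            self.stats[variant][0] += committed
            self.stats[variant][1] += 1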
