I've been saying for a while that given a proper harness, small local models can perform incredibly well. When you have a system that can try everything, it will eventually get it right as long as you can prevent it from getting it wrong in the meantime.
show comments
lwansbrough
Had a couple thoughts in this realm, and am working them into my own harness. Curious to see what others think. I'm not sure if this is generalizable, as my harness is fairly specialized:
- Breaking down a problem into a planned execution, with executing agent providing the initial plan which includes explicit objectives such as which tools it calls and what it would consider to be a successful execution.
- The harness then executes the plan in order
- Each step that involves a tool call will be executed by breaking down the tool call into component parts: the harness interrogates the agent for a valid parameter value for the current tool argument. The tool definition contains validators for each argument. If the validator fails, the harness rewinds the conversation and injects the failure reason into the next try.
- Once the agent produces a valid response for the argument, the harness proceeds to the next argument.
- Once all the arguments have been filled, the harness calls the tool. It passes the agent's initial expected value along with the actual value, along with any errors that may have been produced and asks the agent if it is satisfied with the result. If it isn't, the agent provides a reason and the harness then retries the tool call process from the beginning rewinding the conversation and inserting the reasoning for the retry.
- The agent may request to re-plan if it discovers a flaw in its initial plan. The harness will also attempt to re-plan if the agent produces too many failures in a row.
This proves to be quite effective at reducing tool call failures. One benefit is that the sub-agent gets a perfect conversation history where it makes no mistakes. I'm not sure if it's actually better at completing tasks though, I haven't tried to benchmark it.
show comments
jonnyasmar
The tool-call ambiguity point — yeah, I hit that at frontier scale too. Running Claude Code, Codex, and Gemini CLI in parallel for daily dev, the most common failure mode I see is grep/find returning exit code 1 (no matches): the model reads it as "the tool failed" instead of "search ran, here's the negative space," then either bails or retries with slightly different syntax instead of broadening the search.
The retry-nudge layer maps almost 1:1 to what I do manually multiple times an hour: "no, that wasn't a tool failure, the file just doesn't contain that pattern, try X." Encoding it at the framework level is the right shape.
Have you looked at whether these guardrails close the smaller frontier-model gap on long-horizon tasks? My intuition is the 87→99 delta on Sonnet won't quite hold past ~50 steps, where context drift starts dominating more than retry semantics.
show comments
jf
Tangentially related: Since you are at Texas Instruments, I wonder if you could find out what the status is of the intellectual property for the TI Explorer lisp machines. I know who owns the IP for Genera, but wasn’t able to find out about TI’s lisp OS
show comments
6r17
Very cool work ! I'm running harness system myself and could measure improvement of token use of 2x to 10x on gsm8k only by running a math harness - i'm confident the future is bright for people who will know how to sell tech that is appropriately scaled to one's need. We absolutely do not need to run Claude 123 for most tasks and we better prepare for the rag-pull !
seemaze
> One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode.
I thought Llamafile was just a model and llama.cpp bundled in to a single binary - is this the difference between Llamafile injecting a default sysmtem prompt vs hitting the raw llama-server endpoint with no harness?
That seems like comparing apples to apple pie, there's some ingredients missing.
show comments
88j88
Something very similar I was experimenting with on, but had different results that you may be interested in, some of my findings were interesting
This was part of testing out how well a tool of mine worked (github.com/jsuppe/loom), which aims to be used to extracts requirements, specs, creates tests. At first I had no intention of using it for code generation but then tried it out with some early success. I tried splitting the work by using the tool with different frontier models, and then providing work to a local ollama instance running one of several models. Not all local models had the same outcome, not all coding languages had the same outcome. I also found in this experiment, when nailing down the coding tasks I wanted to set up positive and negative scenarios- which is where I found setting guardrails can sometimes backfire with inversion- this essentially elaborates on previous work by Khan 2025 (https://arxiv.org/abs/2510.22251); the most interesting finding to me was that if you give guardrails with a rationale, it reduces compliance and may cause the inversion
For coding tasks I found that the improvement was not only ability to use a lower cost model for these broken down tasks, but wall clock time was improved over using frontier model alone, with equivalent outcomes.
show comments
azurewraith
Interestingly enough we have found the same net result -- structural guardrails are the unlock for smaller models. Our approach in particular layers three things: a parse rescue for malformed/incorrect tool calls (similar to your retry nudges), content-level intervention (diff size rejection, checkpoint forcing) and state machine enforcement on top (per-phase tool restriction, transition guards). On 13B models we saw completion of a selection of SWE-bench tasks went from ~20% to 100%. With frontier models we saw a reduction in API calls from reduced thrashing.
One of the most surprising findings was when a 9B model self-corrected through 4 tool parse failures within the guard rails. It tried to use a complex tool (patch_file), kept failing and eventually downshifted to a simpler tool (edit_line) that it could actually execute. The guardrails didn't make the model smarter, it just narrowed the execution space until it could find something that worked.
Impressive work, love seeing tools that boost local LLM reliability without touching the model itself
show comments
_pdp_
Maybe I am reading it wrong but I don't think this does what it claim it does or at least how it sounds.
Basically this is a tool auto-complete that has a workflow element to it with certain steps that need to happen in certain order. In other words the order is defined in advance. Am I correct?
Basically execute step 1 first, then step 2 and finally step 3 and this is the schema for each step. That is effectively the guardrail and there is retry logic.
If it is the case, this is obviously useful but in a very specific set of problems where the solution is kind of known in advance. A workflow automation might work but this is kind of N8N where each step is LLM step.
Anyway, I might me wrong but I wanted to share a few thoughts.
show comments
tempoponet
Why this entire tool chain instead of building within something like pi code?
I've been exploring this area and a project like https://github.com/itayinbarr/little-coder (not my work) lets me mix and match with my current setup or any plugins built for pi.
show comments
bglusman
Funny timing. I’ve been building something adjacent, though from a different angle: not primarily local-model reliability, but a control layer around agent execution, tools, routing, and operator intent. I was calling these "synthetic models", but decided yesterday "LLM middleware" is a clearer description.
The common thread I see is treating the harness around the model as first-class infrastructure. Forge seems focused on tool-call correctness and recovery; Wardwright is more about controlling what the agent is supposed to do, where work gets routed, and how the operator stays in the loop.
Curious whether you see those as complementary layers. I’m planning to try Forge and would be interested in seeing whether they fit together cleanly.
show comments
tommica
What are "guardrails" in this context? Is it correctly understood that this would sit between my pi agent and llama-server, and it would do what exactly?
show comments
roger_
Would putting this between a small model and an agent like Hermes improve performance?
show comments
k__
So, this basically ensures that models call the right tools with the correct format?
show comments
nzeid
> # External mode — you manage llama-server, forge proxies it
This is a good example because I've currently stuck with llama.cpp's UI. I can read your code (or throw Gemma at it =p ) but thought I'd ask anyway.
In this example, what is it exactly that your proxy is fortifying? The HTTP SSE requests? (Those would be `/chat/completions`.)
show comments
jamesponddotco
This seems pretty awesome; being able to use an 8B model for tool calling would be perfect.
Interested in using this for Home Assistant using a Mac Mini as my server. Does it run on MacOS?
How is the latency when using the proxy? I’m using Claude Haiku 4.5 for my voice assistant right now and it’s pretty fast, but if I could keep the LLM local, it’d be even better.
show comments
MWil
have you considered implementing the addition of a leading canary sentinel that fires at the earliest/cheapest possible point instead of only on lag of some actual load-bearing constraint violation?
show comments
ElenaDaibunny
guardrails this well-designed matter way more than just throwing bigger models at agent tasks tbh
show comments
lucrbvi
How does this differ from dottxt's Outlines[0] on the technical level? Are you using some JSON grammar to force the LM head distribution to follow it?
Curious if this would help larger local models? Qwen 3.6 varieties of deepseek4?
show comments
mholubowski
Hey I'm really impressed and hoping to connect. I followed you on X just now, is that a decent place to shoot you a DM? I don't want anything from you, we just seem to be working on similar things (I'm working on our internal agent harness here, at a healthcare startup).
show comments
zambelli
Happy to answer questions about the eval methodology, the backend findings, or anything in the repo. I'll be around.
show comments
dpweb
Hello. Interesting project! Haven't gone through it yet, but want to consider using this in my CS master's capstone. While you have benchmarks I may create my own specific scenarios and comparisons vis-a-vis hosted inference to highlight specific economic benefit. Any suggestions?
show comments
tim-projects
I've been working on the same thing and even nearly called it forge. Instead I called it hammer.
I'll be keen to look through the code on this!
show comments
pianopatrick
Do you think a similar approach would work with smaller models, like 1.5B models?
show comments
simonw
This is a neat project, but the description made me realize that I don't actually know what the term "guardrails" means.
... which lead me to realize that it's one of those terms with multiple meanings - like "agent" or even "AI" itself - but where people who use it may not be aware of how many different definitions are floating around.
In this project it refers to validating tool calls - fixing invalid tool responses, making sure certain required tool calls have been made, maintaining an error budget after which the task is abandoned with an error.
Other projects might use "guardrails" to mean protecting against unsafe content (Llama Gaurd), refusing off-topic queries (NVIDIA NeMo Guardrails "topical rails", filtering PII, detecting jailbreaks, or human-in-the-loop checks of specific actions.
I've even seen people talk about running a coding agent in a sandbox (Docker, Firecracker etc) as a form of guardrail.
show comments
xiaod
I'd be curious about the eval methodology. In production coding tasks, the gap between benchmark scores and actual workflow integration can be significant. What does the error recovery loop look like?
show comments
GrinningFool
That's a huge gap for llama.cpp server - any idea why?
show comments
rebekkamikkoa
Hi Antoine!
Interesting point about backend variance. Do you think serving layer should become part of standard LLM eval reporting?
show comments
jedisct1
Interesting!
The https://swival.dev harness already has retry nudges, step enforcement, error recovery, context awareness, etc. to try to support small models as much as possible.
Curious to see how it compares with forge, and if both could be combined.
show comments
choonway
no different from how the mcdonalds system can turn any random person on the street to a smiling cog in the machine.
snovv_crash
I get a strong LLM smell in your description. If you couldn't bother to write it, why should I bother to read it?
I've been saying for a while that given a proper harness, small local models can perform incredibly well. When you have a system that can try everything, it will eventually get it right as long as you can prevent it from getting it wrong in the meantime.
Had a couple thoughts in this realm, and am working them into my own harness. Curious to see what others think. I'm not sure if this is generalizable, as my harness is fairly specialized:
- Breaking down a problem into a planned execution, with executing agent providing the initial plan which includes explicit objectives such as which tools it calls and what it would consider to be a successful execution.
- The harness then executes the plan in order
- Each step that involves a tool call will be executed by breaking down the tool call into component parts: the harness interrogates the agent for a valid parameter value for the current tool argument. The tool definition contains validators for each argument. If the validator fails, the harness rewinds the conversation and injects the failure reason into the next try.
- Once the agent produces a valid response for the argument, the harness proceeds to the next argument.
- Once all the arguments have been filled, the harness calls the tool. It passes the agent's initial expected value along with the actual value, along with any errors that may have been produced and asks the agent if it is satisfied with the result. If it isn't, the agent provides a reason and the harness then retries the tool call process from the beginning rewinding the conversation and inserting the reasoning for the retry.
- The agent may request to re-plan if it discovers a flaw in its initial plan. The harness will also attempt to re-plan if the agent produces too many failures in a row.
This proves to be quite effective at reducing tool call failures. One benefit is that the sub-agent gets a perfect conversation history where it makes no mistakes. I'm not sure if it's actually better at completing tasks though, I haven't tried to benchmark it.
The tool-call ambiguity point — yeah, I hit that at frontier scale too. Running Claude Code, Codex, and Gemini CLI in parallel for daily dev, the most common failure mode I see is grep/find returning exit code 1 (no matches): the model reads it as "the tool failed" instead of "search ran, here's the negative space," then either bails or retries with slightly different syntax instead of broadening the search.
The retry-nudge layer maps almost 1:1 to what I do manually multiple times an hour: "no, that wasn't a tool failure, the file just doesn't contain that pattern, try X." Encoding it at the framework level is the right shape.
Have you looked at whether these guardrails close the smaller frontier-model gap on long-horizon tasks? My intuition is the 87→99 delta on Sonnet won't quite hold past ~50 steps, where context drift starts dominating more than retry semantics.
Tangentially related: Since you are at Texas Instruments, I wonder if you could find out what the status is of the intellectual property for the TI Explorer lisp machines. I know who owns the IP for Genera, but wasn’t able to find out about TI’s lisp OS
Very cool work ! I'm running harness system myself and could measure improvement of token use of 2x to 10x on gsm8k only by running a math harness - i'm confident the future is bright for people who will know how to sell tech that is appropriately scaled to one's need. We absolutely do not need to run Claude 123 for most tasks and we better prepare for the rag-pull !
> One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode.
I thought Llamafile was just a model and llama.cpp bundled in to a single binary - is this the difference between Llamafile injecting a default sysmtem prompt vs hitting the raw llama-server endpoint with no harness?
That seems like comparing apples to apple pie, there's some ingredients missing.
Something very similar I was experimenting with on, but had different results that you may be interested in, some of my findings were interesting
This was part of testing out how well a tool of mine worked (github.com/jsuppe/loom), which aims to be used to extracts requirements, specs, creates tests. At first I had no intention of using it for code generation but then tried it out with some early success. I tried splitting the work by using the tool with different frontier models, and then providing work to a local ollama instance running one of several models. Not all local models had the same outcome, not all coding languages had the same outcome. I also found in this experiment, when nailing down the coding tasks I wanted to set up positive and negative scenarios- which is where I found setting guardrails can sometimes backfire with inversion- this essentially elaborates on previous work by Khan 2025 (https://arxiv.org/abs/2510.22251); the most interesting finding to me was that if you give guardrails with a rationale, it reduces compliance and may cause the inversion
For coding tasks I found that the improvement was not only ability to use a lower cost model for these broken down tasks, but wall clock time was improved over using frontier model alone, with equivalent outcomes.
Interestingly enough we have found the same net result -- structural guardrails are the unlock for smaller models. Our approach in particular layers three things: a parse rescue for malformed/incorrect tool calls (similar to your retry nudges), content-level intervention (diff size rejection, checkpoint forcing) and state machine enforcement on top (per-phase tool restriction, transition guards). On 13B models we saw completion of a selection of SWE-bench tasks went from ~20% to 100%. With frontier models we saw a reduction in API calls from reduced thrashing.
One of the most surprising findings was when a 9B model self-corrected through 4 tool parse failures within the guard rails. It tried to use a complex tool (patch_file), kept failing and eventually downshifted to a simpler tool (edit_line) that it could actually execute. The guardrails didn't make the model smarter, it just narrowed the execution space until it could find something that worked.
Brief: https://statewright.ai/research
Impressive work, love seeing tools that boost local LLM reliability without touching the model itself
Maybe I am reading it wrong but I don't think this does what it claim it does or at least how it sounds.
Basically this is a tool auto-complete that has a workflow element to it with certain steps that need to happen in certain order. In other words the order is defined in advance. Am I correct?
Basically execute step 1 first, then step 2 and finally step 3 and this is the schema for each step. That is effectively the guardrail and there is retry logic.
If it is the case, this is obviously useful but in a very specific set of problems where the solution is kind of known in advance. A workflow automation might work but this is kind of N8N where each step is LLM step.
Anyway, I might me wrong but I wanted to share a few thoughts.
Why this entire tool chain instead of building within something like pi code?
I've been exploring this area and a project like https://github.com/itayinbarr/little-coder (not my work) lets me mix and match with my current setup or any plugins built for pi.
Funny timing. I’ve been building something adjacent, though from a different angle: not primarily local-model reliability, but a control layer around agent execution, tools, routing, and operator intent. I was calling these "synthetic models", but decided yesterday "LLM middleware" is a clearer description.
Very early prototype, so I’m looking more for architectural/conceptual reactions than polish: https://wardwright.dev / https://github.com/bglusman/wardwright
The common thread I see is treating the harness around the model as first-class infrastructure. Forge seems focused on tool-call correctness and recovery; Wardwright is more about controlling what the agent is supposed to do, where work gets routed, and how the operator stays in the loop.
Curious whether you see those as complementary layers. I’m planning to try Forge and would be interested in seeing whether they fit together cleanly.
What are "guardrails" in this context? Is it correctly understood that this would sit between my pi agent and llama-server, and it would do what exactly?
Would putting this between a small model and an agent like Hermes improve performance?
So, this basically ensures that models call the right tools with the correct format?
> # External mode — you manage llama-server, forge proxies it
> python -m forge.proxy --backend-url http://localhost:8080 --port 8081
This is a good example because I've currently stuck with llama.cpp's UI. I can read your code (or throw Gemma at it =p ) but thought I'd ask anyway.
In this example, what is it exactly that your proxy is fortifying? The HTTP SSE requests? (Those would be `/chat/completions`.)
This seems pretty awesome; being able to use an 8B model for tool calling would be perfect.
Interested in using this for Home Assistant using a Mac Mini as my server. Does it run on MacOS?
How is the latency when using the proxy? I’m using Claude Haiku 4.5 for my voice assistant right now and it’s pretty fast, but if I could keep the LLM local, it’d be even better.
have you considered implementing the addition of a leading canary sentinel that fires at the earliest/cheapest possible point instead of only on lag of some actual load-bearing constraint violation?
guardrails this well-designed matter way more than just throwing bigger models at agent tasks tbh
How does this differ from dottxt's Outlines[0] on the technical level? Are you using some JSON grammar to force the LM head distribution to follow it?
[0]: https://github.com/dottxt-ai/outlines
The dashboard github link appears to be broken
Curious if this would help larger local models? Qwen 3.6 varieties of deepseek4?
Hey I'm really impressed and hoping to connect. I followed you on X just now, is that a decent place to shoot you a DM? I don't want anything from you, we just seem to be working on similar things (I'm working on our internal agent harness here, at a healthcare startup).
Happy to answer questions about the eval methodology, the backend findings, or anything in the repo. I'll be around.
Hello. Interesting project! Haven't gone through it yet, but want to consider using this in my CS master's capstone. While you have benchmarks I may create my own specific scenarios and comparisons vis-a-vis hosted inference to highlight specific economic benefit. Any suggestions?
I've been working on the same thing and even nearly called it forge. Instead I called it hammer.
I'll be keen to look through the code on this!
Do you think a similar approach would work with smaller models, like 1.5B models?
This is a neat project, but the description made me realize that I don't actually know what the term "guardrails" means.
... which lead me to realize that it's one of those terms with multiple meanings - like "agent" or even "AI" itself - but where people who use it may not be aware of how many different definitions are floating around.
In this project it refers to validating tool calls - fixing invalid tool responses, making sure certain required tool calls have been made, maintaining an error budget after which the task is abandoned with an error.
Other projects might use "guardrails" to mean protecting against unsafe content (Llama Gaurd), refusing off-topic queries (NVIDIA NeMo Guardrails "topical rails", filtering PII, detecting jailbreaks, or human-in-the-loop checks of specific actions.
I've even seen people talk about running a coding agent in a sandbox (Docker, Firecracker etc) as a form of guardrail.
I'd be curious about the eval methodology. In production coding tasks, the gap between benchmark scores and actual workflow integration can be significant. What does the error recovery loop look like?
That's a huge gap for llama.cpp server - any idea why?
Hi Antoine!
Interesting point about backend variance. Do you think serving layer should become part of standard LLM eval reporting?
Interesting!
The https://swival.dev harness already has retry nudges, step enforcement, error recovery, context awareness, etc. to try to support small models as much as possible.
Curious to see how it compares with forge, and if both could be combined.
no different from how the mcdonalds system can turn any random person on the street to a smiling cog in the machine.
I get a strong LLM smell in your description. If you couldn't bother to write it, why should I bother to read it?