AI is pretty decent at writing TLA+ specs, but the specs are often bloated, and AI misses the simpler system. If you try to write TLA+ with LLMs and they produce a 500 LoC monstrosity with very abbreviated names, consider that this is not at all how TLA+ is supposed to feel. Leslie Lamport, the creator of TLA+, also recommends using a lot of comments in specs (the language is meant to be machine-checkable but is also notionally terse).
If you want to 'prompt your way' with TLA+, this is a decent one for ChatGPT. But don't expect to do wonders and find a better abstraction than you and if you run it 'agentically' with TLC in the loop, it will bloat the spec in no time:
Write the smallest TLA+ model that captures only the safety-relevant state of [the thing you're modeling].
Do not model implementation data structures unless they affect the property.
Use clear names, not abbreviations.
Start with: constants, variables, TypeOK, Init, and 2–4 invariants.
Use copious amounts of comments at the module level (use multi-line comments) and for each key variable, action, invariant, and temporal property (use one or more single-line comments).
Prefer fewer variables and coarser atomic steps unless finer interleavings are essential.
Keep the spec under 100 LoC (excluding comments) unless there is a specific reason not to.
After writing it, list what was intentionally abstracted away.
"Running leading LLMs across the eleven systems shows that LLMs are great at producing correct TLA+ syntax but struggle to ensure conformance and appropriate invariants."
show comments
AYBABTME
TLA+, P, Lean... formal methods and previously esoteric testing methods (property based, mutation... testing) should become the default. I think it's the only way we can really reap the benefits of agentic coding.
I wrote about this a bit on my blog[1], different angle but along the same line. You explain TLA+ and model checking well which makes the case concrete.
I'm curious of you have thoughts on these other methods and tools like P, Lean, Dafny, etc?
I found Quint to be a good pairing with LLM. Easier syntax to learn so you are actually collaborating instead of relying on hope.
show comments
leoqa
Unfortunately the benefit of TLA+ is the act of modeling your system painstakingly. The actual checker helps confirm your hypothesis, etc. But skipping the modeling and outsourcing it is not ideal. I’ve always struggled reasoning about models my team mates wrote, and will often have to mentally go through the process of arriving at the same abstractions/invariants etc before I can understand it.
show comments
some_random
So, the syntax is so bad that you have to use an LLM to generate the spec, but also you still need to understand the spec the LLM generates which is now harder because you weren't even the one to write it. It sounds like the syntax just sucks and there's no way to get around it.
show comments
aerodexis
I did this myself a few weeks ago and the technique that helped me the most was to compare the TLA output against the race-conditions I could construct by hand. I worked iteratively and unit-tested the model by constructing a model that didn't have certain race protection mechanisms and validating that the model generated the expected race.
You can also work backwards from the races it generates and ensure that they're real races.
show comments
sigbottle
Sorry, I thought this would've been a method for finding invariants, not one just for expressing them? I guess I should think about TLA+ as ultimately some kind of solver - give it a configuration, it tells me if it's defined well or not, the point is to make sure I'm not making mistakes, but not necessarily automated innovation?
vbernat
Isn't this line incorrect?
BB == b > 1 /\ b' = b - 2 /\ w' = w + 1 \* Picked 2 black
It should b > 2, otherwise you'll get in an invalid state.
show comments
UltraSane
TLA+ should be generated by hand as a extremely detailed spec for LLMs to use to generate code
AI is pretty decent at writing TLA+ specs, but the specs are often bloated, and AI misses the simpler system. If you try to write TLA+ with LLMs and they produce a 500 LoC monstrosity with very abbreviated names, consider that this is not at all how TLA+ is supposed to feel. Leslie Lamport, the creator of TLA+, also recommends using a lot of comments in specs (the language is meant to be machine-checkable but is also notionally terse).
If you want to 'prompt your way' with TLA+, this is a decent one for ChatGPT. But don't expect to do wonders and find a better abstraction than you and if you run it 'agentically' with TLC in the loop, it will bloat the spec in no time:
Write the smallest TLA+ model that captures only the safety-relevant state of [the thing you're modeling]. Do not model implementation data structures unless they affect the property. Use clear names, not abbreviations. Start with: constants, variables, TypeOK, Init, and 2–4 invariants. Use copious amounts of comments at the module level (use multi-line comments) and for each key variable, action, invariant, and temporal property (use one or more single-line comments). Prefer fewer variables and coarser atomic steps unless finer interleavings are essential. Keep the spec under 100 LoC (excluding comments) unless there is a specific reason not to. After writing it, list what was intentionally abstracted away.
https://www.sigops.org/2026/can-llms-model-real-world-system...
"Running leading LLMs across the eleven systems shows that LLMs are great at producing correct TLA+ syntax but struggle to ensure conformance and appropriate invariants."
TLA+, P, Lean... formal methods and previously esoteric testing methods (property based, mutation... testing) should become the default. I think it's the only way we can really reap the benefits of agentic coding.
I wrote about this a bit on my blog[1], different angle but along the same line. You explain TLA+ and model checking well which makes the case concrete.
I'm curious of you have thoughts on these other methods and tools like P, Lean, Dafny, etc?
[1]: https://aybabt.me/blog/correctness-agentic-world
I found Quint to be a good pairing with LLM. Easier syntax to learn so you are actually collaborating instead of relying on hope.
Unfortunately the benefit of TLA+ is the act of modeling your system painstakingly. The actual checker helps confirm your hypothesis, etc. But skipping the modeling and outsourcing it is not ideal. I’ve always struggled reasoning about models my team mates wrote, and will often have to mentally go through the process of arriving at the same abstractions/invariants etc before I can understand it.
So, the syntax is so bad that you have to use an LLM to generate the spec, but also you still need to understand the spec the LLM generates which is now harder because you weren't even the one to write it. It sounds like the syntax just sucks and there's no way to get around it.
I did this myself a few weeks ago and the technique that helped me the most was to compare the TLA output against the race-conditions I could construct by hand. I worked iteratively and unit-tested the model by constructing a model that didn't have certain race protection mechanisms and validating that the model generated the expected race.
You can also work backwards from the races it generates and ensure that they're real races.
Sorry, I thought this would've been a method for finding invariants, not one just for expressing them? I guess I should think about TLA+ as ultimately some kind of solver - give it a configuration, it tells me if it's defined well or not, the point is to make sure I'm not making mistakes, but not necessarily automated innovation?
Isn't this line incorrect?
It should b > 2, otherwise you'll get in an invalid state.TLA+ should be generated by hand as a extremely detailed spec for LLMs to use to generate code