Durable execution is best done at the level of a language implementation, not as a library.
A workflow engine I recently built provided an interpreter for a Scheme-based language that, for each blocking operation, took a snapshot of the interpreter state (heap + stack) and persisted that to a database. Each time an operation completes (which could be after hours/days/weeks), the interpreter state is restored from the database and execution proceeds from the point at which it was previously suspended. The interpreter supports concurrency, allowing multiple blocking operations to be in progress at the same time, so the work to be done after the completion of one can proceed even while others remain blocked.
The advantage of doing this at the language level is that persistence becomes transparent to the programmer. No decorators are needed; every function and expression inherently has all the properties of a "step" as described here. Deterministic execution can be provided if needed. And if there's a need to call out to external code, it is possible to expose Python functions as Scheme built-ins that can be invoked from the interpreter either synchronously or asynchronously.
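As a rough illustration of that approach, here is a minimal Python sketch (not the actual engine described above; every name in it is invented): the whole interpreter state is serialized when a blocking operation is reached and later deserialized to continue exactly where it left off.

```python
import pickle

BLOCK = object()  # sentinel: "this operation blocks; snapshot and suspend"

class Interp:
    """Toy interpreter: `pc` stands in for the stack, `env` for the heap."""
    def __init__(self, program):
        self.pc = 0
        self.env = {}
        self.program = program

    def run(self):
        while self.pc < len(self.program):
            op = self.program[self.pc]
            self.pc += 1
            if op(self.env) is BLOCK:
                return pickle.dumps(self)  # in a real engine: write to a database
        return self.env

def resume(snapshot, completed_value):
    interp = pickle.loads(snapshot)        # restore heap + stack
    interp.env["last_result"] = completed_value
    return interp.run()

# Module-level ops so pickle can serialize them by reference.
def create_order(env):
    env["order"] = "A-123"

def await_payment(env):
    return BLOCK                           # e.g. waiting for a webhook

def ship(env):
    print("shipping", env["order"], "after", env["last_result"])

snapshot = Interp([create_order, await_payment, ship]).run()  # suspends here
resume(snapshot, "payment-ok")   # hours or days later: restore and continue
```

A real implementation has to serialize continuations and closures rather than a program counter, which is exactly why doing this inside the language implementation is so much easier than bolting it on from outside.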
I see a lot of workflow engines released that almost get to the point of being like a traditional programming language interpreter but not quite, exposing the structure of the workflow using a DAG with explicit nodes/edges, or (in the case of DBOS) as decorators. While I think this is ok for some applications, I really believe the "workflow as a programming language" perspective deserves more attention.
There's a lot of really interesting work that's been done over the years on persistent systems, and especially orthogonal persistence, but sadly this has mostly remained confined to the research literature. Two real-world systems that do implement persistence at the language level are Ethereum and Smalltalk; also some of the older Lisp-based systems provided similar functionality. I think there's a lot more value waiting to be mined from these past efforts.
shipp02
This seems like Temporal, only without as much server infrastructure and complexity. Maybe they're glossing over that complexity, or it really is that simple.
Overall, really cool! There are some scalability concerns brought up that I think are valid, but maybe you run a Postgres server backing every few application servers that need this kind of execution. Also, every function shouldn't be its own step; the work needs to be divided into larger chunks so that each request only generates <10 steps.
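To make the granularity point concrete, here is a self-contained sketch where each decorated call stands in for one checkpoint write to Postgres (the decorator is a stand-in, not the library's actual API):

```python
checkpoint_writes = 0

def step(fn):
    """Stand-in for a durable-execution step decorator: one write per call."""
    def wrapper(*args, **kwargs):
        global checkpoint_writes
        checkpoint_writes += 1
        return fn(*args, **kwargs)
    return wrapper

@step
def process_one(item):            # too fine-grained: one write per item
    return item * 2

@step
def process_batch(items):         # coarser: one write per request
    return [item * 2 for item in items]

items = list(range(1000))
for i in items:
    process_one(i)                # 1000 checkpoint writes
process_batch(items)              # 1 checkpoint write
print(checkpoint_writes)          # 1001
```

The tradeoff, of course, is that a coarser step redoes more work if it fails partway through.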
from-nibly
This is the can of worms introduced by doing event processing. As soon as you break the tie between request and response, a billion questions come up. In a request-response scenario, the answer to "what happens to the reservation?" is simply to throw an error at the user and ask them to try again.
As soon as you accept the user's input and tell them all is well before you have processed it, you run into these kinds of problems.
I know not every single thing we do can be done without this kind of async processing, but we should treat these scenarios more seriously.
It's not that it can't ever be good; it's that it will always be complicated.
jkonowitch
Having used several external orchestrators, I can see the appeal of the simplicity of this approach, especially for smaller teams wanting to limit the amount of infrastructure they maintain. Postgres is a proven tool, and as long as you design `@step`s to each perform one non-deterministic side effect, I can see this scaling very well in terms of both performance and maintainability.
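A sketch of that "one non-deterministic side effect per step" rule (the decorator below is illustrative, not the library's actual API): the workflow body only sequences steps, so on a replay the recorded step results can be substituted instead of re-running the effects.

```python
import random

def step(fn):
    # stand-in: imagine this checkpoints the return value in Postgres
    return fn

@step
def charge_card(order_id: str) -> str:
    # exactly one non-deterministic effect: the external charge
    return f"charge-{random.randint(1000, 9999)}"

@step
def send_receipt(charge_id: str) -> None:
    # exactly one non-deterministic effect: the outbound email
    print(f"emailed receipt for {charge_id}")

def fulfill_order(order_id: str) -> None:
    # the workflow itself stays deterministic: it only sequences steps
    charge_id = charge_card(order_id)
    send_receipt(charge_id)

fulfill_order("order-42")
```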
diptanu
I don’t think what you’re describing as heavyweight is that big of a deal if an external orchestration system is required only for deployment, while the workflow can be developed and tested without a server on a laptop or in a notebook.
Bringing orchestration logic into the app layer means more code bundled with the app, which has its own set of tradeoffs, like pulling in a different set of dependencies that might conflict with application code.
In 2025, I would be surprised if a good workflow engine didn’t have a completely serverless development mode :)
zokier
My immediate reaction is hell no.
> In some sense, external orchestration turns individual applications into distributed microservices, with all the complexity that implies.
I'd argue that durable execution is intrinsically complex, and external orchestrators give you tools to manage that complexity, whereas this attempts to brush the complexity under the rug in a way that does not inspire confidence.
shayansm1
> In some sense, external orchestration turns individual applications into distributed microservices, with all the complexity that implies.
While I'm not entirely convinced by the notion that distributed microservices inherently increase complexity, I do see significant benefits in how they empower workflows that span multiple projects and teams. For instance, in Temporal, different workers can operate in various programming languages, each managing its own specific set of activities within a single workflow. This approach enhances communication and collaboration between diverse projects, allowing them to leverage their unique tech stacks while still working together seamlessly.
rammy1234
My reaction is "No way, not again!!" I have personally done this kind of internal orchestration at scale at a large enterprise, spanning millions of executions, and it has scalability problems. We eventually externalized it to bring back sanity.
CGamesPlay
Love it! I built a toy library that looked very similar to this one a few months ago. How does this handle changing the workflow code? I quite like how Temporal handles it, where you use an "if has(my_feature)" check to allow in-progress workflows to be live-updated, even in the middle of loops. I also introduced the idea of "object handles", something like a file descriptor: an opaque handle to the workflow function that can be given to a step function to be unwrapped, and that can be persisted and restored via a consistent ID.
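For readers unfamiliar with that pattern, here is a rough, self-contained sketch of the branching idea (the has_feature helper is hypothetical; Temporal's real SDKs do this by recording a patch marker in the workflow history): old in-flight runs replay the original branch, new runs take the updated one.

```python
def has_feature(run_started_before_change: bool) -> bool:
    # stand-in: a real SDK consults the recorded workflow history here
    return not run_started_before_change

def my_workflow(order_id: str, run_started_before_change: bool = False) -> str:
    if has_feature(run_started_before_change):
        return f"new-path:{order_id}"   # code added after the deployment
    return f"old-path:{order_id}"       # kept so old histories still replay

print(my_workflow("A-1"))                                  # fresh run -> new-path
print(my_workflow("A-1", run_started_before_change=True))  # old run   -> old-path
```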
martinpeck
I think the example given in this blog post might need a "health warning" that steps should, generally, be doing more than just printing "hello".
I can imagine that the reads and writes to Postgres for a large number of workflows, each with a large number of small steps called in a tight loop, would cause some significant performance problems.
The examples given on their main site are a little more meaningful.
tenken
I have never used it, but a predecessor of mine talked about Clipper a lot, and I believe it allowed remote execution blocks tied to a storage backend; in this case I'm talking about xBase languages ...
I think also Rebol supports remote execution blocks ...
ethbr1
Hot take: this is bad architecture.
The solution seems to be solving for the simplest use case (internal stateless functions) rather than the most complex use case (external state-impactful functions).
Furthermore, the terminology isn't really pointing at what they should be talking about.
>> Because workflows are just Python functions, the thread can restart a workflow by simply calling the workflow function with its original inputs and ID, retrieved from Postgres.
>> For this model to work, we have to make one assumption: workflow functions must be deterministic.
Yes, "deterministic"... because all state modification is aligned to ensure that.
If instead of a single print() the function/step had 2 print()'s, the state leaks and the abstraction explodes.
The right abstraction here is probably something more functional / rust-like, where external state modifications sections are explicitly decorated (either automatically or by a developer).
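To make the two-print() objection concrete, here is a small simulation (the crash point and retry loop are faked): a step that performs two external effects and dies between them will repeat the first effect when the step is retried as a whole.

```python
effects_log = []

def risky_step(attempt: int) -> None:
    effects_log.append("charged card")     # external effect #1
    if attempt == 0:
        raise RuntimeError("process crashed mid-step")
    effects_log.append("sent receipt")     # external effect #2

for attempt in range(2):                   # the engine retries the whole step
    try:
        risky_step(attempt)
        break
    except RuntimeError:
        pass

print(effects_log)  # ['charged card', 'charged card', 'sent receipt']
```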