The Isolation Trap: Erlang

132 points · 60 comments · 2 days ago
JackC

The article argues that shared memory and message passing are the same thing because they share the same classes of potential failure modes.

Isn't it more like, message passing is a way of constraining shared memory to the point where it's possible for humans to reason about most of the time?

Sort of like Rust and C. Yes, you can write code with `unsafe` in Rust that makes any mistake C can make. But the rules outside unsafe blocks, combined with the rules at module boundaries, greatly reduce the m * n complexity of a codebase of a given size, letting us reason better about larger codebases.

rdtsc

> But an escape hatch is still an escape hatch. These mechanisms bypass the process isolation model entirely. They are shared state outside the process model, accessible concurrently by any process, with no mailbox serialization, no message copying, no ownership semantics. And when you introduce shared state into a system built on the premise of having none, you reintroduce the bugs that premise was supposed to eliminate.

No, they don't bypass it. I don't know what "Technical Program Managers at Google" do, but they don't seem to use a lot of Erlang ;-). ETS tables can be modeled as a process which stores data and then replies to message queries. Every update and read is equivalent to sending a message. The terms are still copied (see note * below). You're not going to read half a tuple and then have it mutate underneath you as another process updates it. Traversing an ETS table is logically equivalent to asking a process for individual key-values using regular message passing.
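As a sketch of that modeling (a hypothetical `kv_proc` module, not how ETS is actually implemented): a single process owns the data, every write is a message it applies in mailbox order, and every read is a request/reply pair.

```erlang
-module(kv_proc).
-export([start/0, insert/3, lookup/2, loop/1]).

start() ->
    spawn(?MODULE, loop, [#{}]).

%% Every write is a message; the owning process applies it in order.
insert(Pid, Key, Value) ->
    Pid ! {insert, Key, Value},
    ok.

%% Every read is a request/reply pair; the reply is a copy, so the
%% caller can never observe a half-updated term.
lookup(Pid, Key) ->
    Ref = make_ref(),
    Pid ! {lookup, self(), Ref, Key},
    receive
        {Ref, Result} -> Result
    end.

loop(Map) ->
    receive
        {insert, Key, Value} ->
            loop(Map#{Key => Value});
        {lookup, From, Ref, Key} ->
            From ! {Ref, maps:get(Key, Map, undefined)},
            loop(Map)
    end.
```

Observably, `ets:insert/2` and `ets:lookup/2` behave like `insert/3` and `lookup/2` here; the table just skips the per-message overhead.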

What is different is what these are optimized for. ETS tables are great for querying and looking up data. They even have a mini query language for it (https://www.erlang.org/doc/apps/stdlib/qlc.html). Persistent terms are great for configuration values. None of them break the isolated heap and immutable data paradigm, they just optimize for certain access patterns.

Even for the process dictionary they mention: when a process reads another process's dictionary, it's still a signal sent to that process and a reply that needs to be received.

* Immutable binaries larger than 64 bytes are shared by reference, but they are shared by reference when sent in explicit messages between processes anyway.

Twey

Message passing is a type of mutable shared state — but one that's restricted in some important way to eliminate a certain class of errors (in Erlang's case, to a thread-safe queue with pairwise ordering guarantees so that all processing on a particular actor's state is effectively atomic). You can also pick other structures that give different guarantees, e.g. LVars or CRDTs make operations commutative so that the ordering problems go away (but by removing your ability to write non-commutative operations). The big win for the actor model is (just) that it linearizes all operations on a particular substate of the program while allowing other actors' states to be operated on concurrently.

Nobody argues that any of these approaches is a silver bullet for all concurrency problems. Indeed most of the problems of concurrency have direct equivalents in the world of single-threaded programming that are typically hard and only partially solved: deadlocks and livelocks are just infinite loops that occur across a thread boundary, protocol violations are just type errors that occur across a thread boundary, et cetera. But being able to rule out some of these problems in the happy case, even if you have to deal with them occasionally when writing more fiddly code, is still a big win.

If you have an actor Mem that is shared between two other actors A and B, then Mem functions exactly as shared memory does between colocated threads in a multithreaded system: after all, RAM on a computer is implemented by sending messages down a bus! The difference is just that in the hardware case the messages you can pass to/from the actor (i.e. the atomicity boundaries) are fixed by the hardware, e.g. to reads/writes on particular fixed-size ranges of memory, while a shared actor Mem is free to present its own set of software-defined operations, with awareness of the program's semantics. Memory fences are a limited way to bring that programmability to hardware memory, but the programmer still has the onerous and error-prone task of mapping domain operations to fences.
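As a sketch of that idea (a hypothetical `mem_actor` module; the withdrawal operation is an invented example of a software-defined operation): instead of exposing raw reads and writes, the shared actor exposes a domain operation whose check-and-update is atomic because it runs inside one message handler.

```erlang
-module(mem_actor).
-export([start/1, withdraw/2, balance/1, loop/1]).

start(Initial) ->
    spawn(?MODULE, loop, [Initial]).

withdraw(Mem, Amount) ->
    Ref = make_ref(),
    Mem ! {withdraw, self(), Ref, Amount},
    receive {Ref, Reply} -> Reply end.

balance(Mem) ->
    Ref = make_ref(),
    Mem ! {balance, self(), Ref},
    receive {Ref, B} -> B end.

%% The balance check and the update happen in one message handler, so
%% no interleaving from other callers can slip between them -- unlike
%% raw read-then-write on hardware shared memory.
loop(Balance) ->
    receive
        {withdraw, From, Ref, Amount} when Amount =< Balance ->
            From ! {Ref, ok},
            loop(Balance - Amount);
        {withdraw, From, Ref, _Amount} ->
            From ! {Ref, insufficient},
            loop(Balance);
        {balance, From, Ref} ->
            From ! {Ref, Balance},
            loop(Balance)
    end.
```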

IsTom

> Forget to set a timeout on a gen_server:call?

The default timeout is 5 seconds. You need to set an explicit `infinity` timeout to not have one.
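A simplified mirror of the `call/2` vs `call/3` distinction (a hypothetical `call_demo` module, not OTP's actual implementation, which additionally monitors the server process):

```erlang
-module(call_demo).
-export([call/2, call/3, echo_server/0]).

%% call/2 just delegates to call/3 with the documented 5000 ms default.
call(Server, Request) ->
    call(Server, Request, 5000).

call(Server, Request, Timeout) ->
    Ref = make_ref(),
    Server ! {call, {self(), Ref}, Request},
    receive
        {Ref, Reply} -> Reply
    after Timeout ->          %% with 'infinity' this clause never fires
        exit(timeout)
    end.

%% A trivial server that answers echo requests, for trying it out.
echo_server() ->
    spawn(fun Loop() ->
        receive
            {call, {From, Ref}, Request} ->
                From ! {Ref, Request},
                Loop()
        end
    end).
```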

johnisgood

> This isn’t obviously wrong

I thought it was obviously wrong: Server A calls Server B, and Server B calls Server A. When I read the code, my first thought was that it's circular. Is it really not obvious? Am I losing my mind?
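A minimal sketch of that circular call (a hypothetical `circular` module): while A is blocked inside `handle_call` waiting on B, it cannot read its own mailbox, so B's call back into A can only end when a timeout fires.

```erlang
-module(circular).
-behaviour(gen_server).
-export([start/0, ping/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start() ->
    {ok, A} = gen_server:start(?MODULE, peer_unset, []),
    {ok, B} = gen_server:start(?MODULE, A, []),
    gen_server:cast(A, {set_peer, B}),
    {A, B}.

ping(Server) ->
    gen_server:call(Server, ping).   %% default 5000 ms timeout

init(Peer) ->
    {ok, Peer}.

%% Calling the peer from inside handle_call is the trap: this server
%% is not servicing its own mailbox while the inner call is in flight,
%% so the circular call can never be answered. With the default
%% timeout both servers eventually crash; with 'infinity' they would
%% wait forever.
handle_call(ping, _From, Peer) ->
    Reply = gen_server:call(Peer, ping),
    {reply, Reply, Peer}.

handle_cast({set_peer, Peer}, _State) ->
    {noreply, Peer}.
```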

The mention of `persistent_term` is cool.

lukeasrodgers

I don't have much experience with Pony, but it seems like it addresses the core concerns in this article by design: https://www.ponylang.io/discover/why-pony/. I wish it were more popular.

pshirshov

I believe it's more correct to call circular calls "livelocks" rather than "deadlocks": something is happening, but the computation as a whole cannot progress.

For the rest: pure untyped actors come with a lot of downsides and provoke engineers into making systems unnecessarily distributed (with all the attendant consistency and timeout issues). There aren't that many problems that map well directly onto actors. I personally find async runtimes with typed front-ends (e.g. Cats/ZIO in Scala, async in Rust, etc.) much more robust and much less error-prone.

aeonfox

A really interesting read as someone who spends a bit of time with Elixir. I wasn't aware of the atomics and counters Erlang features that break isolation.

They do say that race conditions are mitigated purely by discipline at design time, but then mention race conditions found via static analysis:

> Maria Christakis and Konstantinos Sagonas built a static race detector for Erlang and integrated it into Dialyzer, Erlang’s standard static analysis tool. They ran it against OTP’s own libraries, which are heavily tested and widely deployed.

> They found previously unknown race conditions. Not in obscure corners of the codebase. Not in exotic edge cases. In the kind of code that every Erlang application depends on, code that had been running in production for years.

I imagine the fourth issue, protocol violations, could possibly be mitigated by a type-safe language like Gleam (or Elixir, once types are fully implemented).

anonymous_user9

This seems interesting, but the sheer density of LLM-isms makes it hard to get through.

tonnydourado

Thank god I found this page: https://causality.blog/series/, now I can relax knowing that at least there's a plan for a conclusion. Looking forward to the next posts

cyberpunk

Eh, maybe. I work on a big, mature, production Erlang system with millions of processes per cluster, and while the author is right in theory, these are quite extreme edge cases and I've never tripped over them.

Sure, if you design a shit system that depends on ETS for shared state there are dangers, so maybe don't do that?

I'd still rather be writing this system in Erlang than in another language where the footguns are bigger.

never_inline

> This isn’t just academic elegance, it kept phone switches running with five nines of availability.

Hmm....

> Erlang is the strongest form of the isolation argument, and it deserves to be taken seriously, which is why what happens next matters.

OK I think I know who wrote this.

> The problem isn’t that developers write circular calls by accident. It’s that deadlock-freedom doesn’t compose.

Is there a need to regurgitate it in this format? "Two protocols that are individually deadlock-free can still combine to deadlock in an actor system." That's the actually meaningful part.

> Forget to set a timeout on a gen_server:call?

People have pointed out in the thread that it's factually wrong. Eh.

> This is the discipline tax. It works when the team is experienced, the codebase is well-maintained, and the conventions are followed consistently. It erodes when any of those conditions weaken, and given enough time and enough turnover they do.

I know this is an LLM tell, but I can't point out why. It makes me uneasy to read this. Maybe the rule of three? Maybe the regurgitation of an elementary SE concept in the middle of a technical description? Maybe because it's trying too hard to sound smart? All three, I guess.

I could go on, but sigh. Man, don't use these clankers to write prose. They're like negative-level gzip compression.

worthless-trash

Could be wrong, but that won't deadlock, because 5 seconds later the call/2 is going to fail.

instig007

GHC Haskell has the best concurrency story among high-level programming languages: SMP parallelism, structured concurrency with M:N multicore mapping, STM transactions for data structures including members of collections (https://hackage.haskell.org/package/stm-containers), and OTP-like primitives (https://haskell-distributed.github.io/). All of it fits nicely into native binaries on x86_64 and arm64.

felixgallo

This is agitslop.