lukax

NUMA can cause really crappy performance. We deployed a Go-based LLM gateway in Kubernetes on a server with hundreds of CPU cores. We didn't explicitly set GOMAXPROCS, so the Go runtime scheduled goroutines across many different CPUs; the gateway constantly used 200% CPU and GC was causing latency spikes. Then we set GOMAXPROCS=8 and all the performance issues went away. Until recently Kubernetes didn't work well with NUMA.
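For anyone who wants to derive the value rather than hard-code it, here is a minimal sketch of the idea, not what we actually run. It assumes cgroup v2 with the limit visible at /sys/fs/cgroup/cpu.max; the helper name procsFromCPUMax is made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// procsFromCPUMax converts the contents of a cgroup v2 cpu.max file
// ("<quota> <period>" or "max <period>") into a GOMAXPROCS value.
// It returns numCPU when there is no limit or the content is malformed.
// (Hypothetical helper, not part of any library.)
func procsFromCPUMax(content string, numCPU int) int {
	fields := strings.Fields(content)
	if len(fields) != 2 || fields[0] == "max" {
		return numCPU
	}
	quota, err1 := strconv.ParseInt(fields[0], 10, 64)
	period, err2 := strconv.ParseInt(fields[1], 10, 64)
	if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
		return numCPU
	}
	// Round up so a fractional limit such as 1.5 CPUs still gets 2 Ps.
	n := int((quota + period - 1) / period)
	if n > numCPU {
		n = numCPU
	}
	return n
}

func main() {
	n := runtime.NumCPU()
	// Assumption: cgroup v2 is mounted at /sys/fs/cgroup inside the container.
	if data, err := os.ReadFile("/sys/fs/cgroup/cpu.max"); err == nil {
		n = procsFromCPUMax(string(data), n)
	}
	runtime.GOMAXPROCS(n)
	fmt.Println("GOMAXPROCS set to", n)
}
```

Setting the GOMAXPROCS environment variable in the pod spec gets you the same effect with no code at all, which is what we did.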

treesknees

Something I didn’t see mentioned is that this unequal memory access time also affects PCIe I/O. If your thread on CPU A needs to get data in or out of a NIC attached to CPU B, your throughput and latency will suffer.

We have to explain this to customers of our software all the time, it’s something that’s easy to miss.
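A quick way to check which node a NIC hangs off on Linux is the numa_node file in sysfs. A rough sketch follows; the interface name "eth0" is only a placeholder, and parseNUMANode is a helper invented for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNUMANode parses the contents of a sysfs numa_node file.
// The kernel reports -1 when it has no NUMA affinity for the device,
// in which case known is false.
func parseNUMANode(content string) (int, bool, error) {
	n, err := strconv.Atoi(strings.TrimSpace(content))
	if err != nil {
		return 0, false, err
	}
	return n, n >= 0, nil
}

func main() {
	iface := "eth0" // placeholder; pass the real interface name as an argument
	if len(os.Args) > 1 {
		iface = os.Args[1]
	}
	path := "/sys/class/net/" + iface + "/device/numa_node"
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Println("cannot read", path, "(virtual interface or not Linux?):", err)
		return
	}
	node, known, err := parseNUMANode(string(data))
	switch {
	case err != nil:
		fmt.Println("unexpected content:", err)
	case !known:
		fmt.Println(iface, "has no NUMA affinity reported")
	default:
		fmt.Printf("%s is attached to NUMA node %d\n", iface, node)
	}
}
```

Once you know the node, the threads doing I/O on that NIC can be pinned to CPUs on the same node.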

Twirrim

NUMA is one of those amazing things that trips you up in all sorts of ways at unexpected times. It's the amazing "invisible" performance killer (invisible because unless you're already aware of NUMA, or remember to check, you won't know it's there, potentially crippling you).

It has been a source of routine conversations with customers and engineers of all kinds, and it's often one of those things you don't know about until it's too late.

I don't know if the kernel has improved this behaviour in the several years since we last tested, but a coworker realised that the Linux page cache wasn't fully split by NUMA node. They were benchmarking MySQL, running it in each NUMA node, and noticed the second NUMA node was noticeably slower. Then they discovered that after a reboot the second node was fast and the first was slower. After a bit of thinking and tinkering they worked out that libmysql was ending up in the page cache on whichever NUMA node the benchmark client was run in first. So even though they were pinning the benchmark tool and the MySQL process to a NUMA node, a benchmark client on the other node made the OS reach across to the first node to get at the page-cached library.

jpecar

I'm baffled by the fact that NUMA is still an issue in 2026. My impression is that this was all solved back in the dotcom era on those big Suns. At least in HPC we solved this in the mid-2000s. Why is the supposedly modern world still wasting time on this? The kernel these days exposes just about everything you would ever want to know about a system's topology, and every runtime should be making use of that information. If it does not, I cannot consider it ready for this century.
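To show how little it takes, here is a rough sketch that lists the CPUs of each NUMA node from sysfs on Linux. It assumes the usual /sys/devices/system/node/nodeN/cpulist layout; parseCPUList is a helper written for this example, not an existing API.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseCPUList expands a kernel CPU list such as "0-3,8-11" into
// individual CPU numbers. An empty list (a node with no CPUs) yields nil.
func parseCPUList(s string) ([]int, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil
	}
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("bad CPU list entry %q: %w", part, err)
		}
		last := first
		if isRange {
			last, err = strconv.Atoi(hi)
			if err != nil {
				return nil, fmt.Errorf("bad CPU list entry %q: %w", part, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for c := first; c <= last; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	paths, err := filepath.Glob("/sys/devices/system/node/node[0-9]*/cpulist")
	if err != nil || len(paths) == 0 {
		fmt.Println("no NUMA topology found in sysfs (not Linux, or not exposed)")
		return
	}
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		cpus, err := parseCPUList(string(data))
		if err != nil {
			fmt.Println(p, "error:", err)
			continue
		}
		// filepath.Dir(p) ends in "nodeN", which is the label we want.
		fmt.Printf("%s: %d CPUs %v\n", filepath.Base(filepath.Dir(p)), len(cpus), cpus)
	}
}
```

A runtime could read this once at startup and size or place its worker pools per node.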

iofiiiiiiiii

Yeah, when you have tall servers this can be a really surprising factor. In some sense you could view this as an extension of processor caching behaviors, which also cause some memory accesses to be slower, just due to cache behavior rather than physical location. In many cases the same tools can be used to fight both "far" memory accesses and cache thrashing, by using a thread-isolated architecture.
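To make "thread-isolated" a bit more concrete, here is a small shared-nothing sketch in Go (not the Rust code from my article). Each worker writes only to its own padded shard and the results are merged once at the end. It does not pin anything to a NUMA node, the 64-byte cache line is an assumption, and countSharded is a name made up for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// shard is state owned by exactly one worker. The padding keeps neighbouring
// shards on separate cache lines (assuming 64-byte lines).
type shard struct {
	count uint64
	_     [56]byte
}

// countSharded splits items into contiguous chunks, one per worker. Each
// worker counts matches into its own shard, so no memory is written by more
// than one goroutine; the shards are summed after all workers finish.
func countSharded(items []int, workers int, match func(int) bool) uint64 {
	if workers < 1 {
		workers = 1
	}
	shards := make([]shard, workers)
	chunk := (len(items) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start := w * chunk
		if start >= len(items) {
			break
		}
		end := start + chunk
		if end > len(items) {
			end = len(items)
		}
		wg.Add(1)
		go func(s *shard, part []int) {
			defer wg.Done()
			for _, v := range part {
				if match(v) {
					s.count++
				}
			}
		}(&shards[w], items[start:end])
	}
	wg.Wait()

	var total uint64
	for i := range shards {
		total += shards[i].count
	}
	return total
}

func main() {
	items := make([]int, 1000)
	for i := range items {
		items[i] = i
	}
	even := countSharded(items, 4, func(v int) bool { return v%2 == 0 })
	fmt.Println("even numbers:", even)
}
```

The same shape carries over to per-core or per-node request handling: keep the hot state private to one worker and only cross boundaries at well-defined merge points.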

I have been dealing with the topic for a few years now and it was surprisingly hard to track down the bottlenecks to actual numbers. Some time ago I managed to find a good example to demonstrate the effect in a tangible way and wrote up an article about it. If the topic sounds interesting, you might enjoy https://sander.saares.eu/2025/03/31/structural-changes-for-4... (Structural changes for +48-89% throughput in a Rust web service).

jeffbee

My question: why do mainstream users tolerate NUMA? 99% of you don't need to. Single-socket servers exist and they are not only tolerable but better in most ways. Dealing with NUMA in software consists of trying to logically partition the machine, but you can instead physically partition the machine. It's so much simpler!

Amazon gets this. Except for the 4th generation, their Graviton systems are not NUMA.
