At a past job (hedge fund), my role was to co-ordinate investigations into why latency may have changed when sending orders.
A couple of quants had built a random forest regression model that could take inputs like time of day, exchange, order volume, etc. and spit out an interval of what latency had historically been under those conditions.
If the latency moved outside that range, an alert would fire and I would co-ordinate a response with a variety of teams, e.g. trading, networking, Linux, etc.
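Roughly the shape of that setup, as a hedged sketch; the features, model settings, thresholds, and the per-tree-spread trick for the interval are all invented here for illustration, not the firm's actual model:

```python
# Sketch of the idea described above: predict a historical latency range
# from order features and alert when a new observation falls outside it.
# Features, numbers, and model settings are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy historical data: [hour_of_day, exchange_id, order_volume] -> latency_us
X = rng.uniform([0, 0, 1], [24, 5, 10_000], size=(5_000, 3))
y = 50 + 2 * X[:, 1] + 0.001 * X[:, 2] + rng.normal(0, 5, size=5_000)

model = RandomForestRegressor(n_estimators=200, min_samples_leaf=20)
model.fit(X, y)

def latency_interval(features, lo_q=5, hi_q=95):
    """Spread of per-tree predictions as a rough stand-in for a historical interval."""
    per_tree = np.array([t.predict([features])[0] for t in model.estimators_])
    return np.percentile(per_tree, lo_q), np.percentile(per_tree, hi_q)

observed = 95.0  # latency just measured for an order with these features
lo, hi = latency_interval([14.5, 3, 2_500])
if not (lo <= observed <= hi):
    print(f"ALERT: {observed:.1f}us outside historical range [{lo:.1f}, {hi:.1f}]")
```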
If we ruled out changes on our side as the culprit, we would reach out to the exchange and talk to our sales rep there, who might also pull in networking, etc.
Some exchanges, EUREX comes to mind, were phenomenal at helping us identify issues. e.g. they once traced an increase to a cable they had swapped out for one a few feet longer than the old cable, and that's why the latency went up.
One day, it's IEX, of Flash Boys fame, that triggers an alert. Nothing changed on our side so we call them. We are going back and forth with the networking engineer and then the sales rep says, in almost hushed tones:
"Look, I've worked at other exchange so I get where you are coming from in asking these questions. Problem is, b/c of our founding ethos, we are actually not allowed to track our own internal latency so we really can't help you identify the root cause. I REALLY wish it was different."
I love this story b/c HN, as a technology focused site, often thinks all problems have technical solutions but sometimes it's actually a people or process solution.
Also, incentives and "philosophy of the founders" matter a lot too.
jshaqaw
This is interesting but also just hilarious at a meta level. I was a “low frequency”, i.e. manual, fundamentals-based hedge fund investor for many years. In general I think HFT is a net benefit to liquidity when done in compliance with the text and spirit of regulations. But no real-world allocation of resources is improved by having to game transactions at this level of time granularity. This is just society pouring resources down a zero-sum black hole. Open to hearing contrary views of course.
dmurray
This article both undersells and oversells the technical challenge exchanges solve.
First, it is of course possible to apply horizontal scaling through sharding. My order on Tesla doesn't affect your order on Apple, so it's possible to run each product on its own matching engine, its own set of gateways, etc. Most exchanges don't go this far: they might have one cluster for stocks starting A-E, etc. So they don't even exhaust the benefits available from horizontal scaling, partly because this would be expensive.
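As a toy sketch of that kind of range-based sharding (the cluster names and boundaries are invented, not any real venue's layout):

```python
# Toy illustration of range-based sharding: route each symbol to the
# cluster that owns its alphabetical range. Boundaries are invented.
SHARDS = {
    "cluster-1": ("A", "E"),
    "cluster-2": ("F", "M"),
    "cluster-3": ("N", "S"),
    "cluster-4": ("T", "Z"),
}

def shard_for(symbol: str) -> str:
    first = symbol[0].upper()
    for name, (lo, hi) in SHARDS.items():
        if lo <= first <= hi:
            return name
    raise ValueError(f"no shard owns {symbol!r}")

assert shard_for("AAPL") == "cluster-1"   # Apple and Tesla land on
assert shard_for("TSLA") == "cluster-4"   # different matching engines
```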
On the other hand, it's not just the sequencer that has to process all these events in strict order - which might make you think it's just a matter of returning a single increasing sequence number for every request. The matching engine which sits downstream of the sequencer also has to consume all the events and apply a much more complicated algorithm: the matching algorithm described in the article as "a pure function of the log".
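A minimal sketch of that split, with the matching step stubbed out (all the structures here are assumptions; the real matching algorithm is far more involved):

```python
# Minimal sketch of the sequencer / matching-engine split described above.
# The sequencer only assigns a monotonically increasing sequence number;
# the matcher folds over the sequenced log, so replaying the same log
# reproduces the same state. Everything here is illustrative.
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class SequencedEvent:
    seq: int
    payload: dict

class Sequencer:
    def __init__(self):
        self._next = itertools.count(1)
        self.log = []                      # the authoritative event log

    def ingest(self, payload: dict) -> SequencedEvent:
        ev = SequencedEvent(next(self._next), payload)
        self.log.append(ev)
        return ev

def match(state: dict, ev: SequencedEvent) -> dict:
    """Pure function of (state, event) -- stands in for the real matching algorithm."""
    state = dict(state)
    state["last_seq"] = ev.seq
    state.setdefault("open_orders", 0)
    state["open_orders"] += 1 if ev.payload["type"] == "new" else -1
    return state

seq = Sequencer()
for p in [{"type": "new"}, {"type": "new"}, {"type": "cancel"}]:
    seq.ingest(p)

# Replaying the log from scratch yields the identical state.
state = {}
for ev in seq.log:
    state = match(state, ev)
print(state)  # {'last_seq': 3, 'open_orders': 1}
```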
Components outside of that can generally be scaled more easily: for example, a gateway cares only about activity on the orders it originally received.
The article is largely correct that separating the sequencer from the matching engine allows you to recover if the latter crashes. But this may only be a theoretical benefit. Replaying and reprocessing a day's worth of messages takes a substantial fraction of the day, because the system is already operating close to its capacity. And after it crashed, you still need to figure out which customers think they got their orders executed, and allow them to cancel outstanding orders.
alexpotato
> Every modern exchange has a single logical sequencer. No matter how many gateways feed the system, all events flow into one component whose job is to assign the next sequence number. That integer defines the global timeline.
A notable edge case here is that if EVERYTHING (e.g. market data AND orders) goes through the sequencer, then you can, essentially, denial-of-service key parts of the trading flow.
e.g. one of the first exchanges to switch to a sequencer model was famous for having big market data bursts followed by huge order entry delays, because each order got stuck in the sequencer queue. In other words, the queue would be 99.99% market data with orders sprinkled in randomly.
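A back-of-the-envelope illustration of that head-of-line blocking; the numbers are invented:

```python
# Back-of-the-envelope illustration of the head-of-line blocking described
# above. All numbers are invented.
md_burst_events = 100_000     # market data events already sitting in the queue
service_time_ns = 500         # per-event processing time in the sequencer

# An order arriving at the tail of the burst waits for everything ahead of it.
order_delay_ms = md_burst_events * service_time_ns / 1_000_000
print(f"order entry delayed ~{order_delay_ms:.1f} ms behind the burst")  # ~50.0 ms
```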
Scubabear68
I wish the article had stuck with the technical topic at hand and left out the embellishment. In particular the opening piece talking about what is happening outside the exchange.
What happens outside the exchange really doesn’t matter. The ordering will not happen until it hits the exchange.
And that is why algorithmic traders want their algos in a closet as close to the exchange as possible, both physically and in terms of network hops.
croemer
Smells of AI writing: "Timestamps aren't enough. Exchanges need a stronger ordering primitive." etc
thijson
The article says it's not enough to accurately timestamp orders at the various order entry portals. I didn't understand why that's not enough.
GPS can provide fairly accurate timestamps. There are a few other systems, like GLONASS, as well for extra reliability.
cgio
The title is obviously the wrong way around: exchanges turn distributed logs into order books. The distributed part is a resilience decision, not essential to the design. Technically, writing to a disk would give persistence with less ability to recover, or with some potential gaps in the case of failure (remember there is a sequence published on the other end too, the market data feed). As noted in the article, the sequencer is a single-threaded, non-parallelisable process; distribution is just a configuration of that single-threaded path. Parallelisation is feasible to some extent by sharding across order books themselves, though dependencies between books may complicate this.
nick0garvey
> Pipelined replication: the sequencer assigns a sequence number immediately and ships the event to replicas in parallel. Matching doesn't wait for the replicas to acknowledge.
How does this avoid data loss if the lead sequencer goes down after acking but before the replica receives the write?
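To make the window concrete, here is a toy lead/replica sketch (entirely hypothetical, not any specific exchange's protocol):

```python
# Sketch of the failure window raised above, with a toy lead/replica pair.
# Entirely hypothetical -- just makes the race concrete.
class Replica:
    def __init__(self):
        self.log = []

class Lead:
    def __init__(self, replica):
        self.replica, self.seq, self.log = replica, 0, []

    def handle(self, order, crash_before_replicating=False):
        self.seq += 1
        self.log.append((self.seq, order))
        ack = f"ack seq={self.seq}"        # matcher/client sees this first
        if crash_before_replicating:
            return ack                     # lead dies; replica never hears of seq
        self.replica.log.append((self.seq, order))
        return ack

replica = Replica()
lead = Lead(replica)
lead.handle("order-1")
lead.handle("order-2", crash_before_replicating=True)

print("client saw acks for:", [s for s, _ in lead.log])      # [1, 2]
print("replica knows about:", [s for s, _ in replica.log])   # [1]
# seq=2 was acknowledged but is lost if the lead never comes back --
# exactly the gap a pipelined scheme has to plug some other way.
```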
rhodey
Always fun to read about HFT. If anyone wants to learn about the order book data structure, you can find a JS implementation here:
https://github.com/rhodey/limit-order-book
https://www.npmjs.com/package/limit-order-book
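For a rough sense of the core structure, here is a minimal price-time priority sketch in Python (purely illustrative, unrelated to the linked library):

```python
# Minimal price-time priority limit order book sketch (illustrative only).
from collections import deque

class LimitOrderBook:
    def __init__(self):
        self.bids = {}   # price -> deque of (order_id, qty); best bid = max price
        self.asks = {}   # price -> deque of (order_id, qty); best ask = min price

    def add(self, side, price, qty, order_id):
        book, opposite = (self.bids, self.asks) if side == "buy" else (self.asks, self.bids)
        crosses = (lambda p: p <= price) if side == "buy" else (lambda p: p >= price)
        # Match against the opposite side while prices cross, oldest orders first.
        while qty > 0 and opposite:
            best = min(opposite) if side == "buy" else max(opposite)
            if not crosses(best):
                break
            queue = opposite[best]
            while qty > 0 and queue:
                oid, resting_qty = queue[0]
                traded = min(qty, resting_qty)
                print(f"trade {traded} @ {best} against {oid}")
                qty -= traded
                if traded == resting_qty:
                    queue.popleft()
                else:
                    queue[0] = (oid, resting_qty - traded)
            if not queue:
                del opposite[best]
        # Whatever is left rests in the book at its limit price.
        if qty > 0:
            book.setdefault(price, deque()).append((order_id, qty))

book = LimitOrderBook()
book.add("sell", 101, 5, "s1")
book.add("sell", 100, 5, "s2")
book.add("buy", 100, 7, "b1")   # trades 5 @ 100 with s2, rests 2 @ 100
```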
How long can exchanges keep scaling their sequencer systems (which are inherently sequential) vertically? Trading volume keeps rising faster than low-latency tech advances.
8cvor6j844qw_d6
Very interesting. I wish I knew who the author is. The site doesn't seem to have readily available information about them.
HolyLampshade
I’m a tad late to the party, but it’s worth providing a little context to the technical conversation.
Of the many things trading platforms are attempting to do, the two most relevant here are the overall latency and, more importantly, where serialization occurs on the system.
Latency itself is only relevant as it applies to the “uncertainty” period where capital is tied up before the result of the instruction is acknowledged. Firms can only have so much capital at risk, and so these moments end up being little dead periods. So long as the latency is reasonably deterministic, though, it’s mostly inconsequential whether a platform takes 25us or 25ms to return an order acknowledgement (this is slightly more relevant in environments where there are potentially multiple venues to trade a product on, but in terms of global financial systems those environments are exceptions and not the norm). Latency is really only important when factored alongside some metric indicating a failure of business logic (failures to execute on aggressive orders or failures to cancel in time are two typical metrics).
The most important factor to many participants is where serialization occurs on the trading venue (what the initial portion of this blog is about: determining who was “first”). Usually this is resolved to the tune of 1-2ns (in some cases lower). There are diminishing returns, however, to making this absolute in physical terms. A small handful of venues have attempted to address serialization at the very edge of their systems, but the net result is just a change in how firms that are extremely sensitive to being first apply technical expertise to the problem.
Most “good” venues permit an amount of slop in their systems (usually to the tune of 5-10% of the overall latency), which reduces the benefit of playing the sorts of ridiculous games needed to be “first”. There ends up being a hard limit to the economic benefit of throwing man-hours and infrastructure at the problem.
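A toy illustration of how that kind of slop blunts the race to be first; the jitter figure and the mechanism are assumptions for illustration, not any particular venue's design:

```python
# Toy illustration of the "slop" point above: if the venue adds a small
# random delay relative to overall latency, a firm that is 50ns faster on
# the wire stops winning every race. All numbers are invented.
import random
random.seed(1)

BASE_LATENCY_NS = 10_000                  # overall one-way latency to the venue
SLOP_NS = int(BASE_LATENCY_NS * 0.08)     # ~8% random jitter

def arrival(extra_speed_ns):
    """Effective arrival time of a firm that is extra_speed_ns faster on the wire."""
    return BASE_LATENCY_NS - extra_speed_ns + random.uniform(0, SLOP_NS)

wins = sum(arrival(50) < arrival(0) for _ in range(100_000))
print(f"the 50ns-faster firm is first only ~{wins / 1000:.1f}% of the time")
```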