Decent article, but some remarks:

1) It seems these two statements conflict with each other:
> The oldest such transaction sets the cutoff—referred to as the "MVCC horizon." Until that transaction completes, every dead tuple newer than its snapshot is retained.
and
> For example, imagine three analytics queries, each running for 40 seconds, staggered 20 seconds apart. No individual query would trigger a timeout for running too long. But because one is always active, the horizon never advances, and the effect on vacuum is the same as one transaction that never ends.
If the three analytics *transactions* (it's transactions that matter, not queries, though with the subtlety that a transaction doesn't acquire its snapshot until it runs its first query) are started
at different times, they will have staggered snapshots, so once the first completes the horizon can advance and vacuum can make progress.
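To make the staggering point concrete, here is a toy model (my own illustration, not anything from the article): each transaction pins the horizon at its start time until it commits, and the horizon is the oldest start among still-open transactions. With 40-second transactions started every 20 seconds, the horizon does advance each time the oldest one finishes; it just stays 20–40 seconds behind "now" rather than never moving at all.

```python
# Toy model of the MVCC horizon under staggered long transactions.
# Each tuple is (start, commit) in seconds; values match the article's
# example of 40 s analytics transactions staggered 20 s apart.
txns = [(start, start + 40) for start in (0, 20, 40)]

def horizon(now):
    """Oldest start time among transactions still open at `now`.

    If nothing is open, the horizon catches up to `now`.
    """
    open_starts = [s for s, c in txns if s <= now < c]
    return min(open_starts, default=now)

assert horizon(10) == 0    # first transaction pins the horizon at t=0
assert horizon(45) == 20   # first committed at t=40; horizon advanced to 20
assert horizon(65) == 40   # second committed at t=60; horizon advanced to 40
```

So the horizon lags but is never frozen, which is the distinction the two quoted statements blur.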
2) Although the problem about this query:
SELECT * FROM jobs
WHERE status = 'pending'
ORDER BY run_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
having to consider dead tuples is a genuine concern and performance problem,
this can also be mitigated by adding a monotonically increasing column
and a `WHERE column > ?` clause, where `?` is the highest value you have
already processed, provided you have also added an index to make that
pagination efficient.
This way the index scan starts past the dead tuples, and they 'only' waste
space whilst waiting to be vacuumed, rather than also bogging down read
performance.
There is a little subtlety around how you guarantee that the column is
monotonically increasing, given concurrent writers, but the answer to that
depends on what tricks you can fit into your application.
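A minimal sketch of that watermark idea. SQLite is used here purely so the example runs anywhere; a real Postgres version would keep `FOR UPDATE SKIP LOCKED` and likely use a partial index on pending rows. The table shape and the `watermark` variable are my own assumptions, not from the article.

```python
import sqlite3

# Hypothetical jobs table: rows 1..1000 are already processed (in Postgres
# these would be the dead tuples clogging the low end of the index), row
# 1001 is the one pending job.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, run_at INTEGER)")
conn.executemany(
    "INSERT INTO jobs (status, run_at) VALUES (?, ?)",
    [("done", i) for i in range(1, 1001)] + [("pending", 1001)],
)
# Partial index so the pending scan doesn't walk processed rows at all.
conn.execute("CREATE INDEX jobs_pending ON jobs (id) WHERE status = 'pending'")

# The application tracks the highest id it has already claimed; the
# `id > ?` predicate lets the scan start past the old (dead) region
# instead of re-checking it on every dequeue.
watermark = 1000
row = conn.execute(
    "SELECT id FROM jobs WHERE status = 'pending' AND id > ? ORDER BY id LIMIT 1",
    (watermark,),
).fetchone()
```

The remaining design question, as noted above, is how the watermark column is kept monotonic under concurrent writers.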
3) I almost want to say that the one-line summary is 'Don't combine (very) long-running transactions with (very) high transaction rates in Postgres'
(Is this a fair representation?)
nine_k
In short:
* Postgres still has the same vacuum-horizon problem: a long-running query can block vacuuming of a quick-churning table. (The author uses a benchmark from 2015, when the problem was already well understood.)
* Stock Postgres still has no tools good enough against it.
* The author's company's special version of Postgres does have such tools; a few polite promotions of it are strewn across the article.
My conclusion: it's still not wise to mix long (OLAP-style) loads and quick-churning (queue-style) loads on the same Postgres instance. Running 0MQ or even RMQ may be an easier solution, depending on the requirements for the queue.
simeonGriggs
Yo! Author here, I’ll be around if anyone’s got questions!
sebmellen
Postgres can do so much. I see people choose Kafka and SQS for things that Graphile Worker could do all day long.
richwater
It would be nice if this ad at least explained a little bit of the technical side of the solution.