Lucene really does feel like magic sometimes. It was designed expressly to solve the top K problem at hyper scale. It's incredibly mature technology. You can go from zero to a billion documents without thinking too
much about anything other than the amount of mass storage you have available.
Every time I've used Lucene I have combined it with a SQL provider. It's not necessarily about one or the other. The FTS facilities within the various SQL providers are convenient, but not as capable by comparison. I don't think mixing these into the same thing makes sense. They are two very different animals that are better joined by way of the document ids.
jmgimeno
Maybe I'm wrong, but for this query:
SELECT *
FROM benchmark_logs
WHERE severity < 3
ORDER BY timestamp DESC
LIMIT 10;
this index
CREATE INDEX ON benchmark_logs (severity, timestamp);
cannot be used as proposed: "Postgres can jump directly to the portion of the tree matching severity < 3 and then walk the timestamps in descending order to get the top K rows."
Postgres with this index can walk to a part of the tree with severity < 3, but timestamps are sorted only for the same severity.
show comments
davidelettieri
The "But Wait, We Need Filters Too" paragraph mentions "US" filter which is introduced only later on.
show comments
h1fra
Postgres is really good at a lot of things, but it's very unfortunate that it's really bad at simple analytics. I wish there was a plugin instead of having to have N databases
Vadim_samokhin
Just in case, there is a btree_gin extension which can be used in queries combining gin-indexable column and btree-indexable column. It doesn’t solve top-K ordering problem though.
tacone
The issue here is the row based format. You simply can't filter on arbitrary columns with that. Either use an external warehouse or a columnar plug-in like Timescale.
Lucene really does feel like magic sometimes. It was designed expressly to solve the top K problem at hyper scale. It's incredibly mature technology. You can go from zero to a billion documents without thinking too much about anything other than the amount of mass storage you have available.
Every time I've used Lucene I have combined it with a SQL provider. It's not necessarily about one or the other. The FTS facilities within the various SQL providers are convenient, but not as capable by comparison. I don't think mixing these into the same thing makes sense. They are two very different animals that are better joined by way of the document ids.
Maybe I'm wrong, but for this query:
SELECT * FROM benchmark_logs WHERE severity < 3 ORDER BY timestamp DESC LIMIT 10;
this index
CREATE INDEX ON benchmark_logs (severity, timestamp);
cannot be used as proposed: "Postgres can jump directly to the portion of the tree matching severity < 3 and then walk the timestamps in descending order to get the top K rows."
Postgres with this index can walk to a part of the tree with severity < 3, but timestamps are sorted only for the same severity.
The "But Wait, We Need Filters Too" paragraph mentions "US" filter which is introduced only later on.
Postgres is really good at a lot of things, but it's very unfortunate that it's really bad at simple analytics. I wish there was a plugin instead of having to have N databases
Just in case, there is a btree_gin extension which can be used in queries combining gin-indexable column and btree-indexable column. It doesn’t solve top-K ordering problem though.
The issue here is the row based format. You simply can't filter on arbitrary columns with that. Either use an external warehouse or a columnar plug-in like Timescale.