This looks like a great addition to the Postgres ecosystem. When adding specialized extensions like this, it's always worth keeping an eye on how they impact overall system performance, especially memory usage and lock contention as the dataset grows. For anyone testing this out, I'd recommend using an open-source tool like *pgmetrics* (https://pgmetrics.io) to get a baseline and then monitor how the new indexes and search workloads affect your underlying metrics. It’s zero-dependency and gives you a very deep look into the internals without much overhead.
In september he burned through 3000$ in API credits though, but I think that's before we finally bought max plans for everyone that wanted it.
show comments
shreyssh
Nice work. pg_search has been on my radar for a while, having BM25 natively in Postgres instead of bolting on Elasticsearch is a huge DX win. Curious about the index build time on larger datasets though. I'm working with ~2M row tables and the bottleneck for most Postgres extensions I've tried isn't query speed, it's the initial indexing. Any benchmarks on that?
This is really cool. I've built things on PostgreSQL ts_vector() FTS in the past which works well but doesn't have whole-index ranking algorithms so can't do BM25.
It's a bit surprising to me that this doesn't appear to have a mechanism to say "filter for just documents matching terms X and Y, then sort by BM25 relevance" - it looks like this extension currently handles just the BM25 ranking but not the FTS filtering. Are you planning to address that in the future?
I found this example in the README quite confusing:
SELECT * FROM documents
WHERE content <@> to_bm25query('search terms', 'docs_idx') < -5.0
ORDER BY content <@> 'search terms'
LIMIT 10;
That -5.0 is a magic number which, based on my understanding of BM25, is difficult to predict in advance since the threshold you would want to pick varies for different datasets.
show comments
kev009
The "term positions" caveat seems like a major limitation for human oriented searches of logs or products or whatever. I don't see it mentioned in what's next, will you address it in some future release or is it out of scope for your intended use cases?
Especially since what Anthropic describes here is a bit of a rube Goldberg machine which also involves preprocessing (contextual summarization) and a reranking model, so I was wondering if there's any "good enough" out of the box solutions for it.
show comments
zephyrwhimsy
Input quality is almost always the actual bottleneck. Teams spend months tuning retrieval while feeding HTML boilerplate into their vector stores.
show comments
robotswantdata
“Just use Postgres” greybeards right again.
Looking forward to giving this a go soon
mattbessey
Please oh please let GCP add this to the supported managed Postgres extensions...
show comments
gmassman
Very exciting! Congrats on the release, this will be a huge benefit to all folks building RAG/rerank systems on top of Postgres. Looking forward to testing it out myself.
show comments
hmokiguess
I hope someone from Neon is watching this thread, definitely want to play with this asap
piskov
On a tangent note it’s amazing how hard it is to have a good case-insensitive search in Postgres.
In SQL Server you just use case-insensitive collation (which is a default) and add an index (it’s the only one non-clustered) and call it a day.
In postgres you need to go above and beyond just for that. It’s like postgres guys were “nah dog, everybody just uses lowercase; you don’t need to worry of people writing john doe as John Doe)”.
And don’t get me started with storing datetime with timezone (e.g “4/2/2007 7:23:57 PM -07:00“). In sql server you have datetimeoffset; in Postgres you fuck off :-)
show comments
landsman
Postgres is really amazing backbone for products. Love it.
Unical-A
Impressive benchmarks. How does the BM25 implementation handle high-frequency updates (writes) while maintaining search latency? Usually, there's a trade-off between ingest speed and search performance in Postgres-based full-text search.
show comments
maweaver
I've been doing some RAG prototypes with hybrid search using pg_textsearch plus pgvector and have been very pleased with the results. Happy to see a 1.0 release!
show comments
jackyliang
VERY excited about this, literally just looking to build hybrid search using Postgres FTS. When will this be available on Supabase?
show comments
diwank
we have been using pg_textsearch in production for a few weeks now, and it's been fairly stable and super speedy. we used to use paradedb (aka pg_search -- it's quite annoying that the two or so similarly named), but paradedb was extremely unstable, led to serious data corruption a bunch of times. in fact, before switching to pg_textsearch, we just switched over to plain trigram search coz paradedb was tanking our db so often...
also shoutout to tj for being super responsive on github issues!
timedude
When is this available on AWS in Aurora? Anyone from AWS here, add it pronto
bradfox2
Thank you!! Goodbye manticore if this works.
devmor
This is really cool to see! I've been using BM25+sqlite-vec for contextual search projects for a little while, it's a great performance addition.
This looks like a great addition to the Postgres ecosystem. When adding specialized extensions like this, it's always worth keeping an eye on how they impact overall system performance, especially memory usage and lock contention as the dataset grows. For anyone testing this out, I'd recommend using an open-source tool like *pgmetrics* (https://pgmetrics.io) to get a baseline and then monitor how the new indexes and search workloads affect your underlying metrics. It’s zero-dependency and gives you a very deep look into the internals without much overhead.
FWIW TJ is not your average vibe coder imo: https://www.linkedin.com/in/todd-j-green/
In september he burned through 3000$ in API credits though, but I think that's before we finally bought max plans for everyone that wanted it.
Nice work. pg_search has been on my radar for a while, having BM25 natively in Postgres instead of bolting on Elasticsearch is a huge DX win. Curious about the index build time on larger datasets though. I'm working with ~2M row tables and the bottleneck for most Postgres extensions I've tried isn't query speed, it's the initial indexing. Any benchmarks on that?
I just packaged it for pg-forge, nice stuff: https://github.com/wolfv/pg-forge
Now you can install it easily with `pixi`.
This is really cool. I've built things on PostgreSQL ts_vector() FTS in the past which works well but doesn't have whole-index ranking algorithms so can't do BM25.
It's a bit surprising to me that this doesn't appear to have a mechanism to say "filter for just documents matching terms X and Y, then sort by BM25 relevance" - it looks like this extension currently handles just the BM25 ranking but not the FTS filtering. Are you planning to address that in the future?
I found this example in the README quite confusing:
That -5.0 is a magic number which, based on my understanding of BM25, is difficult to predict in advance since the threshold you would want to pick varies for different datasets.The "term positions" caveat seems like a major limitation for human oriented searches of logs or products or whatever. I don't see it mentioned in what's next, will you address it in some future release or is it out of scope for your intended use cases?
P.S. been shipping it for a while https://www.freshports.org/databases/pg_textsearch/ :)
Can you explain this in more detail? Is this for RAG, i.e. combining vector search with keyword search?
My knowledge on that subject roughly begins and ends with this excellent article, so I'd love to hear how this relates to that.
https://www.anthropic.com/engineering/contextual-retrieval
Especially since what Anthropic describes here is a bit of a rube Goldberg machine which also involves preprocessing (contextual summarization) and a reranking model, so I was wondering if there's any "good enough" out of the box solutions for it.
Input quality is almost always the actual bottleneck. Teams spend months tuning retrieval while feeding HTML boilerplate into their vector stores.
“Just use Postgres” greybeards right again. Looking forward to giving this a go soon
Please oh please let GCP add this to the supported managed Postgres extensions...
Very exciting! Congrats on the release, this will be a huge benefit to all folks building RAG/rerank systems on top of Postgres. Looking forward to testing it out myself.
I hope someone from Neon is watching this thread, definitely want to play with this asap
On a tangent note it’s amazing how hard it is to have a good case-insensitive search in Postgres.
In SQL Server you just use case-insensitive collation (which is a default) and add an index (it’s the only one non-clustered) and call it a day.
In postgres you need to go above and beyond just for that. It’s like postgres guys were “nah dog, everybody just uses lowercase; you don’t need to worry of people writing john doe as John Doe)”.
And don’t get me started with storing datetime with timezone (e.g “4/2/2007 7:23:57 PM -07:00“). In sql server you have datetimeoffset; in Postgres you fuck off :-)
Postgres is really amazing backbone for products. Love it.
Impressive benchmarks. How does the BM25 implementation handle high-frequency updates (writes) while maintaining search latency? Usually, there's a trade-off between ingest speed and search performance in Postgres-based full-text search.
I've been doing some RAG prototypes with hybrid search using pg_textsearch plus pgvector and have been very pleased with the results. Happy to see a 1.0 release!
VERY excited about this, literally just looking to build hybrid search using Postgres FTS. When will this be available on Supabase?
we have been using pg_textsearch in production for a few weeks now, and it's been fairly stable and super speedy. we used to use paradedb (aka pg_search -- it's quite annoying that the two or so similarly named), but paradedb was extremely unstable, led to serious data corruption a bunch of times. in fact, before switching to pg_textsearch, we just switched over to plain trigram search coz paradedb was tanking our db so often...
also shoutout to tj for being super responsive on github issues!
When is this available on AWS in Aurora? Anyone from AWS here, add it pronto
Thank you!! Goodbye manticore if this works.
This is really cool to see! I've been using BM25+sqlite-vec for contextual search projects for a little while, it's a great performance addition.
interesting documents:
https://timescale.github.io/pg_textsearch/benchmarks/compari...
https://github.com/timescale/pg_textsearch/blob/main/ROADMAP...
> ParadeDB, is guarded behind AGPL
What a wonderful ad for ParadeDB, and clear signal that "TigerData" is a pernicious entity.