Show HN: Postgres extension for BM25 relevance-ranked full-text search

i_have_to_speak

This looks like a great addition to the Postgres ecosystem. When adding specialized extensions like this, it's always worth keeping an eye on how they impact overall system performance, especially memory usage and lock contention as the dataset grows. For anyone testing this out, I'd recommend using an open-source tool like *pgmetrics* (https://pgmetrics.io) to get a baseline and then monitor how the new indexes and search workloads affect your underlying metrics. It’s zero-dependency and gives you a very deep look into the internals without much overhead.

show comments

jascha_eng

FWIW TJ is not your average vibe coder imo: https://www.linkedin.com/in/todd-j-green/

In september he burned through 3000$ in API credits though, but I think that's before we finally bought max plans for everyone that wanted it.

show comments

shreyssh

Nice work. pg_search has been on my radar for a while, having BM25 natively in Postgres instead of bolting on Elasticsearch is a huge DX win. Curious about the index build time on larger datasets though. I'm working with ~2M row tables and the bottleneck for most Postgres extensions I've tried isn't query speed, it's the initial indexing. Any benchmarks on that?

show comments

droelf

I just packaged it for pg-forge, nice stuff: https://github.com/wolfv/pg-forge

Now you can install it easily with `pixi`.

simonw

This is really cool. I've built things on PostgreSQL ts_vector() FTS in the past which works well but doesn't have whole-index ranking algorithms so can't do BM25.

It's a bit surprising to me that this doesn't appear to have a mechanism to say "filter for just documents matching terms X and Y, then sort by BM25 relevance" - it looks like this extension currently handles just the BM25 ranking but not the FTS filtering. Are you planning to address that in the future?

I found this example in the README quite confusing:

  SELECT * FROM documents
  WHERE content <@> to_bm25query('search terms', 'docs_idx') < -5.0
  ORDER BY content <@> 'search terms'
  LIMIT 10;

That -5.0 is a magic number which, based on my understanding of BM25, is difficult to predict in advance since the threshold you would want to pick varies for different datasets.

show comments

kev009

The "term positions" caveat seems like a major limitation for human oriented searches of logs or products or whatever. I don't see it mentioned in what's next, will you address it in some future release or is it out of scope for your intended use cases?

P.S. been shipping it for a while https://www.freshports.org/databases/pg_textsearch/ :)

andai

Can you explain this in more detail? Is this for RAG, i.e. combining vector search with keyword search?

My knowledge on that subject roughly begins and ends with this excellent article, so I'd love to hear how this relates to that.

https://www.anthropic.com/engineering/contextual-retrieval

Especially since what Anthropic describes here is a bit of a rube Goldberg machine which also involves preprocessing (contextual summarization) and a reranking model, so I was wondering if there's any "good enough" out of the box solutions for it.

show comments

zephyrwhimsy

Input quality is almost always the actual bottleneck. Teams spend months tuning retrieval while feeding HTML boilerplate into their vector stores.

show comments

robotswantdata

“Just use Postgres” greybeards right again. Looking forward to giving this a go soon

mattbessey

Please oh please let GCP add this to the supported managed Postgres extensions...

show comments

gmassman

Very exciting! Congrats on the release, this will be a huge benefit to all folks building RAG/rerank systems on top of Postgres. Looking forward to testing it out myself.

show comments

hmokiguess

I hope someone from Neon is watching this thread, definitely want to play with this asap

piskov

On a tangent note it’s amazing how hard it is to have a good case-insensitive search in Postgres.

In SQL Server you just use case-insensitive collation (which is a default) and add an index (it’s the only one non-clustered) and call it a day.

In postgres you need to go above and beyond just for that. It’s like postgres guys were “nah dog, everybody just uses lowercase; you don’t need to worry of people writing john doe as John Doe)”.

And don’t get me started with storing datetime with timezone (e.g “4/2/2007 7:23:57 PM -07:00“). In sql server you have datetimeoffset; in Postgres you fuck off :-)

show comments

landsman

Postgres is really amazing backbone for products. Love it.

Unical-A

Impressive benchmarks. How does the BM25 implementation handle high-frequency updates (writes) while maintaining search latency? Usually, there's a trade-off between ingest speed and search performance in Postgres-based full-text search.

show comments

maweaver

I've been doing some RAG prototypes with hybrid search using pg_textsearch plus pgvector and have been very pleased with the results. Happy to see a 1.0 release!

show comments

jackyliang

VERY excited about this, literally just looking to build hybrid search using Postgres FTS. When will this be available on Supabase?

show comments

diwank

we have been using pg_textsearch in production for a few weeks now, and it's been fairly stable and super speedy. we used to use paradedb (aka pg_search -- it's quite annoying that the two or so similarly named), but paradedb was extremely unstable, led to serious data corruption a bunch of times. in fact, before switching to pg_textsearch, we just switched over to plain trigram search coz paradedb was tanking our db so often...

also shoutout to tj for being super responsive on github issues!

timedude

When is this available on AWS in Aurora? Anyone from AWS here, add it pronto

bradfox2

Thank you!! Goodbye manticore if this works.

devmor

This is really cool to see! I've been using BM25+sqlite-vec for contextual search projects for a little while, it's a great performance addition.

esafak

interesting documents:

https://timescale.github.io/pg_textsearch/benchmarks/compari...

https://github.com/timescale/pg_textsearch/blob/main/ROADMAP...

gplprotects

> ParadeDB, is guarded behind AGPL

What a wonderful ad for ParadeDB, and clear signal that "TigerData" is a pernicious entity.

show comments