The point about synthetic query generation is a good one. We found users wrote very poor queries, so we initially had the LLM generate a synthetic query. But the results could vary widely depending on the specific synthetic query it generated, so we now have it create three variants (all in one LLM call, so you can prompt it to generate a wide variety instead of getting three very similar ones back), run the searches in parallel, and then use reciprocal rank fusion to combine the lists into a set of broadly strong performers. For the searches we use hybrid dense + sparse BM25, since dense retrieval doesn't work well for technical terms.
This, combined with a subsequent reranker, basically eliminated our search issues.
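For reference, a minimal sketch of the reciprocal rank fusion step described above, in Python. It assumes each search (one per query variant, or one per retriever such as dense and BM25) returns an ordered list of chunk IDs; the IDs and the k=60 constant are illustrative, not from the comment.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking."""
    scores = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            # Each list contributes 1/(k + rank); items ranked highly
            # in several lists accumulate the largest fused score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three query variants searched in parallel, then fused:
fused = reciprocal_rank_fusion([
    ["c3", "c1", "c7"],  # results for variant 1
    ["c1", "c2", "c3"],  # results for variant 2
    ["c9", "c3", "c1"],  # results for variant 3
])
```

The fused list is what then goes to the reranker mentioned above.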
bityard
I must be missing something: this says it can be self-hosted, but the first page of the self-hosting docs says you need accounts with no fewer than six (!) other third-party hosted services.
We have very different ideas about the meaning of self-hosted.
daemonologist
I concur:
The big LLM-based rerankers (e.g. Qwen3-reranker) are what you always wanted your cross-encoder to be, and I highly recommend giving them a try. Unfortunately they're also quite computationally expensive.
Your metadata/tabular data often contains basic facts that a human takes for granted but that aren't repeated in every text chunk; injecting it can help a lot in making the end model seem less clueless (a sketch of the idea follows this comment).
The point about queries that don't work with simple RAG (like "summarize the most recent twenty documents") is very important to keep in mind. We made our UI very search-oriented and deemphasized the chat, to try to communicate to users that search is what's happening under the hood - the model only sees what you see.
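A minimal sketch of the metadata-injection point above, assuming the metadata lives in a simple dict per document; the field names and header wording are made up for illustration.

```python
def inject_metadata(chunk_text, metadata):
    """Prepend document-level facts to a chunk before embedding / prompting."""
    header = " | ".join(f"{key}: {value}" for key, value in metadata.items())
    return f"[{header}]\n{chunk_text}"

chunk = inject_metadata(
    "The retention valve must be replaced every 500 hours.",
    {"document": "Maintenance Manual X200", "revision": "2023-04", "section": "4.2"},
)
# The enriched text is what gets embedded and what the LLM sees at answer time,
# so facts like the document title and revision are never lost between chunks.
```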
pietz
My biggest RAG learning is to use agentic RAG. (Sorry for buzzword dropping)
- Classic RAG: `User -> Search -> LLM -> User`
- Agentic RAG: `User <-> LLM <-> Search`
Essentially, instead of having a fixed pipeline, you provide search as a tool to the LLM, which does three things:
- The LLM can search multiple times
- The LLM can adjust the search query
- The LLM can use multiple tools
The combination of these three has solved the majority of classic RAG problems: the LLM improves user queries, it can resolve abbreviations, it can correct bad results on its own, and you can also let it list directories and load files directly.
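A hedged sketch of that loop, using the OpenAI chat-completions tool-calling API as one concrete way to wire it up (the commenter doesn't name a stack); `search_index` is a stand-in for whatever hybrid search you already run, and the model and question strings are illustrative.

```python
import json
from openai import OpenAI

client = OpenAI()

def search_index(query: str) -> str:
    # Placeholder: run your dense + BM25 search and return the top chunks as text.
    return "...top chunks for: " + query

tools = [{
    "type": "function",
    "function": {
        "name": "search_index",
        "description": "Search the document index and return relevant chunks.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "What does clause 7.3 say about termination?"}]
while True:
    response = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
    message = response.choices[0].message
    if not message.tool_calls:      # the model decided it has enough context to answer
        print(message.content)
        break
    messages.append(message)        # keep the tool-call turn in the conversation
    for call in message.tool_calls:
        query = json.loads(call.function.arguments)["query"]
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": search_index(query),  # the model may rewrite the query and search again
        })
```

The loop is what buys the three properties above: the model can call `search_index` repeatedly, with queries it rewrites itself, and you can register more tools (list directories, load a file) alongside it.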
js98
Here's a similar writeup I did about 1.5 years ago on processing millions of (technical) pages for RAG; a lot seems to have stayed the same: https://jakobs.dev/learnings-ingesting-millions-pages-rag-az...
Does anyone know how to do versioning for embeddings? Let's say I want to update/upsert my data and serve v6 of the domain data instead of v1, or filter for data within a specified date range. I'm thinking of exploring prepending context to chunks.
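One common answer (not from the post) is to store a version tag and ingestion date as metadata on every vector and filter on them at query time, rather than encoding the version into the chunk text. A toy sketch against an in-memory list; the same idea maps onto any vector store's metadata filters.

```python
from datetime import date

# Each entry carries its embedding plus version/date metadata.
index = [
    {"chunk": "…", "embedding": [0.1, 0.3], "version": "v6", "ingested": date(2024, 5, 2)},
    {"chunk": "…", "embedding": [0.2, 0.1], "version": "v1", "ingested": date(2023, 1, 9)},
]

def candidates(version=None, start=None, end=None):
    """Yield only the vectors that match the requested version / date range."""
    for item in index:
        if version and item["version"] != version:
            continue
        if start and item["ingested"] < start:
            continue
        if end and item["ingested"] > end:
            continue
        yield item  # similarity search then runs only over these survivors

latest_only = list(candidates(version="v6"))
```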
hatmanstack
Not here to shill for AWS, but S3 Vectors is hands down the SOTA here. Combined with a Bedrock Knowledge Base to handle discovery/rebalance tasks, it makes for the simplest implementation on the market.
Once Bedrock KB backed by S3 Vectors is out of beta, it'll eat everybody's lunch.
urbandw311er
For somebody thinking of building or paying for such a RAG system, would a workable solution be:
* upload documents via API into a Google Workspace folder
* use some sort of Google AI search API on the documents in that folder
…placing documents for different customers into different folders?
Or the Azure equivalent, whatever that is.
leetharris
Embedding-based RAG will always be just OK at best. It's useful for small parts of a chain or for tech demos, but in real-life use it will always falter.
n_u
> Reranking: the highest value 5 lines of code you'll add. The chunk ranking shifted a lot. More than you'd expect. Reranking can many times make up for a bad setup if you pass in enough chunks. We found the ideal reranker set-up to be 50 chunk input -> 15 output.
What is re-ranking in the context of RAG? Why not just show the code if it’s only 5 lines?
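Not the post's five lines, but reranking in RAG usually looks roughly like this: retrieve generously (say 50 chunks) with vector/keyword search, then score each (query, chunk) pair with a cross-encoder and keep only the best 15 for the prompt. The model name below is a common public checkpoint, not necessarily what the author used.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_k=15):
    """Score every chunk against the query and keep the top_k best."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

candidate_chunks = [
    "Clause 7.3 requires 30 days' written notice before termination.",
    "The appendix lists regional office addresses.",
]
top_chunks = rerank("termination notice period", candidate_chunks)  # 50 in, 15 out in the post's setup
```

The first-stage retriever optimizes for recall; the cross-encoder, which reads the query and chunk together, fixes the ordering before the chunks reach the LLM.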
max002
Great post, gonna be super useful for me :)
mattfrommars
Great read.
But how do people land opportunities to work on exciting projects like the author did? I've been trying to get into legal tech in the LLM space, but I've been unsuccessful.
Has anyone here successfully transitioned into the legal space? My gut has always been that legal is the space where LLMs can really be useful, second only to programming.
captainregex
How much of a hit would you take on quality if you moved the processing local? Have you experimented with that? I don't think LlamaIndex has a local option, sadly.
pietz
I find it interesting that so many services and tools were evaluated, except for embedding models. I would have thought that's one of the biggest levers.
jascha_eng
I have a RAG setup that doesn't work on documents but on other data points that we use for generation (the original data is call recordings, but it's heavily processed down to just a few text chunks).
Instead of a reranker model, we do vector search and then simply ask GPT-5 in an extra call which of the results is most relevant to the input question. Is there an advantage to actual reranker models over using a generic LLM?
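Roughly the pattern this comment describes, as I read it: one extra chat call that picks the most relevant hit after vector search. The prompt wording and helper name are invented; production code would validate the model's reply instead of trusting `int()`.

```python
from openai import OpenAI

client = OpenAI()

def pick_most_relevant(question, results):
    """Ask a generic chat model to choose the best search result by number."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(results, start=1))
    response = client.chat.completions.create(
        model="gpt-5",  # the model named in the comment; any capable chat model works
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nCandidates:\n{numbered}\n\n"
                       "Reply with only the number of the most relevant candidate.",
        }],
    )
    return results[int(response.choices[0].message.content.strip()) - 1]
```

The usual trade-off versus a dedicated reranker: a cross-encoder scores every pair cheaply in one batch and gives scores you can cut at top-k, while the LLM call is more flexible but slower, pricier, and limited by how many candidates fit in the prompt.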
torrmal
We have been trying to make it so that people don't have to reinvent the wheel over and over again: a very straightforward, batteries-included setup that can scale to many millions of documents, combining the best of RAG with traditional search and parametric search.
https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/overvie...
Would love your feedback.
manishsharan
Thanks for sharing. TIL about rerankers.
Chunking strategy is a big issue. I got acceptable results by shoving large texts into Gemini Flash and having it summarize and extract chunks, instead of whatever text splitter I tried. I use the method published by Anthropic (https://www.anthropic.com/engineering/contextual-retrieval), i.e. include the full summary along with the chunks for each embedding (a rough sketch of the idea follows this comment).
I also created a tool that lets the LLM run vector search on its own.
I don't use LangChain or Python; I use Clojure + the LLMs' REST APIs.
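A bare-bones sketch of the contextual-retrieval idea referenced above: each chunk is embedded together with a short document-level summary so it makes sense on its own. The summary is hard-coded here; in the commenter's setup it comes from a Gemini Flash pass over the full text.

```python
# Hypothetical document summary; in practice an LLM generates this per document.
document_summary = (
    "Service agreement between Acme and Beta Corp covering support tiers, "
    "SLAs and termination terms."
)

def contextualize(chunk_text):
    """Attach the document-level context to a chunk before embedding it."""
    return f"Document context: {document_summary}\n\nChunk: {chunk_text}"

to_embed = [contextualize(c) for c in [
    "Termination requires 30 days' written notice.",
    "Severity-1 incidents have a 1-hour response SLA.",
]]
```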
bitpatch
Really solid write-up; it's rare to see someone break down the real tradeoffs of scaling RAG beyond the toy examples. The bit about reranking and chunking saving more than fancy LLM tricks hits home for me.
osigurdson
Speaking of embedding models, OpenAI's are getting a little long in the tooth at this stage.
whinvik
Does anybody know what is meant by 'injecting relevant metadata'? Where is it injected?
383toast
They should've tested other embedding models; there are better (and cheaper) ones than OpenAI's.
alexchantavy
> What moved the needle: Query Generation
What does query generation mean in this context? It's probably not SQL queries, right?
dcreater
Do you still use LangChain/LlamaIndex for other agents/AI use cases?
> LLM: GPT 4.1 -> GPT 5 -> GPT 4.1, covered by Azure credits
What's this round trip about? Also, the chronology of the LLM (4.1) doesn't match the rest of the stack (text-embedding-3-large); it feels weird.
> Chunking Strategy: this takes a lot of effort, you'll probably be spending most of your time on it
Could you share more about chunking strategies you used?
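Not the post's strategy, but as a point of reference, the usual baseline is fixed-size chunks with overlap; heading-aware splitting, LLM-assisted chunking, and contextual summaries are refinements on top of this. The sizes below are arbitrary.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into fixed-size character chunks with overlap between neighbours."""
    assert overlap < size  # otherwise the window never advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```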
They say the chunker is the most important part, but theirs looks rudimentary: https://github.com/agentset-ai/agentset/blob/main/packages/e...
That is, there is nothing here that one could not easily write without a library.
Nice app bro https://usul.ai/chat/VgnzXjlRdljIDMBVCfqiy
Exactly what kind of processing was done? Your pipeline is a function of the use case, lest you overengineer…