We are always looking for representations that can capture the meaning of information. However, most representations that compress information for retrieval are lossy; embeddings are one example. Similar to the no-free-lunch theorem, no lossy compression method is universally better than another, since downstream tasks may depend on the specific information that gets lost. The question is therefore not which representation is perfect, but which representation is better aligned with the AI system that consumes it. Because AI evolves rapidly, it is difficult to predict the limitations of the next generation of LLMs, so a good representation for information retrieval in future LLM systems should stay close to how humans represent knowledge.
When a human tries to retrieve information in a library, they first locate a book by category or with a metadata keyword search, then open the table of contents (ToC) to find the relevant section, and repeat this process as needed. I believe the future of AI retrieval systems should mimic this process. The recently popular PageIndex approach (see this discussion: https://news.ycombinator.com/item?id=45036944) also belongs to this category, as it generates a table-of-contents-like tree for LLMs to reason over. It is still a form of lossy compression, so it has limitations of its own, but it is the closest approach to how humans perform retrieval.
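To make the analogy concrete, here is a minimal sketch of that hierarchical, ToC-driven retrieval loop. This is not the PageIndex implementation, just a generic illustration; the choose_child function is a placeholder (naive keyword overlap) where a real system would put an LLM prompt or a reranker.

from dataclasses import dataclass, field

@dataclass
class Node:
    title: str                      # section heading shown to the model
    children: list = field(default_factory=list)
    content: str = ""               # leaf nodes carry the actual text

def choose_child(query: str, children: list) -> "Node":
    # Placeholder relevance scorer: in a real system this would be an LLM
    # prompt ("which section is most relevant to the query?") or a reranker.
    # Naive keyword overlap is used here only so the sketch runs end to end.
    def overlap(node):
        return len(set(query.lower().split()) & set(node.title.lower().split()))
    return max(children, key=overlap)

def toc_retrieve(query: str, node: Node) -> str:
    # Walk the ToC tree the way a reader would: pick a chapter, then a
    # section, repeating until we reach a leaf that holds actual content.
    while node.children:
        node = choose_child(query, node.children)
    return node.content

# Tiny example "library"
book = Node("Handbook", [
    Node("Vector search", [
        Node("Dense embeddings", content="Dense retrieval compresses text into fixed-size vectors..."),
        Node("Sparse methods", content="BM25 and SPLADE score documents over a vocabulary-sized space..."),
    ]),
    Node("Hierarchical retrieval", [
        Node("Table-of-contents trees", content="ToC-based retrieval lets an LLM navigate section headings..."),
    ]),
])

print(toc_retrieve("table of contents tree retrieval", book))

The point of the structure is that the retriever never compares the query against every chunk; at each step it only reasons over the handful of headings at the current level, the same way a reader narrows down shelf, book, chapter, then section.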
lunarmony
Researchers have discussed the limitations of vector-based retrieval from a rank perspective, in various forms, for a few years. It has further been shown that better alternatives exist: some low-rank approaches can theoretically approximate arbitrary high-rank distributions while permitting MIPS-level efficient inference (see, e.g., Retrieval with Learned Similarities, https://arxiv.org/abs/2407.15462). Such solutions are already being used in production at Meta and at LinkedIn.
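For readers who do not want to open the paper: as I understand it, the learned similarity there is a mixture of inner products over several component embeddings, combined with query- and item-dependent gating weights (Mixture-of-Logits). The toy NumPy sketch below shows only the scoring shape of that idea; the gating here is a simplified softmax over the component logits, not the paper's learned gating network, and all dimensions are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

# Toy setup: P component embeddings of dimension d_p per query/item.
P, d_p, n_items = 4, 16, 1000
item_comps = rng.normal(size=(n_items, P, d_p))   # stand-in for g_p(x) per item
query_comps = rng.normal(size=(P, d_p))           # stand-in for f_p(q) for one query

def mol_score(query_comps, item_comps):
    # Mixture-of-logits style similarity (simplified): per-component inner
    # products combined with softmax gating weights derived from those same
    # logits. The real gating network is learned; this is for illustration.
    logits = np.einsum("pd,npd->np", query_comps, item_comps)   # (n_items, P)
    shifted = logits - logits.max(axis=1, keepdims=True)        # stable softmax
    gates = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return (gates * logits).sum(axis=1)                         # (n_items,)

scores = mol_score(query_comps, item_comps)
top10 = np.argsort(scores)[-10:][::-1]
print(top10)

Presumably the MIPS-level efficiency comes from each component score still being an inner product, so standard MIPS indexes can be reused for candidate generation before the mixed score is applied.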
Straw
In the theoretical section, they fit a polynomial at around 40 dimensions and extrapolate it out to thousands. Why trust a polynomial fit to extrapolate across two orders of magnitude? And why do we think the relationship is polynomial rather than exponential in the first place? Most quantities like this grow exponentially with dimension.
In fact, I think we can do it in d=2k dimensions, if we're willing to have arbitrarily precise query vectors.
Embed our points as (sin(theta), cos(theta), sin(2*theta), cos(2*theta), ..., sin(k*theta), cos(k*theta)), with theta uniformly spaced around the circle, and we should be able to select any k of them.
Using a few more dimensions we can then ease the precision requirements on the query.
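A quick numerical check of that construction (my own sketch, not from the paper): embed n points with the trig-moment map above, and for any chosen subset S of size k take the degree-k trigonometric polynomial -prod_{i in S} sin^2((theta - theta_i)/2). It is zero exactly at the chosen points and strictly negative everywhere else, so its Fourier coefficients, minus the constant term (which only shifts every score equally), give a query vector whose top-k inner products land exactly on the chosen subset.

import numpy as np

def circle_embeddings(n, k):
    # Embed n points on the circle as the 2k-dim trig-moment vectors
    # (sin t, cos t, sin 2t, cos 2t, ..., sin kt, cos kt).
    thetas = 2 * np.pi * np.arange(n) / n
    cols = []
    for j in range(1, k + 1):
        cols += [np.sin(j * thetas), np.cos(j * thetas)]
    return thetas, np.stack(cols, axis=1)          # shape (n, 2k)

def selecting_query(thetas, subset, k):
    # Query whose top-k inner products hit exactly `subset`.
    # f(t) = -prod_{i in subset} sin^2((t - theta_i)/2) is a degree-k trig
    # polynomial: 0 at the chosen points, strictly negative elsewhere.
    grid = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
    f = -np.prod([np.sin((grid - thetas[i]) / 2) ** 2 for i in subset], axis=0)
    # Fit f onto the basis {1, sin t, cos t, ..., sin kt, cos kt}.
    basis = [np.ones_like(grid)]
    for j in range(1, k + 1):
        basis += [np.sin(j * grid), np.cos(j * grid)]
    coefs, *_ = np.linalg.lstsq(np.stack(basis, axis=1), f, rcond=None)
    return coefs[1:]      # drop the constant term: it shifts all scores equally

n, k = 50, 5
thetas, E = circle_embeddings(n, k)
subset = [3, 11, 27, 40, 49]
q = selecting_query(thetas, subset, k)
top_k = set(np.argsort(E @ q)[-k:])
print(top_k == set(subset))   # True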
gdiamos
Their idea is that the capacity of even 4096-dimensional vectors limits their performance.
Sparse models like BM25 have a huge dimension and thus don’t suffer from this limit, but they don’t capture semantics and can’t follow instructions.
It seems like the holy grail is a sparse semantic model. I wonder how SPLADE would do?
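For contrast, here is a toy scorer using the standard Okapi BM25 formula (my own sketch, not from the article). The implicit vector space is the whole vocabulary, which is the "huge dimension" being referred to, but scoring only fires on exact term matches, which is the lack-of-semantics problem: the query "cat" scores zero against both "cats" and "automobile".

import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Toy Okapi BM25: every vocabulary term is its own dimension, so the
    # implicit vector is as wide as the vocabulary but only non-zero for
    # terms that literally occur -- huge dimension, no semantics.
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue                      # exact-match only: "cat" != "cats"
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = ["the cat sat on the mat", "dogs chase cats", "an automobile is a car"]
print(bm25_scores("cat", docs))   # only the literal token "cat" matches

SPLADE's pitch, roughly, is to keep that vocabulary-sized sparse space but let a language model assign weights to terms the text never literally contains, which is why it is a natural candidate for a "sparse semantic model".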
ArnavAgrawal03
We used multi-vector models at Morphik, and I can confirm their real-world effectiveness, especially compared with dense-vector retrieval.
simne
Could somebody suggest a good introduction to simulating complex behavior with neural networks?
I mean, I have heard about experiments running Turing machine simulations on neural networks, or even simulating some physics on them, but I have not seen any good survey of these topics, and they seem like a very interesting subject.
zwaps
How does this look with ColBERT-style late-interaction embeddings?
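For reference, ColBERT's late interaction keeps one embedding per token on both sides and scores with MaxSim: the score of a document is the sum, over query tokens, of each token's best similarity against any document token. It sidesteps part of the single-vector capacity issue because a document is represented by a whole matrix of token vectors rather than one point, at the cost of a much larger index. A minimal NumPy sketch, with random unit vectors standing in for real token embeddings:

import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    # ColBERT-style late interaction: no pooling into a single vector;
    # score(q, d) = sum over query tokens of the max similarity against
    # any document token (MaxSim).
    sims = query_vecs @ doc_vecs.T          # (n_query_tokens, n_doc_tokens)
    return sims.max(axis=1).sum()

rng = np.random.default_rng(0)
def normed(shape):                          # unit-norm stand-in token embeddings
    v = rng.normal(size=shape)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

query = normed((8, 128))                                        # 8 query tokens, 128-dim
docs = [normed((rng.integers(20, 60), 128)) for _ in range(5)]  # variable-length docs
scores = [maxsim_score(query, d) for d in docs]
print(int(np.argmax(scores)), [round(s, 2) for s in scores])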