The biggest problem with retrieval is actually semantic relevance. I think most embedding models don't really capture sentence-level semantics and instead act more like bag-of-words models, averaging local word-level information.
Consider this simple test I’ve been running:
Anchor: “A background service listens to a task queue and processes incoming data payloads using a custom rules engine before persisting output to a local SQLite database.”
Option A (Lexical Match): “A background service listens to a message queue and processes outgoing authentication tokens using a custom hash function before transmitting output to a local SQLite database.”
Option B (Semantic Match): “An asynchronous worker fetches jobs from a scheduling channel, transforms each record according to a user-defined logic system, and saves the results to an embedded relational data store on disk.”
Any decent LLM (e.g., Gemini 2.5 Pro, GPT-4/5) immediately recognizes that the Anchor and Option B describe the same concept, just in different words. But when I test embedding models like gemini-embedding-001 (currently top of MTEB), they consistently rate Option A as more similar, measured by cosine similarity. They’re getting tricked by surface-level word overlap.
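To make the failure concrete, here is a minimal sketch of the check using sentence-transformers, with all-MiniLM-L6-v2 as a stand-in model (an assumption for illustration; this is not the gemini-embedding-001 setup, and any embedding model can be swapped in):

    # Minimal triplet check: does the lexical match outscore the semantic one?
    # "all-MiniLM-L6-v2" is a stand-in model chosen for illustration.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    anchor = ("A background service listens to a task queue and processes "
              "incoming data payloads using a custom rules engine before "
              "persisting output to a local SQLite database.")
    lexical = ("A background service listens to a message queue and processes "
               "outgoing authentication tokens using a custom hash function "
               "before transmitting output to a local SQLite database.")
    semantic = ("An asynchronous worker fetches jobs from a scheduling channel, "
                "transforms each record according to a user-defined logic system, "
                "and saves the results to an embedded relational data store on disk.")

    # Normalized embeddings make cosine similarity a plain dot product.
    emb = model.encode([anchor, lexical, semantic], normalize_embeddings=True)
    print("anchor vs lexical: ", util.cos_sim(emb[0], emb[1]).item())
    print("anchor vs semantic:", util.cos_sim(emb[0], emb[2]).item())
    # The triplet "fails" when the lexical line prints the higher score.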
I put together a small GitHub repo that uses ChatGPT to generate and test these “semantic triplets”:
https://github.com/semvec/embedstresstest
gemini-embedding-001 (currently #1 on the MTEB leaderboard) scored close to 0% on these adversarial examples.
The repo is unpolished at the moment, but it gets the idea across and everything is reproducible.
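For reference, the score can be read as the fraction of triplets where the semantic match outscores the lexical one. The repo’s actual harness may differ; this is just the rough shape of it, reusing the model object from the sketch above:

    def triplet_accuracy(triplets, model):
        # triplets: list of (anchor, lexical_match, semantic_match) strings.
        # A triplet passes when the semantic match is rated more similar.
        wins = 0
        for anchor, lexical, semantic in triplets:
            emb = model.encode([anchor, lexical, semantic], normalize_embeddings=True)
            if float(emb[0] @ emb[2]) > float(emb[0] @ emb[1]):
                wins += 1
        return wins / len(triplets)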
Anyway, has anyone else noticed this problem?
I’m not sure what the “biggest” problem is, but I do think diversity is vastly underappreciated compared to relevance.
You can have maximally relevant search results that are still horrible, because most users (and LLMs) want to understand the range of options, not just one type of relevant option.
Searching for “shoes” and only seeing athletic shoes is a bad experience. You’ll sell more shoes, and keep the user engaged, if you show a diverse range of shoes.
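A standard way to trade relevance off against diversity at ranking time is Maximal Marginal Relevance (MMR). A minimal sketch, assuming L2-normalized embedding vectors:

    import numpy as np

    def mmr(query_vec, doc_vecs, k=10, lam=0.7):
        # Maximal Marginal Relevance: lam=1.0 is pure relevance;
        # lower values penalize results similar to ones already picked.
        # Assumes L2-normalized vectors, so dot product == cosine similarity.
        relevance = doc_vecs @ query_vec
        selected = [int(np.argmax(relevance))]
        candidates = set(range(len(doc_vecs))) - set(selected)
        while candidates and len(selected) < k:
            def score(i):
                redundancy = max(float(doc_vecs[i] @ doc_vecs[j]) for j in selected)
                return lam * relevance[i] - (1 - lam) * redundancy
            best = max(candidates, key=score)
            selected.append(best)
            candidates.remove(best)
        return selected  # indices of relevant-but-varied results

With lam at 1.0 you get the usual top-k; lowering it pushes the list to cover more of the “shoe” space instead of ten near-duplicates.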
Diversity might also be useful for dataset curation, or even just prompt engineering: for example, picking a diverse set of examples for training or evaluating a classifier.
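One simple way to do that selection is greedy farthest-point (k-center) sampling over the embeddings, which maximizes coverage rather than relevance. A rough sketch:

    import numpy as np

    def diverse_subset(embeddings, k, seed=0):
        # Greedy farthest-point selection: repeatedly add the example
        # least similar to anything already chosen. Assumes L2-normalized rows.
        rng = np.random.default_rng(seed)
        chosen = [int(rng.integers(len(embeddings)))]
        # Track each example's similarity to its nearest chosen neighbor.
        nearest_sim = embeddings @ embeddings[chosen[0]]
        for _ in range(k - 1):
            nxt = int(np.argmin(nearest_sim))
            chosen.append(nxt)
            nearest_sim = np.minimum(nearest_sim, embeddings @ embeddings[nxt])
        return chosen  # indices of a spread-out subset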
True, I think that’s also a great use case! These algorithms likely won’t scale to very large datasets (e.g. millions of samples), but for smaller datasets, like fine-tuning sets, I think this would work very well. I’ve worked on something similar in the past that does work for larger datasets (semantic deduplication: https://github.com/MinishLab/semhash).
Fascinating!