Yes, also for semantic indexes, I use one for person/role/org matches. So that CEO == chief executive ~= managing director good when you have grey data and multiple look up data sources that use different terms.
Embeddings are good at partitioning document stores at a coarse grained level, and they can be very useful for documents where there's a lot of keyword overlap and the semantic differentiation is distributed. They're definitely not a good primary recall mechanism, and they often don't even fully pull weight for their cost in hybrid setups, so it's worth doing evals for your specific use case.
You could assign the cluster based on what the k nearest neighbors are, if there is a clear majority. The quality will depend on the suitability of your embeddings.
It altogether depends on the quality and suitability of the provided embedding vector that you provide. Even with a long embedding vector using a recent model, my estimation is that the classification will be better than random but not too accurate. You would typically do better by asking a large model directly for a classification. The good thing is that it is often easy to create a small human labeled dataset and estimate the error confusion matrix via each approach.
Their self-reported benchmarks have them out-performing pinecone by 7x in queries-per-second: https://zvec.org/en/docs/benchmarks/
I'd love to see those results independently verified, and I'd also love a good explanation of how they're getting such great performance.
Did someone compared with uSearch (https://github.com/unum-cloud/USearch)?
That I would like to see too, usearch is amazingly fast, 44m embeddings in < 100ms
I thought you need memory for these things and CPU is not the bottleneck?
Are these sort of similarity searches useful for classifying text?
Yes, also for semantic indexes, I use one for person/role/org matches. So that CEO == chief executive ~= managing director good when you have grey data and multiple look up data sources that use different terms.
Embeddings are good at partitioning document stores at a coarse grained level, and they can be very useful for documents where there's a lot of keyword overlap and the semantic differentiation is distributed. They're definitely not a good primary recall mechanism, and they often don't even fully pull weight for their cost in hybrid setups, so it's worth doing evals for your specific use case.
You could assign the cluster based on what the k nearest neighbors are, if there is a clear majority. The quality will depend on the suitability of your embeddings.
It altogether depends on the quality and suitability of the provided embedding vector that you provide. Even with a long embedding vector using a recent model, my estimation is that the classification will be better than random but not too accurate. You would typically do better by asking a large model directly for a classification. The good thing is that it is often easy to create a small human labeled dataset and estimate the error confusion matrix via each approach.