Zvec: A lightweight, fast, in-process vector database

(github.com)

42 points | by dvrp 2 days ago

9 comments

simonw 27 minutes ago

Their self-reported benchmarks have them outperforming Pinecone by 7x in queries per second: https://zvec.org/en/docs/benchmarks/

I'd love to see those results independently verified, and I'd also love a good explanation of how they're getting such great performance.

clemlesne 2 hours ago

Has anyone compared it with USearch (https://github.com/unum-cloud/USearch)?

neilellis 11 minutes ago

I would like to see that too. USearch is amazingly fast: 44M embeddings in < 100 ms.

_pdp_ 30 minutes ago

I thought memory was what you needed for these things, and that CPU was not the bottleneck?

skybrian an hour ago

Are these sorts of similarity searches useful for classifying text?

neilellis 8 minutes ago

Yes, and also for semantic indexes. I use one for person/role/org matches, so that CEO == chief executive ~= managing director. That is good when you have grey data and multiple lookup data sources that use different terms.

CuriouslyC 39 minutes ago

Embeddings are good at partitioning document stores at a coarse-grained level, and they can be very useful for documents where there's a lot of keyword overlap and the semantic differentiation is distributed. They're definitely not a good primary recall mechanism, and they often don't even fully pull their weight for their cost in hybrid setups, so it's worth doing evals for your specific use case.

esafak 30 minutes ago

You could assign the cluster based on what the k nearest neighbors are, if there is a clear majority. The quality will depend on the suitability of your embeddings.
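A minimal sketch of that majority-vote idea, assuming you already have a handful of labeled embeddings. The function names, the `min_share` threshold, and the two-dimensional toy vectors are all made up for illustration; real embeddings would come from whatever model you use, and a vector database would replace the brute-force sort.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_label(query, labeled, k=3, min_share=0.6):
    """Return the majority label among the k nearest neighbors,
    or None when no label reaches min_share of the votes.

    labeled is a list of (vector, label) pairs. min_share is a
    hypothetical knob for what counts as a "clear majority".
    """
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)
    top = ranked[:k]
    if not top:
        return None
    votes = Counter(label for _, label in top)
    label, count = votes.most_common(1)[0]
    return label if count / len(top) >= min_share else None

# Toy 2-D vectors standing in for real embeddings (illustrative only).
LABELED = [
    ([1.0, 0.0], "sports"),
    ([0.9, 0.1], "sports"),
    ([0.0, 1.0], "finance"),
    ([0.1, 0.9], "finance"),
]

clear = knn_label([0.8, 0.2], LABELED, k=3)   # two of three neighbors agree
tied = knn_label([0.5, 0.5], LABELED, k=4)    # 2 vs 2, no clear majority
print(clear, tied)
```

Returning None on a split vote keeps the ambiguous cases visible, so they can be sent to a fallback such as a human or a larger model.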

OutOfHere an hour ago

It depends entirely on the quality and suitability of the embedding vector that you provide. Even with a long embedding vector from a recent model, my estimate is that the classification will be better than random but not very accurate. You would typically do better by asking a large model directly for a classification. The good thing is that it is often easy to create a small human-labeled dataset and estimate the confusion matrix for each approach.
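A rough sketch of that comparison, assuming you have a small human-labeled set and predictions from each approach. The labels and predictions below are invented placeholders to show the mechanics, not measurements of any real system.

```python
from collections import Counter

def confusion_matrix(truth, predicted):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(truth, predicted))

def accuracy(truth, predicted):
    """Fraction of examples where the prediction matches the true label."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

# Placeholder data: a tiny human-labeled set and two hypothetical
# sets of predictions (nearest-neighbor vs. asking a large model).
truth    = ["spam", "spam", "ham", "ham", "ham"]
knn_pred = ["spam", "ham",  "ham", "ham", "spam"]
llm_pred = ["spam", "spam", "ham", "ham", "spam"]

for name, pred in [("knn", knn_pred), ("llm", llm_pred)]:
    print(name, "accuracy:", accuracy(truth, pred))
    for (true_label, pred_label), n in sorted(confusion_matrix(truth, pred).items()):
        print(f"  true={true_label} predicted={pred_label}: {n}")
```

The off-diagonal cells show which classes each approach confuses, which is usually more informative than a single accuracy number.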