Zvec: A lightweight, fast, in-process vector database

(github.com)

42 points | by dvrp 2 days ago

9 comments

  • simonw 27 minutes ago

    Their self-reported benchmarks have them outperforming Pinecone by 7x in queries per second: https://zvec.org/en/docs/benchmarks/

    I'd love to see those results independently verified, and I'd also love a good explanation of how they're getting such great performance.

  • clemlesne 2 hours ago

    Has anyone compared it with uSearch (https://github.com/unum-cloud/USearch)?

    • neilellis 11 minutes ago

      That I would like to see too; usearch is amazingly fast, querying 44M embeddings in < 100 ms.
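
      A minimal sketch of what that looks like with USearch's Python bindings, if I'm remembering the API right (dimensions and data are stand-ins, not that 44M-vector setup):

        import numpy as np
        from usearch.index import Index

        # Build an in-process approximate-nearest-neighbor index over stand-in vectors.
        index = Index(ndim=128, metric="cos")
        keys = np.arange(100_000, dtype=np.uint64)
        vectors = np.random.rand(100_000, 128).astype(np.float32)
        index.add(keys, vectors)

        # Query the top-10 nearest neighbors for a single vector.
        matches = index.search(vectors[0], 10)
        print(matches.keys, matches.distances)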

  • _pdp_ 30 minutes ago

    I thought you need memory for these things, and that CPU is not the bottleneck?

  • skybrian an hour ago

    Are these sorts of similarity searches useful for classifying text?

    • neilellis 8 minutes ago

      Yes, and also for semantic indexes. I use one for person/role/org matching, so that CEO == chief executive ~= managing director. That's good when you have grey data and multiple lookup data sources that use different terms.
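
      As a rough sketch of that kind of term matching with embeddings (the model and the role strings here are just illustrative choices):

        from sentence_transformers import SentenceTransformer, util

        # Any sentence-embedding model works; this is a small, common choice.
        model = SentenceTransformer("all-MiniLM-L6-v2")

        canonical = ["chief executive officer", "chief financial officer", "managing director"]
        query = "CEO"

        emb_canonical = model.encode(canonical, convert_to_tensor=True)
        emb_query = model.encode(query, convert_to_tensor=True)

        # Cosine similarity between the query term and each canonical role.
        scores = util.cos_sim(emb_query, emb_canonical)[0]
        for role, score in zip(canonical, scores):
            print(f"{role}: {score.item():.3f}")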

    • CuriouslyC 39 minutes ago

      Embeddings are good at partitioning document stores at a coarse-grained level, and they can be very useful for documents where there's a lot of keyword overlap and the semantic differentiation is distributed. They're definitely not a good primary recall mechanism, and they often don't even fully pull their weight for their cost in hybrid setups, so it's worth doing evals for your specific use case.

    • esafak 30 minutes ago

      You could assign the class based on the labels of the k nearest neighbors, if there is a clear majority. The quality will depend on the suitability of your embeddings.
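
      A minimal sketch of that majority-vote idea, assuming precomputed embeddings and scikit-learn (labels and dimensions are made up for illustration):

        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier

        # Precomputed embeddings for labeled examples (stand-in data).
        X_train = np.random.rand(200, 384).astype(np.float32)
        y_train = np.random.choice(["billing", "support", "sales"], size=200)

        # Cosine distance plus a majority vote over the k nearest neighbors.
        clf = KNeighborsClassifier(n_neighbors=5, metric="cosine")
        clf.fit(X_train, y_train)

        X_new = np.random.rand(3, 384).astype(np.float32)
        print(clf.predict(X_new))        # predicted labels
        print(clf.predict_proba(X_new))  # vote shares, a rough "clear majority" check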

    • OutOfHere an hour ago

      It depends entirely on the quality and suitability of the embedding vectors you provide. Even with a long embedding vector from a recent model, my estimate is that the classification will be better than random but not especially accurate. You would typically do better by asking a large model directly for a classification. The good thing is that it is often easy to create a small human-labeled dataset and estimate the confusion matrix for each approach.
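
      For that last point, a minimal sketch of comparing the two approaches on a small labeled set with scikit-learn (labels and predictions are made up for illustration):

        from sklearn.metrics import accuracy_score, confusion_matrix

        # Small human-labeled set: true labels vs. predictions from each approach.
        y_true = ["spam", "ham", "spam", "ham", "spam", "ham"]
        y_knn  = ["spam", "ham", "ham",  "ham", "spam", "spam"]  # embedding + kNN
        y_llm  = ["spam", "ham", "spam", "ham", "spam", "ham"]   # direct LLM classification

        for name, y_pred in [("embedding kNN", y_knn), ("LLM", y_llm)]:
            print(name, accuracy_score(y_true, y_pred))
            print(confusion_matrix(y_true, y_pred, labels=["spam", "ham"]))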