The PlanetScale vectors public beta

(planetscale.com)

37 points | by ksec 3 hours ago

7 comments

  • tanoku 2 hours ago

    Hi! I'm one of the authors of this feature. It's quite novel, because it's not just an HNSW plug-in for MySQL (like pgvector is for Postgres). It's a fully transactional index, integrated into InnoDB.

    We based the implementation on two recent papers from Microsoft Research, SPANN and SPFresh. SPANN is a hybrid graph/tree algorithm that does a fantastic job of scaling larger-than-RAM indexes (https://arxiv.org/abs/2111.08566), and SPFresh expands on it by defining a set of background operations that maintain the index's performance and recall while it's continuously updated in place (https://arxiv.org/html/2410.14452v1). The novel thing here is designing all the SPANN _and_ SPFresh operations to be transactional, and integrating them into MySQL's default storage engine.
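
    To give a feel for the shape of the algorithm, here's a rough Python sketch of SPANN-style search (illustrative only, not our actual implementation): centroids stay in memory, posting lists live on disk, and a query only ever touches a handful of partitions.

        # SPANN-style search, heavily simplified: an in-memory centroid
        # index routes each query to a few on-disk posting lists.
        import numpy as np

        def spann_search(query, centroids, posting_lists, nprobe=4, k=10):
            # centroids: (num_partitions, dim) array kept in RAM.
            # posting_lists: partition id -> [(vector_id, vector), ...] on disk.
            dists = np.linalg.norm(centroids - query, axis=1)
            nearest = np.argsort(dists)[:nprobe]   # probe only a few partitions

            candidates = []
            for pid in nearest:
                for vec_id, vec in posting_lists[pid]:
                    candidates.append((np.linalg.norm(vec - query), vec_id))

            # Return the k closest candidates found across the probed lists.
            return [vec_id for _, vec_id in sorted(candidates)[:k]]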

    This tight integration fundamentally means that inserts, updates, and deletes of vector data in MySQL are always reflected immediately in the index as part of committing your transaction. But it also means that the indexes are fully covered by the MySQL binlog: they recover from hard crashes just fine. They're also managed by MySQL's buffer pool, so they scale to terabytes of data, just like any other table. And crucially, they're fully integrated with the query planner, so they can be used in any query, including JOINs and WHERE clauses (something many other vector indexes really struggle with).
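
    As a usage sketch (hypothetical SQL: the VECTOR column type and distance() function below are placeholders, not necessarily the beta's real names, so check the docs), the point is that the row insert and the index update commit atomically, and the index participates in ordinary query plans:

        # Hypothetical client code against a vector-indexed table.
        import mysql.connector

        conn = mysql.connector.connect(
            host="host", user="user", password="password", database="app"
        )
        cur = conn.cursor()

        # The vector index is updated inside the same transaction, so the
        # new embedding is searchable the instant the COMMIT lands.
        cur.execute(
            "INSERT INTO products (name, category_id, embedding) VALUES (%s, %s, %s)",
            ("espresso machine", 3, "[0.12, 0.98, 0.34, 0.77]"),
        )
        conn.commit()

        # Because the planner knows about the index, vector search composes
        # with JOINs and WHERE clauses like any other access path.
        cur.execute(
            """SELECT p.name, c.name
               FROM products p JOIN categories c ON c.id = p.category_id
               WHERE c.active = 1
               ORDER BY distance(p.embedding, %s)
               LIMIT 10""",
            ("[0.10, 0.95, 0.30, 0.80]",),
        )
        print(cur.fetchall())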

    We plan to release a whitepaper on our transactionality extensions to SPFresh, which I think will be super interesting, but meanwhile please feel free to test the beta and ask me any questions (here, or by emailing PlanetScale support). Thanks!

    • sujayakar 37 minutes ago

      Very cool stuff! I just read the SPFresh paper a few days ago and was wondering if it's been implemented in industry (e.g. in Turbopuffer's implementation of SPANN).

      I'd be curious how y'all represent the posting lists for each partition in InnoDB:

      - what IDs are you storing in the posting lists?

      - how are the posting lists represented on disk? are they using compression and/or some form of skip indexing? the paper seemed to use a pretty simple block-based representation, but I'm curious what works well in practice.

      - how do the posting list data structures themselves handle incremental updates and MVCC?

      • tanoku 23 minutes ago

        These are very relevant questions! Thank you!

        We're storing IDs from a ghost column that is created in the table where you're inserting vector data. This works very well in practice and lets you update the vector values in the table: an update is translated into a delete + insert in the vector index by assigning a new ghost ID.
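
        In pseudo-Python, the idea looks roughly like this (assumed details for illustration; the real mechanism lives inside InnoDB):

            # The vector index is keyed by a hidden "ghost" ID rather than
            # the primary key, so updating a row's vector turns into a
            # plain delete + insert against the index.
            class VectorIndex:
                def __init__(self):
                    self.entries = {}            # ghost_id -> vector

                def insert(self, ghost_id, vec):
                    self.entries[ghost_id] = vec

                def delete(self, ghost_id):
                    del self.entries[ghost_id]

            def update_row_vector(row, index, new_vec, next_ghost_id):
                # Drop the stale entry, point the row at a fresh ghost ID,
                # and insert the new vector under that ID.
                index.delete(row["ghost_id"])
                row["ghost_id"] = next_ghost_id
                index.insert(next_ghost_id, new_vec)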

        We have abstracted the quantization system away from the index; for the initial release, vector data is stored in raw blocks, like in the paper. Query performance is good, but disk usage is high. We're actively testing different quantization algorithms to see which ones we end up offering at GA. We're hoping our beta users will help guide this choice!
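
        As one concrete example of the trade-off (scalar quantization here is just an illustration, not a committed GA choice): going from float32 to int8 cuts the on-disk footprint 4x at some cost in precision.

            # Simple scalar quantization: float32 -> int8 with a single
            # scale/offset for the whole dataset.
            import numpy as np

            def quantize(vecs):
                lo = vecs.min()
                scale = (vecs.max() - lo) / 255.0
                q = np.round((vecs - lo) / scale - 128).astype(np.int8)
                return q, lo, scale

            def dequantize(q, lo, scale):
                return (q.astype(np.float32) + 128) * scale + lo

            vecs = np.random.rand(1000, 768).astype(np.float32)
            q, lo, scale = quantize(vecs)
            err = np.abs(dequantize(q, lo, scale) - vecs).max()
            print(q.nbytes / vecs.nbytes, err)   # 0.25, ~half a quantization step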

        Incremental updates and MVCC are _extremely tricky_, for both correctness and performance. As you've surely noticed, the hard thing here is that the original paper is very focused on LSM trees, because it exploits the fact that LSM trees are compacted lazily to perform incremental updates to the posting lists ('merges'). MySQL (and Postgres, and most relational databases, really) is B-tree based, and in-place updates for B-trees are expensive! I think we came up with very interesting workarounds for the problem, but it's quite a bit to drill into in an HN comment. Please stay tuned for our whitepaper. :)
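
        For intuition on why the LSM setting makes this easier (a generic illustration of the paper's approach, not our workaround):

            # In an LSM tree, an insert is a cheap append; splitting or
            # merging an oversized posting list is deferred to background
            # compaction. A B-tree engine updates pages in place, so it
            # has no compaction phase to hide that rewrite cost in.
            def insert(posting_list, entry, max_len, pending_splits):
                posting_list.append(entry)               # O(1), LSM-style append
                if len(posting_list) > max_len:
                    pending_splits.append(posting_list)  # defer the costly split

            def background_compact(pending_splits, split_fn):
                while pending_splits:
                    split_fn(pending_splits.pop())       # runs off the write path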

  • alberth 2 hours ago

    Given that PlanetScale only operates in a handful of AWS and Google regions, isn't the latency brutal for any customer not hosted in one of the ~12 datacenters PlanetScale is in?

  • vvladymyrov an hour ago

    Awesome. Does the PlanetScale vector index support multi-node indexes (like Milvus), and, going further, storing the index in a blob store separate from compute, so you only pay for compute on demand when queries run?

    • tanoku 41 minutes ago

      Definitely yes to the first: the index implementation is fully integrated with PlanetScale's sharding layer, so you can scale horizontally as much as you need. This works very well in practice: the sharding layer, Vitess, is extremely battle-tested with petabyte-sized clusters in production, and our vector indexes behave like any other table on the cluster, so they scale horizontally with very predictable performance.

      As for separating storage and compute: we don't do that. One of our key selling points is that this is vector data fully integrated into your relational database with ACID semantics. It's very hard to separate storage and compute and keep that behavior!

  • sgammon 2 hours ago

    Awesome! Excited to play with this one.