This is actually really cool, and despite what I'm sure will come off as (constructive) criticism, I am very impressed!
First, I think you oversell the overhead of keeping data in sync and the costs of not doing so in a timely manner. Almost any distributed system that is using multiple databases already needs to have a strategy for dealing with inconsistent data. As far as this problem goes, inconsistent embeddings are a pretty minor issue given that (1) most embedding-based workflows don't do a lot of updating/deletion; and (2) the sheer volume of embeddings from even a small corpus of data means that in practice you're unlikely to notice consistency issues. In most cases you can get away with doing much less than is described in this post. That being said, I want to emphasize that I still think not having to worry about syncing data is indeed cool.
Second, IME the most significant drawback to putting your embeddings in a Postgres database with all your other data is that the workload looks so different. To take one example, HNSW indices using pgvector consume a ton of resources - even a small index of tens of millions of embeddings may be hundreds of gigabytes on disk and requires very aggressive vacuuming to perform optimally. It's very easy to run into resource contention issues when you effectively have an index that will consume all the available system resources. The canonical solution is to move your data into another database, but then you've recreated the consistency problem that your solution purports to solve.
Third, a question: how does this interact with filtering? Can you take advantage of partial indices on the underlying data? Are some of the limitations in pgvector's HNSW implementation (as far as filtering goes) still present?
Post co-author here. Really appreciate the feedback.
Your point about HNSW being resource intensive is one we've heard. Our team actually built another extension called pgvectorscale [1] which helps scale vector search on Postgres with a new index type (StreamingDiskANN). It has binary quantization (BQ) out of the box and can also store vectors on disk vs. only in memory.
Another practice I've seen work well is for teams to use a read replica to serve application queries and reduce load on the primary database.
To answer your third question, if you combine Pgai Vectorizer with pgvectorscale, the limitations around filtered search in pgvector HNSW are actually no longer present. Pgvectorscale implements streaming filtering, ensuring more accurate filtered search with Postgres. See [2] for details.
[1]: https://github.com/timescale/pgvectorscale [2]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas...
Thanks for your answer. I hear you on using a read replica to serve embedding-based queries, but I worry there are lots of cases where that breaks down in practice: presumably you still need to do a bunch of I/O on the primary to support insertion, and presumably reconstituting an index (e.g. to test out new hyperparameters) isn't cheap. At least you can offload the memory requirements of reading big chunks of your graph onto the follower, though.
Cool to see the pgvectorscale stuff; it sounds like the approach for filtering is not dissimilar to the direction that the pgvector team are taking with 0.8.0, although the much-denser graph (relative to HNSW) may mean the approach works even better in practice?
Great point!
(Disclaimer: I work for Elastic)
Elasticsearch has recently added a data type called semantic_text, which automatically chunks text, calculates embeddings, and stores the chunks with sensible defaults.
Queries are similarly simplified: vectors are calculated and compared internally, which means a lot less I/O and much simpler client code.
https://www.elastic.co/search-labs/blog/semantic-search-simp...
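For illustration, the mapping and query take roughly this shape (sketched from memory; the index and field names here are made up, so check the docs above for the current syntax):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "content": { "type": "semantic_text" }
    }
  }
}

POST my-index/_search
{
  "query": {
    "semantic": { "field": "content", "query": "walks on the beach" }
  }
}
```

Chunking, embedding, and vector comparison all happen server-side; the client only ever sees text.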
I made something similar, but used duckDB as the vector store (and query engine)! It’s impressively fast
https://github.com/patricktrainer/duckdb-embedding-search
There is vector type data available in duckdb now?
They call it a fixed size array type but, yes. It was added earlier this year. Works really great
https://duckdb.org/2024/05/03/vector-similarity-search-vss.h...
Yep! It was added in v0.10.0 - which was released a month or two after I made this.
This is using v0.9.1
How does their embedding model compare in terms of retrieval accuracy to, say `text-embedding-3-small` and `text-embedding-3-large`?
You can use OpenAI embeddings in Elastic if you don't want to use their ELSER sparse embeddings.
It’s impossible to answer that question without knowing what content/query domain you are embedding. Check out the MTEB leaderboard, dig into the retrieval benchmark, and look for analogous datasets.
So we're talking about maximizing the embedding model per use case? Medical data would require a different model than, say, sales data? Sounds like a very fragmented approach.
The answer lies with a validation dataset that you create for testing.
Hey HN! Post co-author here, excited to share our new open-source PostgreSQL tool that re-imagines vector embeddings as database indexes. It's not literally an index but it functions like one to update embeddings as source data gets added, deleted or changed.
Right now the system only supports OpenAI as an embedding provider, but we plan to extend with local and OSS model support soon.
Eager to hear your feedback and reactions. If you'd like to leave an issue or better yet a PR, you can do so here [1]
[1]: https://github.com/timescale/pgai
I'm doing something similar with go + postgres
Pretty smart. Why is the DB api the abstraction layer though? Why not two columns and a microservice. I assume you are making async calls to get the embeddings?
I say that because it seems unusual. An index would suit sync better. But async things like embeddings, geocoding an address, or whether an email is considered spam feel like app-level stuff.
(post co-author here)
The DB is the right layer from an interface point of view, because that's where the data properties should be defined. We also use the DB for bookkeeping of what needs to be done, because we can leverage transactions and triggers to make sure we never miss any data. From an implementation point of view, the actual embedding does happen outside the database, in a Python worker or cloud functions.
Merging the embeddings and the original data into a single view allows the full feature set of SQL rather than being constrained by a REST API.
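A minimal sketch of that bookkeeping pattern (illustrative only: it uses SQLite triggers instead of Postgres, and the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT);
-- Work queue: rows whose embeddings are stale.
CREATE TABLE embedding_queue (doc_id INTEGER, queued_at TEXT DEFAULT CURRENT_TIMESTAMP);

-- Triggers enqueue work transactionally with the write,
-- so no change can be missed.
CREATE TRIGGER docs_ins AFTER INSERT ON docs
BEGIN INSERT INTO embedding_queue (doc_id) VALUES (NEW.id); END;
CREATE TRIGGER docs_upd AFTER UPDATE OF body ON docs
BEGIN INSERT INTO embedding_queue (doc_id) VALUES (NEW.id); END;
""")

conn.execute("INSERT INTO docs (body) VALUES ('hello world')")
conn.execute("UPDATE docs SET body = 'hello again' WHERE id = 1")

# An out-of-process worker would poll this queue, call the
# embedding API, write the vectors, and delete the queue rows.
pending = conn.execute("SELECT doc_id FROM embedding_queue").fetchall()
print(pending)  # two queue entries for doc 1
```

Because the queue insert happens in the same transaction as the write, a crashed or rate-limited worker can always pick the work back up later.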
That is arguable: while it is a calculated field, it is not a pure one (I/O is required), and it is not necessarily idempotent, not atomic, and not guaranteed to succeed.
It is certainly convenient for the end user, but it hides things. What if the API calls to OpenAI fail or get rate limited? How is that surfaced? Will I see that in my observability? Will queries just silently miss results?
If the DB did the embedding itself synchronously within the write, it would make sense. That would be more like Elasticsearch or a typical full-text index.
This is super cool! One suggestion for the blog: I would put "re-imagines vector embeddings as database indexes. It's not literally an index but it functions like one to update embeddings as source data gets added, deleted or changed." as a tl/dr at the top.
It wasn't clear to me why this was significantly different than using pg_vector until I read that. That makes the rest of the post (e.g. why you need the custom methods in a `SELECT`) make a lot more sense in context.
Thank you for sharing this! I have one question: Is there any plan to add support for local LLM / embeddings models?
"Right now the system only supports OpenAI as an embedding provider, but we plan to extend with local and OSS model support soon."
In the post you responded to
Haha I feel so dumb now. Thank you!
Whats wrong with using FAISS as your single db?
It's like SQLite for vector embeddings, and you can store metadata (the primary data, foreign keys, etc.) along with the vectors, preserving the relationship.
Not sure if the metadata is indexed, but IIRC it's more or less trivial to update the embeddings when your data changes (though I haven't used it in a while, so not sure).
Good q. For most standalone vector search use cases, FAISS or a library like it is good.
However, FAISS is not a database. It can store metadata alongside vectors, but it doesn't have things you'd want in your app db like ACID compliance, non-vector indexing, and proper backup/recovery mechanisms. You're basically giving up all the DBMS capabilities.
For new RAG and search apps, many teams prefer just using a single app db with vector search capabilities included (Postgres, Mongo, MySQL etc) vs managing an app db and a separate vector db.
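To make the tradeoff concrete, here is a hedged sketch (pure Python, toy data, no FAISS) of what a vector *library* gives you: nearest-neighbor search over raw vectors, with the id-to-metadata mapping left entirely to application code:

```python
import math

# The "index": just vectors, keyed by position.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
# Metadata lives in a separate structure the app must keep in sync
# by hand: no transactions, no foreign keys, no backup story.
metadata = {0: "doc about cats", 1: "doc about kittens", 2: "doc about finance"}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query, k=2):
    # Rank every stored vector by cosine similarity to the query.
    scored = sorted(range(len(vectors)), key=lambda i: -cosine(query, vectors[i]))
    return [metadata[i] for i in scored[:k]]

print(search([1.0, 0.05]))  # the two cat-related docs
```

Delete a row from `metadata` without deleting its vector (or vice versa) and nothing complains: that gap is exactly what a DBMS closes for you.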
Hey, this looks great! I'm a huge fan of vectors in Postgres or wherever your data lives, and this seems like a great abstraction.
When I write a SQL query that includes a vector search and some piece of logic, like:

```
select name from users where age > 21 order by <vector_similarity(users.bio, "I like long walks on the beach")> limit 10;
```

Does it filter by age first or second? I've liked the DX of pg_vector, but they do vector search, followed by filtering. It seems like that slows down what should be the superpower of a setup like this.
Here's a bit more of a complicated example of what I'm talking about: https://blog.bawolf.com/p/embeddings-are-a-good-starting-poi...
(post co-author here)
It could do either, depending on what the planner decides. In pgvector it usually does post-filtering in practice (filter after vector search).
pgvector HNSW has the problem that there is a cutoff of retrieving some constant C results, and if none of them match the filter then it won't find results. I believe newer versions of pgvector address that. Also, pgvectorscale's StreamingDiskANN [1] doesn't have that problem to begin with.
[1]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas...
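The cutoff failure mode is easy to simulate (a toy sketch, not pgvector itself): retrieve a fixed C candidates by similarity, then apply the filter; if no candidate passes, the query returns nothing even though matching rows exist:

```python
# Each row: (id, similarity_to_query, passes_filter).
# The most similar rows all fail the filter; only ids >= 50 pass.
rows = [(i, 1.0 - i * 0.01, i >= 50) for i in range(100)]

def post_filter_search(rows, c=10, k=5):
    # Step 1: the vector index returns the top-C most similar
    # rows, oblivious to the filter.
    candidates = sorted(rows, key=lambda r: -r[1])[:c]
    # Step 2: the filter is applied afterwards.
    return [r[0] for r in candidates if r[2]][:k]

def streaming_filter_search(rows, k=5):
    # A streaming index keeps yielding next-nearest rows until k
    # survivors pass the filter, which is (roughly) how
    # DiskANN-style streaming filtering avoids the cutoff.
    out = []
    for r in sorted(rows, key=lambda r: -r[1]):
        if r[2]:
            out.append(r[0])
        if len(out) == k:
            break
    return out

print(post_filter_search(rows))       # [] -- top 10 all fail the filter
print(streaming_filter_search(rows))  # [50, 51, 52, 53, 54]
```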
pg_vector does post-filtering, not pre-filtering
Thankfully, TimescaleDB's pgvectorscale extension does pre-filtering. Shame I can't get it in RDS though.
You can request it for RDS
Wow, actually a good point I haven't seen anyone make.
Taking raw embeddings and storing them in vector databases would be like taking raw n-grams of your text and putting them into a database for search.
Storing documents makes much more sense.
Been using pgvector for a while, and to me it was kind of obvious that the source document and the embeddings are fundamentally linked, so we always stored them "together". Basically, anyone doing embeddings at scale is doing something similar to what Pgai Vectorizer is doing, and it is certainly a nice abstraction.
I used FAISS as it also allowed me to trivially store them together.
Idk how well it scales though, it's just doing its job at my hobby-project scale.
For my few hundred thousand embeddings, I must say the performance was satisfactory.
I’m using sqlite-vec along with FTS5 in (you guessed it) SQLite and it’s pretty cool. :)
I agree.
Similar to the blog post, but instead of at the extension layer, I built a PostgreSQL ORM for Node.js based on ActiveRecord + Django's ORM that includes the concept of vector fields [0][1] and lets you write code like this:

// Stores the `title` and `content` fields together as a vector
// in the `content_embedding` vector field
BlogPost.vectorizes(
  'content_embedding',
  (title, content) => `Title: ${title}\n\nBody: ${content}`
);

// Find the top 10 blog posts matching "blog posts about dogs"
// Automatically converts query to a vector
let searchBlogPosts = await BlogPost.query()
  .search('content_embedding', 'blog posts about dogs')
  .limit(10)
  .select();

I find it tremendously useful; you can query the underlying data or the embedding content, and you can define how the fields in the model get stored as embeddings in the first place.

[0] https://github.com/instant-dev/orm?tab=readme-ov-file#using-...
[1] https://github.com/instant-dev/orm?tab=readme-ov-file#using-...
Yes. Materialized Views are good.
That was just what I was thinking. This approach will have the same issues that materialized views have as well
haha. We had a good internal debate as to whether this is more like indexes or more like Materialized Views. It's kinda a mixture of the two.
Safe to say that if you're using off-the-shelf character-based chunking, your AI app is not past PoC.
> Vector databases treat embeddings as independent data, divorced from the source data from which embeddings are created
With the exception of Pinecone: Chroma, Qdrant, Weaviate, Elastic, Mongo, and many others store the chunk/document alongside the embedding.
This is intentional misinformation.
Post co-author here. The point is a little nuanced, so let me explain:
You are correct in saying that you can store embeddings and source data together in many vector DBs. We actually point this out in the post. The main point is that they are not linked but merely stored alongside each other. If one changes, the other does not automatically change, so the relationship between the two goes stale.
The idea behind Pgai Vectorizer is that it actually links embeddings with the underlying source data, so that changes in source data are automatically reflected in embeddings. This is a better abstraction, and it removes the burden on the engineer of ensuring embeddings stay in sync as their data changes.
I know that in Chroma this is supported out of the box with 0 lines of code. I'm pretty sure it's supported everywhere else in no more than 3 lines of code.
This is also the case with weaviate (as you assumed). If you update the value of any previously vectorized property, weaviate generates new vectors automatically for you.
as far as I can tell Chroma can only store chunks, not the original documents. This is from your docs `If the documents are too large to embed using the chosen embedding function, an exception will be raised`.
In addition, it seems that embeddings happen at ingest time. So if, for example, the OpenAI endpoint is down, the insert will fail. That, in turn, means your users need a retry mechanism and a queuing system: all the complexity we describe in our blog.
Obviously, I am not an expert in Chroma. So apologies in advance if I got anything wrong. Just trying to get to the heart of the differences between the two systems.
Chroma certainly doesn't have the most advanced API in this area, but you can for sure store chunks or documents; it's up to you. If your document is too large to generate embeddings for in a single forward pass, then yes, you do need to chunk in that scenario.
Often, though, even if the document does fit, you choose to chunk anyway, or further transform the data with abstractive/extractive summarization techniques to improve your search dynamics. This is why I'm not sure the complexity noted in the article is relevant in anything beyond a "naive RAG" stack. How it's stored or linked is an issue to some degree, but the greater, more complex smell is in what happens before you even get to the point of inserting the data.
For more production-grade RAG, blindly inserting wholesale embeddings of full documents is rarely going to get you great results (this varies a lot between document sizes and domains). So you're almost always going to be doing ahead-of-time chunking (or summarization/NER/etc.), not because you have to due to document size, but because your search performance demands it. Frequently this involves more than one embedding model for capturing different semantics or supporting different tasks, not to mention reranking after the initial sweep.
That's the complexity that I think is worth tackling in a paid product offering, but the current state of the module described in the article isn't really competitive with the rest of the field in that respect IMHO.
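As a concrete (and deliberately naive) illustration of that ahead-of-time chunking, a fixed-size character window with overlap is about the simplest possible strategy; real pipelines layer recursive splitting, summarization, or NER on top of something like this:

```python
def chunk(text, size=40, overlap=10):
    """Split text into fixed-size character windows with overlap,
    so content near a chunk boundary appears in both neighbors."""
    step = size - overlap
    # max(..., 1) guarantees at least one chunk for short inputs.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "a" * 100
chunks = chunk(doc)
print(len(chunks))                       # 3 windows: [0:40], [30:70], [60:100]
print(all(len(c) <= 40 for c in chunks)) # True
```

Even this toy version exposes the two knobs (size, overlap) whose best values depend on the document domain and the embedding model's context window, which is exactly why chunking strategy is still an open question.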
> the responsibility for generating and updating them as the underlying data changes can be handed over to the database management system
And now we shift ever so slightly back towards logic in the DB. I for one am thrilled; there’s no reason other than unfamiliarity not to let the RDBMS perform functions it’s designed to do. As long as these offloads are documented in code, embrace not needing to handle it in your app.
This reads solely as a sales pitch, which quickly cuts to the "we're selling this product so you don't have to think about it."
...when you actually do want to think about it (in 2024).
Right now, we're collectively still figuring out:

1. Best chunking strategies for documents
2. Best ways to add context around chunks of documents
3. How to mix and match similarity search with hybrid search
4. Best way to version and update your embeddings
(post co-author here)
We agree a lot of stuff still needs to be figured out, which is why we made the vectorizer very configurable. You can configure chunking strategies and formatting (which is a way to add context back into chunks), and you can mix semantic and lexical search on the results. That handles your 1, 2, and 3. Versioning can mean a different version of the data (in which case the versioning info lives with the source data) OR a different embedding config, which we also support [1].
Admittedly, right now we have predefined chunking strategies. But we plan to add custom-code options very soon.
Our broader point is that the things you highlight above are the right things to worry about, not the data workflow ops and babysitting your lambda jobs. That's what we want to handle for you.
[1]: https://www.timescale.com/blog/which-rag-chunking-and-format...
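For a sense of what "configurable" looks like, the vectorizer is declared in SQL. This is a rough sketch from memory, with made-up table and column names, and the exact function signatures may differ from the current pgai release, so treat it as illustrative only:

```
SELECT ai.create_vectorizer(
  'blog'::regclass,
  -- embed with OpenAI; model name and dimensions are illustrative
  embedding => ai.embedding_openai('text-embedding-3-small', 1536),
  -- one of the predefined chunking strategies
  chunking => ai.chunking_character_text_splitter('content', 800, 100),
  -- formatting re-injects row context (e.g. the title) into each chunk
  formatting => ai.formatting_python_template('Title: $title\n$chunk')
);
```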
I've been in the vector database space for a while (primary author of txtai). I do think vector indexing in traditional databases with tools like pgvector is a good option.
txtai has long had SQLite + Faiss support to enable metadata filtering with vector search. That pattern can take you farther than you think.
The design decision I've made is to make it easy to plug in different backends for metadata and vectors. For example, txtai supports storing both in Postgres (w/ pgvector). It also supports sqlite-vec and DuckDB.
I'm not sure there is a one-size-fits-all approach. Flexibility and options seem like a win to me. Different situations warrant different solutions.