Ultra efficient vector extension for SQLite

(marcobambini.substack.com)

158 points | by marcobambini 5 days ago

54 comments

  • simonw 2 days ago

    This is a neat project - good API design, performance looks impressive.

    Note that this one isn't open source: https://github.com/sqliteai/sqlite-vector/blob/main/LICENSE....

    The announcement says:

    > We believe the community could benefit from sqlite-vector, which is why we’ve made it entirely free for open-source projects.

    But that's not really how this works. If my open source projects incorporate aspects of this project they're no longer open source!

    • thorsson12 2 days ago

      Ah, yes, this is a "source available" project, not what you would normally call an "open source" project. Still cool!

    • rudedogg 2 days ago

      Dang, I was really excited about this too.

      I guess I'll either stick with sqlite-vec or give turso another look. I'm not fond of the idea of a SQLite fork though.

      Do you know of anything else I should take a look at? I know you use a lot of this stuff for your open-source AI/ML work. I'd like something I can use on device.

      • Rendello 15 hours ago

        You can point DuckDB at a SQLite file and it will read it using its special columnar format. I'm not sure if that's what you need, though.

    • dmezzetti 2 days ago

      Odd licensing strategy here. It's like someone who wants the cachet of saying they're open source without being it.

    • OutOfHere 2 days ago

      In contrast, https://github.com/asg017/sqlite-vec is dual-licensed under Apache and MIT, which makes it open source.

    • tanvach 2 days ago

      There is the 'Additional Grant for Open-Source Projects' section that seems to permit inclusion in open-source projects. Do you mind explaining why you think this is not enough? I'm not an expert in licenses, so I'm genuinely interested in your take.

      • matharmin 2 days ago

        Let's say I have an open-source project licensed under Apache 2. The grant allows me to include the extension in my project. But it doesn't allow me to relicense it under Apache 2 or any other compatible license. So if I include it, my project can't be Apache 2-licensed anymore.

        Apache 2 is just an example here - the same would apply for practically any open source license.

        The one place I imagine it could still work is if the open-source project, say a sqlite browser, includes it as an optional plugin. So the project itself stays open-source, but the grant allows using the proprietary plugin with it.

        • stavros a day ago

          I don't see why this would infect your project, though. You aren't using the code directly, you're using it as a tool dependency, no? Same way as if your OSS project used an Oracle DB to store data.

          • nextaccountic 19 hours ago

            Unlike Oracle DB, sqlite gets embedded in your program binary. It's a library, not an external service, and this matters for OSS licenses

            • stavros 18 hours ago

              Ah true, I forgot because I always use it in Python, where it's built in.

      • simonw 2 days ago

        The reason I choose to apply open source licenses to my projects is that I want other people to be able to use them without any limitations (beyond those set out in the open source license I selected, which is extremely permissive).

        If they depend on software that carries limitations, I can no longer make that promise to my own users.

        Or does their extra license term mean I can ship my own project which is the thinnest possible wrapper around theirs but makes it fully open source? That seems unlikely.

        • b33j0r 4 hours ago

          I used to think this, but now I feel like anything I write will just be vacuumed up by bots and no human will ever even know about it, unless I include some kind of terms that at least make the work traceable to an artifact.

          In this aggregate form, there is little difference between pseudocode snippets in a post like this one, versus a well-maintained library getting scraped.

          The more I think about it, I don’t even really crave credit so much as the feedback loop that tells me whether I’m doing anything useful.

          I haven’t solved this contradiction, so I still release under the MIT license.

    • MangoToupe 2 days ago

      I guess "free software" is well and truly dead as a term with any general cultural weight.

      • theamk a day ago

        As long as threads like this appear, not yet.

        But not for the lack of trying - people keep trying to redefine it...

    • F3nd0 2 days ago

      Even worse, it seems like it’s not Free Software, either.

      • froh a day ago

          From the way you say this, it seems you're confusing cost-free with freedom; free software is about the latter and merely implies the former.

        • F3nd0 a day ago

          I was talking about freedom, hence the capitalisation to make that even clearer.

          The parent only talked about ‘open source’, which has a huge overlap with Free Software, but the two still have different formal definitions (not to mention the completely different ideas behind them). This still left the (admittedly unlikely) possibility of the software in question being Free (as in freedom), so I felt it worth pointing out it wasn’t that, either. A common way to talk about software which is explicitly both Free and open-source at the same time is to call it Free and Open-Source Software.

  • mholt 2 days ago

    > [sqlite-vec] works via virtual tables, which means vectors must live in separate tables and queries become more complex.

    Not really, you can just call the distance function directly and your vector blob can be in any regular field in any regular table, like what I do. Works great.

    More info: https://github.com/asg017/sqlite-vec/issues/196
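    To illustrate the query pattern described above (vectors stored as BLOBs in an ordinary column, ranked by calling a distance function directly), here is a sketch that registers a plain Python UDF in place of sqlite-vec's C-implemented distance functions; the table, schema, and data are made up for illustration:

```python
import math
import sqlite3
import struct

def l2_distance(a: bytes, b: bytes) -> float:
    """Euclidean distance between two float32 BLOBs."""
    va = struct.unpack(f"{len(a) // 4}f", a)
    vb = struct.unpack(f"{len(b) // 4}f", b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))

db = sqlite3.connect(":memory:")
db.create_function("l2_distance", 2, l2_distance, deterministic=True)

# Vectors live in an ordinary column of an ordinary table: no virtual table.
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, embedding BLOB)")
for i, v in [(1, [0.0, 0.0]), (2, [1.0, 1.0]), (3, [0.1, 0.1])]:
    db.execute("INSERT INTO docs VALUES (?, ?, ?)",
               (i, f"doc {i}", struct.pack(f"{len(v)}f", *v)))

query = struct.pack("2f", 0.0, 0.0)
rows = db.execute(
    "SELECT id, l2_distance(embedding, ?) AS dist FROM docs ORDER BY dist LIMIT 2",
    (query,),
).fetchall()
```

    A real extension supplies the distance function in C with SIMD, but the SQL shape of the query is the same.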

    • roflcopter69 2 days ago

      Sorry for not being on topic, just wanted to say hi @mholt and thank you for making and maintaining Caddy! Happy Caddy user here.

      • mholt 2 days ago

        Thank you, that's always nice to read! I will pass this along to our maintainer team.

  • roflcopter69 2 days ago

    It's unfortunate that this one is not really open source; it's under the Elastic License 2.0.

    But it's still a wonderful example of how far you can get with brute-force vector similarity search that has been optimized like crazy using SIMD. Not having to use a fancy index is such a blessing. Think of all the complexities you avoid by not using an index: no extra worries about memory and/or disk usage or insert/delete/update costs, and you can make full use of SQL filters.

    It's crazy to me what kind of vector DBs people put up with. They use custom query languages and have to hold huge indices in memory or write humongous indices to disk, for something that's not necessarily faster than brute-force search on tables with <5M rows. And let's be honest: who has those gigantic tables with more than 5M rows? Not too many, I'd assume.

    • benjiro a day ago

      > And let's be honest who has those gigantic tables with more than 5M rows? Not too many I'd assume.

      /Looks around all innocent ... does 57 billion count? Hate to tell ya, but there are plenty of use cases where, with large datasets, normal table design will get out of hand. And row overhead will bite!

      We ended up writing a specialized encoder/decoder to store the information in bytea fields to reduce it to a measly 3 billion rows (row packing is the better term), but we also lost some of the advantages of having the database's btree index on some of the date fields.

      Thing is, the moment you deal with big data, things start to collapse fast when you weigh brute force vs. storage vs. index handling.

      I can think of a lot more projects that could expand to crazy numbers if you follow basic database normalization.

      • roflcopter69 a day ago

        > /Looks around all innocent ... does 57 billion count?

        That surely counts as a case where brute-force search will not do :) I'm intrigued though: do you really need to search over all those vectors, or could you filter the candidates down to something <5M? As I wrote, this is one of the nice advantages of no-index brute-force search; you can use good ol' SQL WHERE clauses to limit the number of candidates in many cases, and then the brute-force search is not as expensive. Complex indices like HNSW or DiskANN don't play as nicely with filters.
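        The filter-then-scan pattern described above can be sketched like this: an ordinary WHERE clause prunes the candidate set, and only the surviving rows pay the cost of a distance computation (the table, UDF, and data below are hypothetical):

```python
import sqlite3
import struct

def dot(a: bytes, b: bytes) -> float:
    """Inner-product similarity between two float32 BLOBs."""
    va = struct.unpack(f"{len(a) // 4}f", a)
    vb = struct.unpack(f"{len(b) // 4}f", b)
    return sum(x * y for x, y in zip(va, vb))

db = sqlite3.connect(":memory:")
db.create_function("dot", 2, dot, deterministic=True)
db.execute("CREATE TABLE items (id INTEGER, category TEXT, embedding BLOB)")

for i, cat, v in [(1, "news", [1.0, 0.0]),
                  (2, "news", [0.7, 0.7]),
                  (3, "blog", [0.99, 0.1])]:  # id 3: closest overall, wrong category
    db.execute("INSERT INTO items VALUES (?, ?, ?)",
               (i, cat, struct.pack("2f", *v)))

q = struct.pack("2f", 1.0, 0.0)
# The metadata filter composes naturally with the similarity ranking.
best = db.execute(
    "SELECT id FROM items WHERE category = ? "
    "ORDER BY dot(embedding, ?) DESC LIMIT 1",
    ("news", q),
).fetchall()
```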

  • marcobambini a day ago

    I understand your concerns about the license, but our goal was simply to prevent large corporations from taking our work, forking it, and offering it to their customers while we struggle to sustain development. We need to monetize our work in order to survive, though we do offer very generous commercial licenses for those who are interested.

  • bawolff a day ago

    Wait they are literally using the domain name https://sqlite.ai while not having any association with the sqlite authors?

    I know that open source projects rarely protect their trademarks so maybe they are legally in the clear, but this still feels like bordering on fraud.

    Seriously, look at their logo. This seems like a clear attempt to mislead consumers.

    • kg a day ago

      Their company name is apparently SQLite Cloud, inc. and they offer multiple products with SQLite in the name. I guess maybe the sqlite people just don't care about companies using the trademark?

      • bawolff a day ago

        Even if they don't (or maybe lawyers are too expensive to make it worth it) it still seems pretty scummy to me.

        • marcobambini a day ago

          We are backed by the SQLite author (Dr. Richard Hipp), and we have full rights to use the SQLite name.

  • ashvardanian 2 days ago

    My usual line of feedback would be to start with a more aggressive benchmark. Indexing 100K dense vectors (100ish MB here) is not generally a good idea. Brute-force search at that scale is already trivial at 10 GB/s/core.
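    The arithmetic behind that claim, assuming 256-dimensional float32 vectors (which matches the stated ~100 MB corpus; the dimensionality is a guess):

```python
# Back-of-envelope: time for one exhaustive scan of the benchmark corpus.
n_vectors = 100_000
dims = 256                    # assumed; 100K x 256 x 4 B ~ 100 MB
bytes_per_vector = dims * 4   # float32
corpus_bytes = n_vectors * bytes_per_vector

scan_rate = 10e9              # ~10 GB/s on one core, as claimed above
scan_ms = corpus_bytes / scan_rate * 1e3
print(f"{corpus_bytes / 1e6:.0f} MB corpus -> {scan_ms:.2f} ms per query")
```

    Roughly 10 ms per query with no index at all, which is why an index buys little at this scale.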

    • sdenton4 2 days ago

      They say in the post that they're doing optimized brute-force search, which honestly makes a lot of sense for the local-scale context.

      Vector databases are often over-optimized for getting results into papers (where the be-all-end-all measure is recall@latency benchmarking). Indexed search is very rigid - filtered search is a pain, and you're stuck with the metric you used for indexing.

      At smaller data scales, you get a lot more flexibility, and shoving things into an indexed search mindlessly will lead to a lot of pain. Providing optimized, flexible search at smaller scales is quite valuable, imo.

      • ashvardanian 2 days ago

        Ah, I see the article does mention "brute-force-like" — I must have skimmed past that. I'd be curious what exactly is meant by it in practice.

        A small note: since the project seems to include @maratyszcza's fp16 library (MIT), it might be nice to add a line of attribution: https://github.com/maratyszcza/fp16

        And if you ever feel like broadening the benchmarks, it could be interesting to compare with USearch. It has had an SQLite connector for a while and covers a wide set of hardware backends and similarity metrics, though it's not yet as straightforward to use on some OSes: https://unum-cloud.github.io/usearch/sqlite

        • sdenton4 2 days ago

          To be clear, I'm not the author of the post. But I do maintain a library for folks working with large audio datasets, built on a combination of SQLite and usearch. :)

          • dotancohen a day ago

            What library is that? My current project is working with voice recordings. My personal collection of voice recordings spans 20 years and measures in the high tens of GiB.

            • sdenton4 19 hours ago

              Here you go: https://github.com/google-research/perch-hoplite

              It's geared towards bioacoustics in particular. It's pretty easy to provide a wrapper for any given model though. Feel free to send a ping if you try it out for speech; will be happy to hear how it goes.

              • dotancohen 14 hours ago

                Interesting. Audio search isn't a problem I've thought about addressing, as I'll have transcriptions anyway. But knowing this exists might inspire some additional features or use cases that I haven't thought of yet. Thank you.

  • aabhay 2 days ago

    The repo mentions approximate NN search, but the article implies this is mainly brute force. Is there any indexing at all, then? If not, is the approximate part an app-space thing, e.g. storing binary vectors alongside the real ones?

    In addition, if things are brute-forced, wouldn't a columnar DB perform better than a row-based one? E.g. DuckDB?

    • OutOfHere 2 days ago

      A columnar database is completely irrelevant to vector search. Vectors aren't stored in columns. Traditional indexing too is altogether irrelevant because brute force means a full pass through the data. Specialized indexes can be relevant, but then the search is generally approximate, not exact.

      • roflcopter69 2 days ago

        How's a database being columnar irrelevant to vector search? This very vector search extension shows that brute force search can work surprisingly well up to a certain dataset size and at this point columnar storage is great because it gives a perfect memory access pattern for the vector search instead of iterating over all the rows of a table and only accessing the vector of a row.

        • OutOfHere 2 days ago

          That makes sense. I withdraw my comment.

  • mijoharas a day ago

    Just for anyone looking into this, while I haven't tried this solution yet, I can say that sqlite-vec has been very pleasant to work with and I'd recommend it to people who are looking.

  • hbcondo714 2 days ago

    The author replied to one of my comments[1] here on HN a few months ago when I asked about doing ANN on the edge; nice to see it arrive!

    [1] https://news.ycombinator.com/item?id=44063950

  • kamranjon 2 days ago

    I'd be interested to understand the query performance when compared to the HNSW implementation (Turso?) they mentioned. In general search performance is more important to me, and I don't mind having an increase in insert overhead to have very fast vector search.

    • dandanua 2 days ago

      HNSW is approximate, not exact. I guess brute-force means that sqlite-vector returns the true best match.

      • kamranjon 2 days ago

        Right, but libSQL (Turso) uses HNSW, so I'd be curious to know how the performance of sqlite-vector compares. They do say they "use a brute-force-like approach, but highly optimized," which to me would be very interesting to see compared with an HNSW approach.

      • simonw a day ago

        I believe a common approach with this kind of inaccurate index is to use the index to get the top 100 and then calculate the exact distance against those 100 matches to get the top 10.
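        That two-stage idea can be sketched in a few lines; the sign-bit "index" below is just a cheap stand-in for whatever fast approximate structure is actually used:

```python
import math
import random

random.seed(0)
dims = 16
vectors = [[random.uniform(-1, 1) for _ in range(dims)] for _ in range(1000)]
query = [random.uniform(-1, 1) for _ in range(dims)]

def exact_dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def approx_dist(a, b):
    # Stand-in for an inaccurate-but-fast index: 1-bit sign
    # quantization compared with Hamming distance.
    return sum((x >= 0) != (y >= 0) for x, y in zip(a, b))

# Stage 1: cheap approximate scores select a generous candidate pool.
candidates = sorted(range(len(vectors)),
                    key=lambda i: approx_dist(vectors[i], query))[:100]
# Stage 2: exact distances re-rank only those 100 candidates.
top10 = sorted(candidates, key=lambda i: exact_dist(vectors[i], query))[:10]
```

        The recall loss from the approximate stage is mostly recovered as long as the true neighbors survive into the candidate pool.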

  • rcarmo 2 days ago

    Seems to work fine, even if the license is a bit of a put-off.

    I am however still looking for a fast CPU-bound embedding model (fastembed is quite slow on small ARM chips).

  • fifilura 2 days ago

    Honest question, I just want to learn, what are vector databases used for?

    • simonw 2 days ago

      They're useful for embeddings, which let you turn articles (and images and other content) into a huge array of floating point numbers that capture the semantics of the content. Then you can use a vector database to find similar items to each other - or similar items to the user's own search query.

      I wrote a big tutorial about embeddings a couple of years ago which still holds up today: https://simonwillison.net/2023/Oct/23/embeddings/

    • aaronblohowiak 2 days ago

      finding similar things quickly, where the "shape" of a thing can be defined by a vector (like embeddings for instance). this can be used in lots of machine learning applications

      • fifilura 2 days ago

        I figured it would be something like this. And vectors as rows in a regular table would be too slow then?

        • sandyarmstrong 2 days ago

          SQLite does not have native support for a vector-like column type. Extensions like this one and sqlite-vec build on the BLOB column type and provide additional functions to let you efficiently search and manipulate this vector data.
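          A minimal sketch of that BLOB representation, assuming the common raw little-endian float32 layout (which, as I understand it, is what sqlite-vec expects):

```python
import struct

# Serialize a vector to a compact float32 byte string suitable for a BLOB column.
vec = [0.25, -1.5, 3.0]            # values exactly representable in float32
blob = struct.pack(f"<{len(vec)}f", *vec)
assert len(blob) == len(vec) * 4   # 4 bytes per float32

# Round-trip back to Python floats when reading the column.
restored = list(struct.unpack(f"<{len(blob) // 4}f", blob))
```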

    • bawolff a day ago

      Everyone is saying AI, but also things like image similarity search.