One of the only (the only?) commercial grade implementations was launched recently by us at PlanetScale:
https://planetscale.com/blog/announcing-planetscale-vectors-...
No ability to host offline, and for 1/8th CPU + 1GB RAM + 800 GB storage, the price is $1,224/month?
I'm sure it works great, but at that price point, I'm stuck with self-hosting Postgres+pgvector.
Just pointing out that what you're paying for is actually 3x these resources. By default you get a primary server and two replicas with whatever specification you choose. This is primarily for data durability, but you can also send queries to your replicas.
Which works completely fine as long as you know how to manage your own db without getting wrecked!
But yes, it does seem extreme. Then again, it is also cheaper than hiring a dedicated postgres/db guy who will cost 5 to 10x more per month.
There are plenty of set-it-and-forget-it vector dbs right now, maybe too many![0]
[0]https://news.ycombinator.com/item?id=41985176
For sure, I personally use pgvector myself, but I also don't have millions and millions of rows. I haven't messed with anything other than Pinecone so I can't speak to those services, but there's a big difference between a vector db for your own personal use and a chat app/search on a db with millions of users' convos and docs. I'm not sure how well these managed vector DB platforms scale, but you probably need the db guy anyway when you're using vectors at scale. At least I would.
What's the advantage of NN over vectordb anymore? Are we losing some info when we embed?
It works great. We’ve had SPANN in production since October of 2023 at https://turbopuffer.com/
SPANN is also implemented in the open-source Vespa.ai
Actual SPANN, or janky "inspired by SPANN" IVF with HNSW in front? Only real SPANN (with SPTAG, and partitioning designed to work with SPTAG) delivers good results. From a superficial read of the paper it LOOKS like you can achieve similar results by throwing off-the-shelf components at it, but that doesn't actually work well.
Kinda related, hopefully someone here in the comments can help: what's your favorite exact NN search that works on ARM Macs for an in-memory dataset of 100k items with 300 float32 dims per item? Ideally supporting cosine similarity.
Faiss seems like a lot to get going, and I tried n2 but it doesn't seem to want to install via pip... if anyone has a go-to I'd be grateful. Thanks.
For just 100K items, why don't you simply load the embeds into numpy and use cosine similarity directly? It's like 2 lines of code and works well for a "small" number of documents. This would be exact NN search.
Use approximate NN search when you have high volume of searches over millions of vectors.
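To make that concrete, here's a minimal sketch of the exact-search approach (array shapes and names are just for illustration):

    import numpy as np

    def top_k_cosine(embeds, query, k=10):
        # Normalise once so a dot product equals cosine similarity.
        embeds_n = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
        query_n = query / np.linalg.norm(query)
        sims = embeds_n @ query_n                    # one similarity per stored vector
        top = np.argpartition(-sims, k)[:k]          # unordered indices of the k best
        return top[np.argsort(-sims[top])]           # sorted best-first

    # e.g. embeds: (100_000, 300) float32, query: (300,) float32
    # idx = top_k_cosine(embeds, query, k=10)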
If you just want in-memory then PyNNDescent (https://github.com/lmcinnes/pynndescent) can work pretty well. It should install easily with pip, works well at the scales you mention, and supports a large number of metrics, including cosine.
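For reference, typical usage looks roughly like this (from memory of the PyNNDescent docs, so double-check the current API):

    import numpy as np
    import pynndescent

    data = np.random.rand(100_000, 300).astype(np.float32)   # stand-in for real embeddings
    index = pynndescent.NNDescent(data, metric="cosine")
    index.prepare()                                           # build search structures up front
    neighbors, distances = index.query(data[:5], k=10)        # approximate nearest neighbours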
Out of interest is nearest neighbour even remotely effective with 300 dimensions?
Seems to me that unless most of the variation is in only a couple of directions, pretty much no points are going to be anywhere near one another.
So with cosine similarity you're either going to get low scores for pretty much everything or a basic PCA should be able to reduce the dimensionality significantly.
I think you are referring to what's known as the "curse of dimensionality," where as dimensionality increases, the distance between points tends to become more uniform and large. However, nearest neighbor search can still work effectively because of several key factors:
1. Real data rarely occupies the full high-dimensional space uniformly. Instead, it typically lies on or near a lower-dimensional manifold embedded within the high-dimensional space. This is often called the "manifold hypothesis."
2. While distances may be large in absolute terms, _relative_ distances still maintain meaningful relationships. If point A is closer to point B than to point C in this high-dimensional space, that proximity often still indicates semantic similarity.
3. The data points that matter for a given problem often cluster in meaningful ways. Even in high dimensions, these clusters can maintain separation that makes nearest neighbor search useful.
Let me give a concrete example: Consider a dataset of images. While an image might be represented in a very high-dimensional space (e.g., thousands of pixels), images of dogs will tend to be closer to other dog images than to images of cars, even in this high-dimensional space. The meaningful features create a structure that nearest neighbor search can exploit.
Spam filtering is another area where nearest neighbor is used to good effect. If you know that a certain embedding represents a spam message (in any medium - email, comments, whatever), then when other messages come along that are _relatively_ close to it, you may conclude that they are on the right side of the manifold to be considered spam.
You could train a special model to define this manifold, but spam changes all the time and constant retraining doesn't work well.
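A quick way to see the "distances become more uniform" effect for unstructured data, as opposed to real embeddings that sit near a lower-dimensional manifold (just a toy numpy demo):

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 30, 300):
        points = rng.normal(size=(10_000, d))
        query = rng.normal(size=d)
        dists = np.linalg.norm(points - query, axis=1)
        # For i.i.d. random data this ratio creeps toward 1 as d grows,
        # i.e. the nearest and farthest points become almost equally far away.
        print(d, round(dists.min() / dists.max(), 3))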
>> as the "curse of dimensionality," where as dimensionality increases, the distance between points tends to become more uniform and large
Since embeddings are the middle layer of an ANN, doesn't this suggest that too many dimensions are being used during training? I would think a training goal would be to have relatively uniform coverage of the space.
The curse of dimensionality in neural network embeddings actually serves a valuable purpose, contrary to what might seem intuitive. While the tendency for distances to become more uniform and large in high-dimensional spaces might appear problematic, this characteristic helps maintain clear separation between different semantic concepts. Rather than aiming for uniform coverage of the embedding space, neural networks benefit from having distinct clusters with meaningful gaps between them – much like how a library benefits from having clear separation between different subject areas, rather than books being randomly distributed throughout the space.
The high dimensionality of embeddings provides several key advantages during training and inference. The additional dimensions allow the network to better preserve both local and global relationships between data points, while providing the capacity to capture subtle semantic nuances. This extra capacity helps prevent information bottlenecks and provides more pathways for gradient descent during training, leading to more stable optimization. In essence, the high-dimensional nature of embeddings, despite its counterintuitive properties, is a feature rather than a bug in neural network architecture.
TL;DR: High-dimensional embeddings in neural networks are actually beneficial - the extra dimensions help keep different concepts clearly separated and make training more stable, even though the distances between points become more uniform.
When dimensionality increases, so do the distances, but absolute distance doesn't matter; we only care about relative distances between points.
Whether clustering works or not has nothing to do with the dimensionality of the space, and everything to do with the distribution of the points.
>Out of interest is nearest neighbour even remotely effective with 300 dimensions?
It is, or can be, anyway. I had data with 50,000 dimensions, which dimensionality reduction techniques got "down to" around 300! ANN worked very well on those vectors. This was prior to the glut of vector dbs we have available now, so it was all in-memory and used direct library calls to find neighbors.
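The parent doesn't say which reduction technique was used; truncated SVD is one common choice for very wide, sparse data, sketched here purely as an illustration:

    import scipy.sparse as sp
    from sklearn.decomposition import TruncatedSVD

    # Toy stand-in: 10,000 samples with 50,000 mostly-zero features.
    X = sp.random(10_000, 50_000, density=0.001, format="csr", random_state=0)

    svd = TruncatedSVD(n_components=300, random_state=0)
    X_reduced = svd.fit_transform(X)        # shape: (10_000, 300), ready for (A)NN search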
Try Usearch - it's really fast and under rated https://github.com/unum-cloud/usearch
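Typical usage looks roughly like this (from memory of the usearch README, so treat the exact calls as an assumption and check the docs):

    import numpy as np
    from usearch.index import Index

    index = Index(ndim=300, metric="cos")            # cosine similarity
    vectors = np.random.rand(100_000, 300).astype(np.float32)
    index.add(np.arange(len(vectors)), vectors)      # bulk add: keys plus vectors
    matches = index.search(vectors[0], 10)           # 10 nearest neighbours of one vector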
Annoy is old, but works surprisingly well and is fast.
And nowadays Spotify uses voyager https://engineering.atspotify.com/2023/10/introducing-voyage...
And it also looks like there's full support for Mac ARM, so good info, wood_spirit.
100k 300-dimension float32s is less than a gigabyte (100,000 × 300 × 4 bytes ≈ 120 MB). Just use numpy to do the NN search in memory.
Why in memory? What are your latency requirements? I found pgvector to be surprisingly performant.
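For anyone curious what that looks like, a minimal pgvector sketch (table and column names are made up; assumes Postgres with the pgvector extension and psycopg 3):

    import psycopg

    with psycopg.connect("dbname=mydb") as conn:          # hypothetical DSN
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(id bigserial PRIMARY KEY, embedding vector(300))"
        )
        # pgvector accepts a bracketed text literal; <=> is cosine distance.
        query_vec = "[" + ",".join(["0.1"] * 300) + "]"
        rows = conn.execute(
            "SELECT id FROM items ORDER BY embedding <=> %s::vector LIMIT 10",
            (query_vec,),
        ).fetchall()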
Annoy
Can we build an open-source version of this and make it easy for a solo dev to self-host / roll their own?
Already integrated into Bing:
https://www.microsoft.com/en-us/research/uploads/prod/2021/1...
For anyone that wants to see how this compares on ann-benchmarks.com, the project is called 'sptag'.
Hmm, how would you do a statistical hypothesis test on nearest neighbor data? Distribution-free?
What's the hypothesis and the test statistic?
It's been a long time! Glad you are interested. I gave a talk on it at the main NASDAQ computer site.
The null hypothesis is that some one point (a vector in R^n, for the reals R and a positive integer n) comes from an independent, identically distributed set of random variables. The cute part was the use of the distributional property of tightness for the test statistic. The intended first application was monitoring computer systems, improving on the early AI expert systems.
Maybe worth a (2021) tag.