Embeddings are the only aspect of modern AI I'm excited about because they're the only one that gives more power to humans instead of taking it away. They're the "bicycle for our minds" of Steve Jobs fame; intelligence amplification not intelligence replacement. IMO, the biggest improvement in computer usability in my lifetime was the introduction of fast and ubiquitous local search. I use Firefox's "Find in Page" feature probably 10 or more times per day. I use find and grep probably every day. When I read man pages or logs, I navigate by search. Git would be vastly less useful without git grep. Embeddings have the potential to solve the biggest weakness of search by giving us fuzzy search that's actually useful.
I've been experimenting with using embeddings to find relevant git commits, since I often don't know or remember the exact wording that was used.
So I created my own little tool for embedding and finding commits by commit messages. Maybe you'll also find it useful:
https://github.com/adrianmfi/git-semantic-similarity
Very cool, I'll try this out!
Nice! Let me try this out
So you're saying, embeddings are fine, as long as we refrain from making full use of their capabilities? We've hit on a mathematical construct that seems to be able to capture understanding, and you're saying that the biggest models are too big, we need to scale down, only use embeddings for surface-level basic similarities?
I too think embeddings are vastly underutilized, and the chat interface is not the be-all and end-all (not to mention, "chat with your program/PDF/documentation" just sounds plain stupid). However, whether current AI tools are replacing or amplifying your intelligence is entirely down to how you use them.
As for search, yes, that was a huge breakthrough and powerful amplifier. 2+ decades ago. At this point it's computer use 101 - which makes it sad when dealing with programs or websites that are opaque to search, and "ubiquitous local search" is still not here. Embeddings can and hopefully will give us better fuzzy/semantic searching, but if you push this far enough, you'll have to stop and ask - if the search tool is now capable of understanding some aspects of my data, why not surface this understanding as a different view into the data, instead of just invoking it in the background when the user makes a search query?
I have found that embeddings + an LLM is a very successful combination. I'm going to make up the words so as not to reveal my work publicly, but I had to classify something into 3 categories. I asked a simple LLM to label it; it was 95% accurate. Taking the minimum distance from the word embeddings to the mean category embeddings was about 96% accurate. When I gave the LLM the embedding prediction as well, the LLM was 98% accurate.
There were cases an embedding model might not do well on, whereas the LLM could handle them. For example: these were camel-case words, like WoodPecker, AquafinaBottle, and WoodStock (I changed the words to not reveal private data).
WoodPecker and WoodStock would end up with close embedding values because the word Wood dominated the embedding values, but these were supposed to go into 2 different categories.
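To make the nearest-mean-category idea above concrete, here is a minimal sketch using a small sentence-transformers model. The categories and words are made up (the real data above is private), and the model name is just a common default, not what the commenter used:

    from sentence_transformers import SentenceTransformer
    import numpy as np

    # Hypothetical categories with stand-in example words.
    examples = {
        "bird": ["BlueJay", "Sparrow", "Falcon"],
        "beverage": ["ColaCan", "OrangeJuice", "AquafinaBottle"],
        "festival": ["BurningMan", "Coachella", "Lollapalooza"],
    }

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Mean ("centroid") embedding per category.
    centroids = {cat: np.mean(model.encode(words), axis=0) for cat, words in examples.items()}

    def cosine_distance(a, b):
        return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def classify(word):
        v = model.encode(word)
        return min(centroids, key=lambda cat: cosine_distance(v, centroids[cat]))

    # "Wood" can dominate both embeddings, pulling the two words toward the same category.
    print(classify("WoodPecker"), classify("WoodStock"))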
> word Wood dominated the embedding values, but these were supposed to go into 2 different categories
When faced with a similar challenge we developed a custom tokenizer, pretrained a BERT base model[0], and finally trained a SPLADE-esque sparse embedding model[1] on top of that.
[0] - https://huggingface.co/atomic-canyon/fermi-bert-1024
[1] - https://huggingface.co/atomic-canyon/fermi-1024
Do you mind sharing why you chose SPLADE-esque sparse embeddings?
I have been working on embeddings for a while.
For different reasons I have recently become very interested in learned sparse embeddings. So I am curious what led you to choose them for your application, and why?
> Do you mind sharing why you chose SPLADE-esque sparse embeddings?
I'll share what I can share publicly. The first thing we ever do is develop benchmarks, given the uniqueness of the nuclear energy space and our application. In this case it's FermiBench[0].
When working with operating nuclear power plants there are some fairly unique challenges:
1. Document collections tend to be in the billions of pages. When you have regulatory requirements to extensively document EVERYTHING and plants that have been operating for several decades you end up with a lot of data...
2. There are very strict security requirements - generally speaking everything is on-prem and hard air-gapped. We don't have the luxury of cloud elasticity. Sparse embeddings are very efficient, especially in terms of RAM and storage, which matters when factoring in budgetary requirements. We're already dropping in eight H100s (minimum), so it starts to creep up fast...
3. Existing document/record management systems in the nuclear space are keyword search based if they have search at all. This has led to substantial user conditioning - they're not exactly used to what we'd call "semantic search". Sparse embeddings in combination with other techniques bridge that well.
4. Interpretability. It's nice to be able to peek at the embedding and be able to get something out of it at a glance.
So it's basically a combination of efficiency, performance, and meeting users where they are. Our Fermi model series is still v1 but we've found performance (in every sense of the word) to be very good based on benchmarking and initial user testing.
I should also add that some aspects of this (like the BERT pretraining) are fairly compute-intensive. Fortunately we work with the Department of Energy's Oak Ridge National Laboratory and developed all of this on Frontier[1] (for free).
[0] - https://huggingface.co/datasets/atomic-canyon/FermiBench
[1] - https://en.wikipedia.org/wiki/Frontier_(supercomputer)
This is an excellent comment; can someone put it in the highlights?
Some of the best performing embedding models (https://huggingface.co/spaces/mteb/leaderboard) are LLMs. Have you tried them?
> We've hit on a mathematical construct that seems to be able to capture understanding
I’m admittedly unfamiliar with the space, but having just done some reading that doesn’t look to be true. Can you elaborate please and maybe point to some external support for such a bold claim?
> Can you elaborate please and maybe point to some external support for such a bold claim?
SOTA LLMs?
If you think about what, say, a chair or electricity or love are, or what it means for something to be something, etc., I believe you'll quickly realize that words and concepts don't have well-defined meanings. Rather, we define things in terms of other things, which themselves are defined in terms of other things, and so on. There's no atomic meaning, the meaning is in the relationships between the thought and other thoughts.
And that is exactly what those models capture. They're trained by consuming a large amount of text - but not random text, real text - and they end up positioning tokens as points in high-dimensional space. As you increase the number of dimensions, there's eventually enough of them that any relationship between any two tokens (or groups; grouping concepts out of tokens is just another relationship) can be encoded in the latent space as proximity along some vector.
You end up with a real computational artifact that implements the idea of defining concepts only in terms of other concepts. Now, between LLMs and the ability to identify and apply arbitrary concepts with vector math, I believe that's as close to the idea of "understanding" as we've ever come.
That does sound a bit like Peircean semiotics, so I'm with you so far as the general concept of meaning being a sort of iterative construct.
Where I don’t follow is how a bitmap approximation captures that in a semiotic way. As far as I can tell the semiosis still is all occurring in the human observer of the machine’s output. The mathematics still hasn’t captured the interpretant so far as I can see.
Regardless of my possible incomprehension, I appreciate your elucidation. Thanks!
I feel like embeddings will be more powerful for understanding high dimensional physics than language, because a chaotic system's predictability is limited by its compressibility. Therefore an embedding is able to capture exactly how compressible the system is, and therefore can extend the predictability as far as possible.
All modern AI technology can give more power to humans, you just have to use the right tools.
Every AI tool I can think of has made me more productive.
LLMs help me write code faster and understand new libraries, image generation helps me build sites and emails faster, etc
I agree with this view. Generative AI robs us of something (thinking, practicing), namely the long-term ability to practice a skill and improve oneself, in exchange for an immediate (often crappy) result. Embeddings are a tech that can help us solve problems, but we still have to do most of the work.
I ask LLMs to give me exercises, tutorials then write up my experience into "course notes", along with flashcards. I ask it to simulate a teacher, I ask it to simulate students that I have to teach, etc...
I haven't found a tool that is more effective in helping me learn.
Does a player piano rob you of playing music yourself? A car from walking? A wheelbarrow from working out? It’s up to you if you want to stop practicing!
Chess has become even more popular despite computers that can “rob us” of the joy. They’re even better practice partners.
An individual car doesn't stop you from walking but a culture that centers cars leads to cities where walking is outright dangerous.
Most car owners would never say outright "I want a car-centric culture". But car manufacturers lobbied for it, and step by step, we got both the deployment of useful car infrastructure, and the destruction or ignoring of all amenities useful for people walking or cycling.
Now let's go back to the period where cars started to become enormously popular, and cities started to build neighborhoods without sidewalks. There was probably someone at the time complaining about the risk of cars overtaking walking and leading to stores being farther away, etc. And in front of them was probably someone like you calling them a luddite and being oblivious to second-order effects.
I’m not sure it robs us. It makes it possible, but many people including myself find the artistic products of AI to be utterly without value for the reasons you list. I will always cherish the product of lifelong dedication and human skill
It doesn't diminish - but I do find it interesting how it influences. Realism became less important, less interesting, though still valued to a lesser degree, with the ubiquity of photography. Where will human creativity move towards when certain task become trivially machine replicable? Where will human ingenuity _enabled_ by new technology make new art possible?
> I don’t know. After the model has been created (trained), I’m pretty sure that generating embeddings is much less computationally intensive than generating text. But it also seems to be the case that embedding models are trained in similar ways as text generation models, with all the energy usage that implies. I’ll update this section when I find out more.
Although I do care about the environment, this question is completely the wrong one if you ask me. There's an idea in public opinion (mainstream media?) that we should use less AI and that this would somehow solve our climate problems.
As a counterexample, let's go to the extreme. Let's ban Google Maps because it takes computational resources on the phone. As a result more people will take wrong routes, and thus use more petrol. Say you use one gallon of petrol extra; that wastes about 34 kWh. This is the equivalent of running 34 powerful vacuum cleaners on full power for an hour. In contrast, say you downloaded your map; then the total "cost" is only the power used by the phone. A mobile phone has a battery of about 4 Ah (4,000 mAh), so 4 Ah * 4.2 V ≈ 17 Wh, or about 0.017 kWh per full charge. That means the wasted petrol costs roughly 2,000 times as much energy as a full phone charge! And then we didn't even consider the time saved for the human.
It's the same with running embeddings for doc generation. An Nvidia H100 consumes about 700 W, so say 1 kWh after an hour of running. 1 kWh should be enough to do a bunch of embedding runs. If this then saves, for example, one workday including the driving back and forth to the office, then again the tradeoff is highly in favor of the compute.
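For what it's worth, the back-of-the-envelope comparison is easy to write down explicitly. All of these numbers are rough assumptions (petrol energy content, battery capacity, H100 draw), not measurements:

    # Ballpark figures only.
    petrol_gallon_kwh = 34.0                 # energy in one gallon of petrol
    phone_charge_kwh = 4.0 * 4.2 / 1000      # 4 Ah battery * 4.2 V ~= 17 Wh ~= 0.017 kWh
    h100_hour_kwh = 0.7                      # ~700 W drawn for one hour

    print(petrol_gallon_kwh / phone_charge_kwh)  # ~2,000 full phone charges per wasted gallon
    print(petrol_gallon_kwh / h100_hour_kwh)     # ~49 H100-hours per wasted gallon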
Long-term, it's not about barring progress, but about having progress and more energy-efficient models. The zero-sum games we play in regard to energy usage don't necessarily stack up in dynamic systems; the increased energy usage of generative models may very well lead to fewer compute hours spent behind the desk drafting, revising, redrafting and doing it all over again once the next project comes around.
What remains, though, is that increased productivity has rarely led to a decrease in energy usage. Whether energy scarcity will drive model optimisation is anyone's guess, but it would be a differentiating feature in a market saturated with similarly capable offerings.
If people really cared about the environment then they would ban residential air conditioning; it’s a luxury.
I lived somewhere that would get to 40c in the summers and an oscillating fan was good enough to keep cool, the AC was nice to have but it wasn’t necessary.
I find it very hypocritical when people tell you to change your lifestyle for climate change but they have the A/C blasting all day long, everyday.
> I lived somewhere that would get to 40c in the summers and an oscillating fan was good enough to keep cool, the AC was nice to have but it wasn’t necessary.
As others have said, that works if you lived in a very dry area. And perhaps it was a house or a building optimized for airflow. And you didn't have much to do during the day. And I'm guessing you're young, and healthy.
Here in central Europe, sustained 40°C in the summer would rack up a significant body count. Anything above 30°C sucks really bad if you have any work to do. Or if, say, you're pregnant or have small children. And you live in a city.
Residential A/C isn't a luxury anymore, it's becoming a necessity very fast. Fortunately, heat pumps are one of the most efficient inventions of humankind. In particular, "A/C blasting all day long" beats anything else you could do to mitigate the heat if it involved getting into a car. And then it also beats whatever else you're doing to heat your place during the winter.
Prepare to have your mind blown: Heat pumps are very energy efficient at moving heat around. Significantly more so than the oil-fired boiler frequently found in the basement of a big-city apartment building.
> I lived somewhere that would get to 40c in the summers and an oscillating fan was good enough to keep cool
If it's not too humid.
I've lived in places with relatively high humidity and 35-40C, and have had the misfortune of not having an AC. Fans are not enough. I mean, sure, you can survive, but it really, really sucks.
Indeed, and A/C is kind of a prime example of why it's beneficial, given how much more energy-efficient heat pumps are for cooling and heating than just about anything else.
In a somewhat ironic twist, AC is crucial survival infrastructure in some parts of the world when the hot season arrives: Phoenix in the USA, parts of India, etc.
That was a good post. Vector embeddings are in some sense a unique summary of a doc, similar to a hash code of the doc. It makes me think it would be cool if there were some universal standard for generating embeddings, but I guess they'll be different for each AI model, so they can't have the same kind of "permanence" hash codes have.
It definitely also seems like there should be lots of ways to utilize "Cosine Similarity" (or other closeness algos) in databases and other information processing apps that we haven't really exploited yet. For example, you could almost build a new kind of Job Search Service that matches job descriptions to job candidates based on nothing but a vector similarity between resume and job description. That's probably so obvious it's being done already.
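A naive version of that idea is only a few lines with an off-the-shelf model. This sketch assumes a general-purpose sentence-transformers model and toy texts; as the replies below point out, raw similarity alone is a weak ranking signal:

    from sentence_transformers import SentenceTransformer, util

    job = "Senior backend engineer: Java, Spring, PostgreSQL, AWS."
    resumes = [
        "Junior developer with some exposure to Java and Spring.",
        "10 years building Java/Spring services on AWS, deep PostgreSQL experience.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    job_vec = model.encode(job, convert_to_tensor=True)
    resume_vecs = model.encode(resumes, convert_to_tensor=True)

    # Rank candidates purely by cosine similarity to the job description.
    scores = util.cos_sim(job_vec, resume_vecs)[0]
    for resume, score in sorted(zip(resumes, scores.tolist()), key=lambda x: -x[1]):
        print(f"{score:.3f}  {resume}")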
”you could almost build a new kind of Job Search Service that matches job descriptions to job candidates”
The key word being ”almost”. Yes, you can get similarity matches between job requirements and candidate resumes, but those matches are not useful for the task of finding an optimal candidate for a job.
For example, say a job requires A and B.
Candidate 1 is a junior who has done some work with A, B and C.
Candidate 2 is a senior and knows A, B, C, D, E and F by heart. All are relevant to the job and would make 2 the optimal candidate, even though C–F are not explicitly stated in the job requirements.
Candidate 1 would seem a much better candidate than 2, because 1’s embedding vector is closer to the job embedding vector.
> The key word being ”almost”. Yes, you can get similarity matches between job requirements and candidate resumes, but those matches are not useful for the task of finding an optimal candidate for a job.
Text embeddings are not about matching; they are about extracting the semantic topics and the semantic context. Matching comes next, if required.
If an LLM is used to generate the text embeddings, it would «expand» the semantic context for each keyword. E.g. «GenAI» would make the LLM expand the term into directly and loosely related semantic topics, say, «LLM», «NLP» (with lesser relevance though), «artificial intelligence», «statistics» (more distant) and so forth. The generated embeddings will result in a much richer semantic context that allows for straightforward similarity search as well as for exploratory radial search with ease. It also works well across languages, provided the LLM was trained on a linguistically and sufficiently diverse corpus.
Fun fact: I have recently delivered an LLM-assisted (to generate the text embeddings) k-NN similarity search for a client of mine. For the hell of it, we searched for «the meaning of life» in Cantonese, English, Korean, Russian and Vietnamese.
It pulled up the same top search result across the entire dataset for the query in English, Korean and Russian. Effectively, it turned into a Babelfish of search.
Cantonese and Vietnamese versions diverged and were less relevant, as the LLM did not have a substantial corpus in either language. This can easily be fixed in the future by regenerating the text embeddings on the dataset once a new LLM version has been trained on a better corpus in both Cantonese and Vietnamese. The implementation won't have to change.
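The cross-lingual behaviour is easy to reproduce with an open multilingual embedding model (the client system above used LLM-generated embeddings instead, so this is only an illustration of the effect):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    docs = [
        "An essay on why we are here and what makes a life worth living.",
        "A recipe for lemon cake.",
    ]
    queries = {
        "English": "the meaning of life",
        "Korean": "삶의 의미",
        "Russian": "смысл жизни",
    }

    doc_vecs = model.encode(docs, convert_to_tensor=True)
    for lang, q in queries.items():
        hit = util.semantic_search(model.encode(q, convert_to_tensor=True), doc_vecs, top_k=1)[0][0]
        print(lang, "->", docs[hit["corpus_id"]])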
We don't know if Candidate 2 really "knows A, B, C, D, E and F by heart", just that they claim to. They could be adding whatever to their skill list, even though they hardly used it, just because it's a buzzword.
So Candidate 1 could still blow them out of the water in performance, and even be able to trivially learn D and E in a short while on the job if needed.
The skill vector won't tell much by itself, and may even prevent finding the better candidate if it's used for screening.
> We don't know if Candidate 2 really "knows A, B, C, D, E and F by heart", just that they claim to. They could be adding whatever to their skill list, even though they hardly used it, just because it's a buzzword.
That is indeed a problem. I have been thinking about a possible solution to the very same problem for a while.
The fact: people lie on their resumes, and they do it for different reasons. There are white lies (e.g. pumping something up because they aspire to it but were never presented with an opportunity to do it, yet they are eager to skill up, learn and do it if given the chance). Then there are other lies. Generally speaking, lies are never black or white, true or false; they are a shade of grey.
So the best idea I have been able to come up with so far is a hybrid solution that entails text embeddings (for the skills similarity match and search) coupled with sentiment analysis (to score the sincerity of the information stated on a resume) to gain extra insight into the candidate's intentions. Granted, sentiment analysis is an ethically murky area…
Sincerity score on a resume? I can't tell if you're joking or not. I mean yeah, any sentence that ends in something like "...yeah, that's the ticket." would be detectable for sure, but I'm not sure everyone is as bad a liar as Jon Lovitz.
Are you speaking hypothetically or from your own experience? The sentiment analysis is a thing, and it mostly works – I have tested it with satisfactory results on sample datasets. It is relatively easy to extract the emotional context from a corpus of text, less so when it comes to resumes due to their inherently more condensed content. Which is precisely why I mentioned ethical considerations in my previous response. With the extra effort and fine tuning, it should be possible to overcome most of the false negatives though.
Sure AI can detect emotional tones (being positive, being negative, even sarcasm sometimes) in writing, so if you mean something like detecting negativity in a resume so it can be thrown immediately in the trash, then I agree that can work. Any negative emotionality is always a red-flag.
But insofar as detecting lies in sentences, that simply cannot be done, because even if it ever did work the failure rate would still be 99%, so you're better off flipping a coin.
The trick is to evaluate the score for each skill, also weighting it by the years of experience with the skill, then sum the evaluations. This will address your problem 100%.
Also, what a candidate claims as a skill is totally irrelevant and can be a lie. It is the work experience that matters, and skills can be extracted from it.
That's not accurate. You can explicitly bake in these types of search behaviors with model training.
People do this in ecommerce with the concept of user embeddings and product embeddings, where the result of personalized recommendations is just a user embedding search.
> not useful for the task of finding an optimal candidate
That statement is just flat out incorrect on its face; however, it did make me think of something I hadn't thought of before, which is this:
Embedding vectors can be made to have a "scale" (multiplier) on specific terms which represents the amount of "weight" to give that term. For example, if I have 10 years of experience in Java Web Development, then we can take the actual components of that vector embedding (i.e. for the string "Java Web Development") and multiply them by some proportion of 10, and that results in a vector that is "further" in that direction. This represents an "amount" of movement in the Java Web direction.
So this means even with vector embeddings we can scale out to specific amounts of experience. Now here's the cool part. You can then take all THOSE scaled vectors (one for each individual job candidate skill) and average them to get a single point in space which CAN be compared as a single scalar distance from what the Job Requirements specify.
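A rough sketch of that scheme, assuming a generic sentence-transformers model and made-up skills; years scale the unit skill direction, then the scaled vectors are averaged into one profile point as described above:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    candidate = {"Java Web Development": 10, "PostgreSQL": 4}   # skill -> years
    job       = {"Java Web Development": 10, "PostgreSQL": 5}

    def profile_vector(skills):
        vecs = []
        for skill, years in skills.items():
            v = model.encode(skill)
            v = v / np.linalg.norm(v)   # unit direction for the skill
            vecs.append(v * years)      # distance along that direction = years
        return np.mean(vecs, axis=0)    # single point representing the whole profile

    # Smaller distance = closer match under this scheme.
    print(np.linalg.norm(profile_vector(candidate) - profile_vector(job)))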
Then you would have to renormalize the vectors. You really, really want to keep unit-length vectors, because that is the special case where cosine similarity equals the dot product and Euclidean distance becomes a simple monotonic function of it.
I meant the normalized hyperspace direction (unit vector) represents a particular "skill" and the distance into that direction (extending outside the unit hypersphere) is years of experience.
This is geometrically "meaningful", semantically. It would apply to not just a time vector (experience) but in other contexts it could mean other things. Like for example, money invested into a particular sector (Hedge fund apps).
This makes me realize we could design a new type of Perceptron (MLP) where specific scalars for particular things (money, time, etc.) could be wired into the actual NN architecture, in such a way that a specific input "neuron" would be fed a scalar for time, and a different neuron a scalar for money, etc. You'd have to "prefilter" each training input to generate the individual scalars, but then input them into the same "neuron" every time during training. This would have to improve overall "Intelligence" by a big amount.
It just does cosine similarity with OpenAI embeddings + pgVector. It's not perfect by any means, but it's useful. It could probably stand to be improved with a re-ranker, but I just never got around to it.
Very cool. I knew it was too obvious an idea to be missed! Did you read my comments below about how you can maybe "scale up" a vector based on number of years of experience. I think that will work. It makes somebody with 10 yrs Java Experience closer to the target than someone with only 5yrs, if the target is 10 years! -- but the problem is someone with 20yrs looks even worse when they should look better! My problem in my life. hahaha. Too much experience.
I think the best "matching" factor is to minimize total distance where each distance is the time-multiplied vector for a specific skill.
> For example you could almost build a new kind of Job Search Service that matches job descriptions to job candidates based on nothing but a vector similarity between resume and job description. That's probably so obvious it's being done, already.
Literally the next item on my roadmap for employbl dot com lol. we're calling it a "personalized job board" and using PGVector for storing the embeddings. I've also heard good things about Typesense though.
One thing I've found to be important when creating the embeddings is to not do an embedding of the whole job description. Instead, use an LLM to make a concise summary of the job listing (location, skills, etc.) in a structured format, then store the embedding of that summary. It reduces noise and increases accuracy for vector search.
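A sketch of that indexing step with the OpenAI SDK; the prompt, model names, and summary format here are illustrative, not the commenter's actual setup:

    from openai import OpenAI

    client = OpenAI()

    def index_job(description: str):
        # Ask an LLM for a short, structured summary of the listing.
        summary = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": "Summarize this job listing as 'location; skills; seniority':\n" + description,
            }],
        ).choices[0].message.content

        # Embed the summary rather than the raw listing to reduce noise.
        embedding = client.embeddings.create(
            model="text-embedding-3-small",
            input=summary,
        ).data[0].embedding

        return summary, embedding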
Yeah for "Semantic Hashes" (that's a good word for them!) we'd need some sort of "Canonical LLM" model that isn't necessarily used for inference, nor does it need to even be all that smart, but it just needs to be public for the world. It would need to be updated like every 2 to 5 years tho to account for new words or words changing meaning? ...but maybe could be updated in such a way as to not "invalidate" prior vectors, if that makes sense? For example "ride a bicycle" would still point in the same direction even after a refresh of the canonical model? It seems like feeding the same training set could replicate the same model values, but there are nonlinear instabilities which could make it disintegrate.
Maybe the embedding could be paired up with a set of words that embed to somewhere close to the original embedding? Then the embedding can be updated for new models by re-embedding those words. (And it would be more interpretable by a human.)
I mean it was just a thought I had. May be a "solution in search of a problem". I generate those a lot! haha. But it seems to me like having some sort of canonical set of training data and a canonical LLM architecture, we'd end up able to generate consistent embeddings of course, but I'm just not sure what the use cases are.
I guess it might be possible to retroactively create an embeddings model which could take several different models' embeddings, and translate them into the same format.
This is done with two models in most standard biencoder approaches. This is how multimodal embedding search works. We want to train a model such that the location of the text embeddings that represent an item and the image embeddings for that item are colocated.
That metaphor is skipping the most important part in between! You wouldn't be transplanting anything directly, you'd have a separate step in between, which would attempt to translate these action potentials.
The point of the translating model in between would be that it would re-weight each and every one of the values of the embedding, after being trained on a massive dataset of original text -> vector embedding for model A + vector embedding for model B. If you have billions of parameters trained to do this translation between just two specific models to start with, wouldn't this be in the realm of the possible?
A translation between models doesn't seem possible because there are actually no "common dimensions" at all between models. That is, each dimension has a completely different semantic meaning, in different models, but also it's the combination of dimension values that begin to impart real "meaning".
For example, the number of different unit vector combinations in a 1500-dimensional space is like the number of different ways of "ordering" the components, which is 1500! (roughly 5 * 10^4114).
EDIT: And the point of that factorial is that even if the dimensions were "identical" across two different LLMs but merely "scrambled" (in ordering) there would be that large number to contend with to "unscramble".
This is very similar to how LLMs are taught to understand images in llava style models (the image embeddings are encoded into the existing language token stream)
This is a great post. I’ve also been having a lot of fun working with embeddings, with lots of those pages being documentation. We wrote up a quick post on how we are using them in prod, if you want to go from having an embedding to actually using them in a web app:
Thanks, Eric. So what you're really telling me is that you might make an exception to the "no tools talks" general policy for Write The Docs conference talks and let me nerd out on embeddings for 30 mins?? ;P
The thing that puzzles me about embeddings is that they're so untargeted, they represent everything about the input string.
Is there a method for dimensionality reduction of embeddings for different applications? Let's say I'm building a system to find similar tech support conversations and I am only interested in the content of the discussion, not the tone of it.
How could I derive an embedding that represents only content and not tone?
You can do math with word embeddings. A famous example (which I now see has also been mentioned in the article) is to compute the "woman vector" by subtracting "man" from "woman". You can then add the "woman vector" to e.g. the "king" vector to obtain a vector which is somewhat close to "queen".
To adapt this to your problem of ignoring writing style in queries, you could collect a few text samples with different writing styles but same content to compute a "style direction". Then when you do a query for some specific content, subtract the projection of your query embedding onto the style direction to eliminate the style:
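Something like this minimal numpy sketch, where style_dir is whatever direction you computed from your style samples:

    import numpy as np

    def remove_style(query_vec: np.ndarray, style_dir: np.ndarray) -> np.ndarray:
        # Subtract the component of the query embedding that lies along the style direction.
        d = style_dir / np.linalg.norm(style_dir)
        return query_vec - np.dot(query_vec, d) * d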
I suspect this also works with text embeddings, but you might have to train the embedding network in some special way to maximize the effectiveness of embedding arithmetic. Vector normalization might also be important, or maybe not. Probably depends on the training.
Another approach would be to compute a "content direction" instead of a "style direction" and eliminate every aspect of a query that is not content. Depending on what kind of texts you are working with, data collection for one or the other direction might be easier or have more/fewer biases.
And if you feel especially lazy when collecting data to compute embedding directions, you can generate texts with different styles using e.g. ChatGPT. This will probably not work as well as carefully handpicked texts, but you can make up for it with volume to some degree.
Interesting, but your hypothesis assumes that 'tone' is one-dimensional, that there is a single axis you can remove. I think tone is very multidimensional; I'd expect to be removing multiple 'directions' from the embedding.
No, I don’t think the author is saying one dimensional - the vectors are represented by magnitudes in almost all of the embedding dimensions.
They are still a “direction” in the way that [0.5, 0.5] in x,y space is a 45 degree angle, and in that direction it has a magnitude of around 0.7
So of course you could probably define some other vector space where many of the different labeled vectors are translated to magnitudes in the original embedding space, letting you do things like have a “tone” slider.
I think GP is saying that GGP assumes "tone" is one direction, in the sense there exists a vector V representing "tone direction", and you can scale "tone" independently by multiplying that vector with a scalar - hence, 1 dimension.
I'd say this assumption is both right and wrong. Wrong, because it's unlikely there's a direction in embedding space corresponding to a platonic ideal of "tone". Right, because I suspect that, for sufficiently large embedding space (on the order of what goes into current LLMs), any continuous concept we can articulate will have a corresponding direction in the embedding space, that's roughly as sharp as our ability to precisely define the concept.
I would say rather that the "standard example" is simplified, but it does capture an essential truth about the vectors. The surprise is not that the real world is complicated and nothing is simply expressible as a vector and that treating it as such doesn't 100% work in every way in every circumstance all of the time. That's obvious. Everyone who might work with embeddings gets it, and if they don't, they soon will. The surprise is that it does work as well as it does and does seem to be capturing more than a naive skepticism would expect.
You could of course compute multiple "tone" directions for every "tone" you can identify and subtract all of them. It might work better, but it will definitely be more work.
Though not exactly what you are after, Contextual Document Embeddings (https://huggingface.co/jxm/cde-small-v1), which generate embeddings based on "surrounding context" might be of some interest.
With 281M params it's also relatively small (at least for an embedding model) so one can play with it relatively easily.
Depends on the nature of the content you’re working with, but I’ve had some good results using an LLM during indexing to generate a search document by rephrasing the original text in a standardized way. Then you can search against the embeddings of that document, and perhaps boost based on keyword similarity to the original text.
Could you explicitly train a set of embeddings that performed that step in the process? For example, when computing the loss, you compare the difference against the normalized text rather than the original. Or alternatively do this as a fine-tuning step. Then you would have embeddings optimized for the characteristics you care about.
There are a few things you can do. If these access patterns are well known ahead of time, you can train subdomain behavior into the embedding models by using prefixing. E.g. "content: fixing a broken printer", "tone: frustration about broken printer", and plain "fixing a broken printer" can all be served by a single model.
We have customers doing this in production in other contexts.
If you have fundamentally different access patterns (e.g. doc -> doc retrieval instead of query -> doc retrieval) then it's often time to just maintain another embedding index with a different model.
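Prefixing as a convention already shows up in some open models; the E5 family, for example, is trained with "query:" / "passage:" prefixes. A minimal sketch of the general idea (the model choice and prefixes here are illustrative, not the production setup described above):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("intfloat/e5-base-v2")

    q = model.encode("query: how do I fix a broken printer", normalize_embeddings=True)
    docs = model.encode(
        [
            "passage: Steps to troubleshoot a printer that will not print.",
            "passage: A rant about how frustrating broken printers are.",
        ],
        normalize_embeddings=True,
    )

    # Cosine similarity of the query against each prefixed passage.
    print(util.cos_sim(q, docs))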
They don't represent everything. In theory they do but in reality the choice of dimensions is a function of the model itself. It's unique to each model.
I've just begun to dabble with embeddings and LLMs, but recently I've been thinking about trying to use principal component analysis[1] to either project to desirable subspaces, or project out undesirable subspaces.
In your case it would be to take a bunch of texts which roughly mean the same thing but with variance in tone, compute the PCA of the normalized embeddings, take the top axis (or top few) and project it out (i.e. subtract the projection) of the embeddings for the documents you care about before doing the cosine similarity.
Something along those lines.
Could be it's a terrible idea, haven't had time to do much with it yet due to work.
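Roughly what that would look like with scikit-learn, assuming you can collect a few same-content, different-tone variants to estimate the axis from (texts and model are placeholders):

    import numpy as np
    from sklearn.decomposition import PCA
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Hypothetical: the same support issue written in several tones.
    variants = [
        "The printer is broken and I need it fixed.",
        "Hey folks, my printer seems to have given up on life :(",
        "URGENT!!! PRINTER NOT WORKING!!!",
    ]
    X = model.encode(variants, normalize_embeddings=True)

    # Top principal component of same-content/different-tone texts ~ a "tone" axis.
    tone_axis = PCA(n_components=1).fit(X).components_[0]

    def strip_tone(v):
        # components_ rows are already unit length, so this subtracts the projection.
        return v - np.dot(v, tone_axis) * tone_axis

    print(strip_tone(model.encode("My printer is broken!!!", normalize_embeddings=True))[:5])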
> As docs site owners, I wonder if we should start freely providing embeddings for our content to anyone who wants them, via REST APIs or well-known URIs. Who knows what kinds of cool stuff our communities can build with this extra type of data about our docs?
Interesting idea. You'd have to specify the exact embedding model used to generate an embedding, right? Is there a well understood convention for such identification, like say model_name:model_version:model_hash or something? For technical docs (obviously a very broad field), is there an embedding model (or small number of them) widely used or obviously highly suitable, such that a site owner could choose one and have some reasonable expectation that publishing embeddings for their docs generated using that model would be useful to others? (Naive questions, I am not embedded in the field.)
It seems like sharing the text itself would be a better API, since it lets API users calculate their own embeddings easily. This is what the crawlers for search engines do. If they use embeddings internally, that’s up to them, and it doesn’t need to be baked into the protocol.
Could we work toward standardization at some point? Obviously, there will always be a newer model. I just hate that all the embedding work I did was with a now-deprecated OpenAI model. At least individual providers should see an interest in ensuring that for their own model releases. Some trick like matryoshka embeddings could ensure that embeddings from newer models nest within, or work within, the space of an older model, preserving some form of comparability or alignment.
Yeah, this is the main issue with the suggestion. Embeddings can only be compared to each other if they are in the same space (e.g., generated by the same model). Providing embeddings of a specific kind would require users to use the same model, which can quickly become problematic if you're using a closed-source embedding model (like OpenAI's or Cohere's).
Great post indeed! I totally agree that embeddings are underrated. I feel like the "information retrieval/discovery" world is stuck using spears (i.e., term/keyword-based discovery) instead of embracing the modern tools (i.e., semantic-based discovery).
The other day I found myself trying to figure out some common themes across a bunch of comments I was looking at. I felt lazy to go through all of them so I turned my attention to the "Sentence Transformers" lib. I converted each comment into a vector embedding, applied k-means clustering on these embeddings, then gave each cluster to ChatGPT to summarize the corresponding comments. I have to admit, it was fun doing this and saved me lots of time!
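That workflow is short enough to sketch end to end; the model choice, cluster count, comments, and the summarization prompt are all arbitrary here:

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    comments = [
        "The new UI is confusing, I can't find the export button.",
        "Export used to be one click, now it's buried in menus.",
        "Love the dark mode, thanks!",
        "Dark mode is great but the contrast is too low.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    X = model.encode(comments, normalize_embeddings=True)

    k = 2  # number of themes is a guess; something like silhouette score can help pick it
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

    for cluster in range(k):
        members = [c for c, l in zip(comments, labels) if l == cluster]
        # Each cluster's members can then be pasted into an LLM prompt such as:
        # "Summarize the common theme of these comments: ..."
        print(cluster, members)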
Cool, first time I've seen one of my posts trend without me submitting it myself. Hopefully it's clear from the domain name and intro that I'm suggesting technical writers are underrating how useful embeddings can be in our work. I know ML practitioners do not underrate them.
You might want to highlight chunking and how embeddings can/should represent subsections of your document as well. It seems relevant to me for cases like similarity or semantics search, getting the reader to the relevant portion of the document or page.
There are probably some interesting ideas around tokenization and metadata as well. For example, if you’re processing the raw file I expect you want to strip out a lot of markup before tokenization of the content. Conversely, some markup like code blocks or examples would be meaningful for tokenization and embedding anyway.
I wonder if both of those ideas can be combined for something like automated footnotes and annotations. Linking or mouseover relevant content from elsewhere in the documentation.
Do you have any resources you recommend for representing subsections? I'm currently prototyping a note/thoughts editor where one feature is suggesting related documents/thoughts (think linked notes in Obsidian), for which I would like to suggest subsections and not only full documents.
Sorry, no good references off hand. I’ve had to help write & generate public docs in DocBook in the past, but I'm no expert on editors, NLP, or embeddings beyond hacking around some tools for my own note taking. My assumption is you'll want to use your existing markup structure, if you have it. Or naively split on paragraphs with a tool like spaCy. Or get real fancy and use dynamic ranges; something like an accumulation window that aggregates adjacent sentences based on individual similarity, breaks on total size or dissimilarity, and then treats that aggregate as the range to “chunk.”
Thanks for the elaborate and helpful response. I'm also hacking on this as a personal note taking project and already started playing around with your ideas. Thanks!
Haha yeah I was about to comment that I recall a period just after Word2Vec came out where embeddings were most definitely not underrated but rather the most hyped ML thing out there!
Author of txtai (https://github.com/neuml/txtai) here. I've been in the embeddings space since 2020 before the world of LLMs/GenAI.
In principle, I agree with much of the sentiment here. Embeddings can get you pretty far. If the goal is to find information and citations/links, you can accomplish most of that with a simple embeddings/vector search.
GenAI does have an upside in that it can distill and process those results into something more refined. One of the main production use cases is retrieval augmented generation (RAG). The "R" is usually a vector search but doesn't have to be.
As we see with things like ChatGPT search and Perplexity, there is a push towards using LLMs to summarize the results but also linking to the results to increase user confidence. Even Google Search now has that GenAI section at the top. In general, users just aren't going to accept LLM responses without source citations at this point. The question is whether the summary provides value or whether the citations really provide the most value. If it's the latter, then embeddings will get the job done.
Doesn’t the OpenAI embedding model support 8191/8192 tokens? That aside, declaring a winner by token size is misleading. There are more important factors, like cross-language support and precision, for example.
Updated the section to refer to the "Retrieval Average" column of the MTEB leaderboard. Is that the right column to refer to? Can someone link me to an explanation of how that benchmark works? Couldn't find a good link on it
If anything I would consider embeddings a bit overrated, or at least it is safer to underrate them.
They're not the silver bullet many initially hoped for, and they're not a complete replacement for simpler methods like BM25. They only have very limited "semantic understanding" (and as people throw increasingly large chunks into embedding models, the meanings can get even fuzzier).
Overly high expectations let people believe that embeddings will retrieve exactly what they mean, and with larger top-k values and LLMs that are exceptionally good at rationalizing responses, it can be difficult to notice mismatches unless you examine the results closely.
Absolutely. Embeddings have been around a while and most people don’t realize it wasn’t until the e5 series of models from Microsoft that they even benchmarked as well as BM25 in retrieval scores, while being significantly more costly to compute.
I think sparse retrieval with cross encoders doing reranking is still significantly better than embeddings. Embedding indexes are also difficult to scale since hnsw consumes too much memory above a few million vectors and ivfpq has issues with recall.
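For reference, a minimal version of that sparse-retrieve-then-rerank pipeline, using rank_bm25 for the sparse step and a small cross-encoder for reranking (toy documents, whitespace tokenization, and the candidate count are all simplifications):

    from rank_bm25 import BM25Okapi
    from sentence_transformers import CrossEncoder

    docs = [
        "How to restart the database replica after failover.",
        "Notes on database backups and retention policy.",
        "Office coffee machine maintenance schedule.",
    ]
    query = "replica will not come back after failover"

    # Sparse retrieval: BM25 over naive whitespace tokens.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    bm25_scores = bm25.get_scores(query.lower().split())
    candidates = sorted(range(len(docs)), key=lambda i: -bm25_scores[i])[:2]

    # Rerank the candidates with a cross-encoder, which reads query and document together.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    rerank_scores = reranker.predict([(query, docs[i]) for i in candidates])
    best = candidates[max(range(len(candidates)), key=lambda j: rerank_scores[j])]
    print(docs[best])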
Off the shelf embedding models definitely underpromise and overdeliver. In ten years I'd be very surprised if companies weren't fine-tuning embedding models for search based on their data in any competitive domains.
My startup (Atomic Canyon) developed embedding models for the nuclear energy space[0].
Let's just say that if you think off-the-shelf embedding models are going to work well with this kind of highly specialized content you're going to have a rough time.
Nice introduction, but I think that ranking the models purely by their input token limits is not a useful exercise. Looking at the MTEB leaderboard is better (although a lot of the models are probably overfitting to their test set).
This is a good time to shill for my visualization of 5 million embeddings of HN posts, users and comments: https://tomthe.github.io/hackmap/
Thanks, a couple other people gave me this same feedback in another comment thread and it definitely makes sense not to overindex on input token size. Will update that section in a bit.
I was using embeddings to group articles by topic, and hit a specific issue. Say I had 10 articles about 3 topics, and articles are either dry or very casual in tone.
I found clustering by topic was hard, because tone dimensions (whatever they were) seemed to dominate.
How can you pull apart the embeddings? Maybe use an LLM to extract a topic, and then cluster by extracted topic?
In the end I found it easier to just ask an LLM to group articles by topic.
I agree. I tried several methods during my pet project [1], and all of them have their pros and cons. It looks like creating topics first and predicting them using an LLM works the best.
Allegedly, the new hotness in RAG is exactly that. Use a smaller LLM to summarize the article and include that summary alongside the article when generating the embedding.
Potentially solves your issue, but it is also handy when you have to chunk a larger document and would lose context from calculating the embedding just on the chunk.
One quick minor note is that the resulting embeddings for the same text string could be different, depending on what you specify the input type as for retrieval tasks (i.e. query or document) -- check out the `input_type` parameter here: https://docs.voyageai.com/reference/embeddings-api.
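For example, with the Voyage Python client the same string is embedded differently depending on that parameter; the exact client call below is from memory, so treat it as approximate and check the linked docs:

    import voyageai

    vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

    text = ["how to rotate API keys"]
    as_query = vo.embed(text, model="voyage-3", input_type="query").embeddings[0]
    as_doc = vo.embed(text, model="voyage-3", input_type="document").embeddings[0]
    # as_query and as_doc will differ even though the input string is identical.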
Yes. Especially if you work in a not well supported language and/or have specific datapairs you want to match that might be out of ordinary text.
Training your own fine-tune takes very little time and few GPU resources, and you can easily outperform even SOTA models on your specific problem with a smaller model/vector space.
Then again, on general English text doing a basic fuzzy search, I would not really expect high performance gains.
I actually tend to agree. In the article, I didn't see a strong argument highlighting exactly what powerful feature people were missing in relation to embeddings. Those who work in ML probably know these basics.
It is a nice read though - explaining the basics of vector spaces, similarity and how it is used in modern ML applications.
> Hopefully it's clear from the domain name and intro that I'm suggesting technical writers are underrating how useful embeddings can be in our work. I know ML practitioners do not underrate them.
> I didn't see the strong argument highlighting what powerful feature exactly people were missing in relation to embeddings
I had to leave out specific applications as "an exercise for the reader" for various reasons. Long story short, embeddings provide a path to make progress on some of the fundamental problems of technical writing.
Even by ML people from 25 years ago. It’s a black box function that maps from a ~30k space to a ~1k space. It’s a better function than things like PCA, but does the same thing.
0. Generate an embedding of some text, so that you have a known good embedding, this will be your target.
1. Generate an array of random tokens the length of the response you want.
2. Compute the embedding of this response.
3. Pick a random sub-section of the response and randomize the tokens in it again.
4. Compute the embedding of your new response.
5. If the embeddings are closer together, keep your random changes, otherwise discard them, go back to step 2.
6. Repeat this process until going back to step 2 stops improving your score. Also you'll probably want to shrink the size of the sub-section you're randomizing the closer your computed embedding is to your target embedding. Also you might be able to be cleverer by doing some kind of masking strategy? Like let's say the first half of your response text already was actually the true text of the target embedding. An ideal randomizer would see that randomizing the first half almost always makes the result worse, and so would target the 2nd half more often (I'm hoping that embeddings work like this?).
7. Do this N times and use an LLM to score and discard the worst N-1 results. I expect that 99.9% of the time you're basically producing adversarial examples w/ this strategy.
8. Feed this last result into an LLM and ask it to clean it up.
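A toy version of that loop, using characters instead of model tokens and a small sentence-transformers model as the embedding function; it is slow and mostly produces adversarial gibberish, as step 7 anticipates:

    import random
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    target = model.encode("the cat sat on the mat")   # step 0: known-good target embedding
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    text = list(random.choices(alphabet, k=23))       # step 1: random starting "response"
    best = cos(model.encode("".join(text)), target)

    for step in range(2000):
        candidate = text[:]
        i = random.randrange(len(candidate))          # step 3: re-randomize a small sub-section
        for j in range(i, min(i + 3, len(candidate))):
            candidate[j] = random.choice(alphabet)
        score = cos(model.encode("".join(candidate)), target)  # step 4
        if score > best:                              # step 5: keep changes that move us closer
            text, best = candidate, score

    print("".join(text), best)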
Is it not possible? I'm not that familiar with the topic. Doing some sort of averaging over a large corpus of separate texts could be interesting and probably would also have a lot of applications. Let's say that you are gathering feedback from a large group of people and want to summarize it in an anonymized way. I imagine you'd need embeddings with a somewhat large dimensionality though?
Embeddings from things like one-hot, count vectorization, tf-idf, etc into dimensionality reduction techniques like SVD and PCA have been around for a long time and also provided the ability to compare any two pieces of text to each other. Yes, neural networks and LLMs have provided the ability for the context of each word to affect the whole document's embedding and capture more meaning, potentially that pesky "semantic" sort even; but they still are fundamentally a dimensionality reduction technique.
This article really resonates with me - I've heard people (and vector database companies) describe transformer embeddings + vector databases as primarily a solution for "memory/context for your chatbot, to mitigate hallucinations", which seems like a really specific (and kinda dubious, in my experience) use case for a really general tool.
I've found all of the RAG applications I've tried to be pretty underwhelming, but semantic search itself (especially combined with full-text search) is very cool.
I dare say RAG with vector DBs is underwhelming because embeddings are not underrated but appropriately rated, and will not give you relevant info in every case. In fact, the way LLMs retrieve info internally [0] already works along the same principle and is a large factor in their unreliability.
I wonder if this can be used to detect code similarity between e.g. function or files etc.? Or are the existing algorithms overly trained on written prose?
My hot take: embeddings are overrated. They are overfitted on word overlap, leading to both many false positives and false negatives. If you identify a specific problem with them ("I really want to match items like these, but it does not work"), it is almost impossible to fix them. I often see them being used inappropriately, by people who read about their magical properties, but didn't really care about evaluating their results.
I think there is a deeper technical truth to this that hints at how much space there is to be gained in optimization.
1) that matryoshka representations work so well, and as few as 64 dimensions account for a large majority of the performance
2) that dimensional collapse is observed. Look at your cosine similarity scores and be amazed that everything is pretty similar: despite being a -1 to 1 scale, almost nothing is ever less than 0.8 for most models
I think we’re at the infancy in this technology, even with all of the advances in recent years.
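On point 1 above, the matryoshka trick itself is only a couple of lines at query time; this assumes the model was actually trained with a matryoshka-style objective, otherwise truncation just throws information away:

    import numpy as np

    def truncate_embedding(v: np.ndarray, dims: int = 64) -> np.ndarray:
        # Keep only the leading dimensions, then re-normalize so cosine similarity still behaves.
        head = v[:dims]
        return head / np.linalg.norm(head)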
Dude, you are talking total nonsense. What does 100k samples even mean? We have not even established what task we are talking about, and you already know that you need that many samples. Not to be offensive, but you seem like the type of guy who believes in these magical properties.
Many of the top-performing models that you see on the MTEB retrieval for English and Chinese tend to overfit to the benchmark nowadays. voyage-3 and voyage-3-lite are also pretty small in size compared to a lot of the 7B models that take the top spots, and we don't want to hurt performance on other real-world tasks just to do well on MTEB.
We provide retrieval metrics for a variety of datasets and languages: https://blog.voyageai.com/2024/09/18/voyage-3/. I also personally encourage folks to either test on their own data or to find an open source dataset that closely resembles the documents they are trying to search (we provide a ton of free tokens for evaluating our models).
It is unclear this model should be on that leaderboard because we don't know whether it has been trained on mteb test data.
It is worth noting that their own published material [0] does not include any score from any dataset in the MTEB benchmark.
This may sound nitpicky, but considering transformers' parroting capabilities, having seen test data during training should be expected to completely invalidate those scores.
I have made several successful products in the past few years using primarily embeddings and cosine similarity. Can recommend. It’s amazingly effective (compared to what most people are using today anyway).
Mind-blowing. In effect, among humans, what separates the civilized from the crude is the quest for universality among the civilized. To say it differently, thinking in terms of attaining universality is the mark of a civilized mind.
There's attempts but you can only do so much in hundreds/thousands of dimensions. Most of the time the visualization doesn't really provide anything meaningful.
Assuming you have a dimension-reduction or manifold learning tool of choice (UMAP, PaCMAP, t-SNE, PyMDE, etc.), then DataMapPlot (https://datamapplot.readthedocs.io/en/latest/) is a library specifically designed to make visualizations of the outputs of your dimension reduction.
This article shows the incorrect value for the OpenAI text-embedding-3-large Input Limit as 3072 which is actually its output limit [1]. The correct value is 8191 [2].
Edit: This value has now been fixed in the article.
Also, what each model means by a token can be very different due to the use of different model-specific encodings, so ultimately one must compare the number of characters, not tokens.
Embeddings are the only aspect of modern AI I'm excited about because they're the only one that gives more power to humans instead of taking it away. They're the "bicycle for our minds" of Steve Jobs fame; intelligence amplification not intelligence replacement. IMO, the biggest improvement in computer usability in my lifetime was the introduction of fast and ubiquitous local search. I use Firefox's "Find in Page" feature probably 10 or more times per day. I use find and grep probably every day. When I read man pages or logs, I navigate by search. Git would be vastly less useful without git grep. Embeddings have the potential to solve the biggest weakness of search by giving us fuzzy search that's actually useful.
I've been experimenting with using embeddings for finding the relevant git commits, as I often don't know or remember the exact word that was used. So I created my own little tool for embedding and finding commits by commit messages. Maybe you'll also find it useful: https://github.com/adrianmfi/git-semantic-similarity
Very cool, I’ll try this out!
Nice! Let me try this out
So you're saying, embeddings are fine, as long as we refrain from making full use of their capabilities? We've hit on a mathematical construct that seems to be able to capture understanding, and you're saying that the biggest models are too big, we need to scale down, only use embeddings for surface-level basic similarities?
I too think embeddings are vastly underutilized, and chat interface is not the be-all, end-all (not to mention, "chat with your program/PDF/documentation" just sounds plain stupid). However, whether current AI tools are replacing or amplifying your intelligence, is entirely down to how you use them.
As for search, yes, that was a huge breakthrough and powerful amplifier. 2+ decades ago. At this point it's computer use 101 - which makes it sad when dealing with programs or websites that are opaque to search, and "ubiquitous local search" is still not here. Embeddings can and hopefully will give us better fuzzy/semantic searching, but if you push this far enough, you'll have to stop and ask - if the search tool is now capable to understand some aspects of my data, why not surface this understanding as a different view into data, instead of just invoking it in the background when user makes a search query?
I have found that embeddings + LLM is very successful. I'm going to make the words up as to not yield my work publicly, but I had to classify something into 3 categories. I asked a simple llm to label it, it was 95% accurate. taking the min distance from the word embeddings to the mean category embeddings was about 96%. When I gave gave the LLM the embedding prediction, the LLM was 98% accurate.
There were issues an embedding model might not do well on where as the LLM could handle. for example: These were camel case words, like WoodPecker, AquafinaBottle, and WoodStock (I changed the words to not reveal private data). WoodPecker and WoodStock would end up with close embedding values because the word Wood dominated the embedding values, but these were supposed to go into 2 different categories.
> word Wood dominated the embedding values, but these were supposed to go into 2 different categories
When faced with a similar challenge we developed a custom tokenizer, pretrained BERT base model[0], and finally a SPLADE-esque sparse embedding model[1] on top of that.
[0] - https://huggingface.co/atomic-canyon/fermi-bert-1024
[1] - https://huggingface.co/atomic-canyon/fermi-1024
Do you mind sharing why you chose SPLADE-esque sparse embeddings?
I have been working on embeddings for a while.
For different reasons I have recently become very interested in learned sparse embeddings. So I am curious what led you to choose them for your application, and why?
> Do you mind sharing why you chose SPLADE-esque sparse embeddings?
I can provide what I can provide publicly. The first thing we ever do is develop benchmarks given the uniqueness of the nuclear energy space and our application. In this case it's FermiBench[0].
When working with operating nuclear power plants there are some fairly unique challenges:
1. Document collections tend to be in the billions of pages. When you have regulatory requirements to extensively document EVERYTHING and plants that have been operating for several decades you end up with a lot of data...
2. There are very strict security requirements - generally speaking everything is on-prem and hard air-gapped. We don't have the luxury of cloud elasticity. Sparse embeddings are very efficient especially in terms of RAM and storage. Especially important when factoring in budgetary requirements. We're already dropping in eight H100s (minimum) so it starts to creep up fast...
3. Existing document/record management systems in the nuclear space are keyword search based if they have search at all. This has led to substantial user conditioning - they're not exactly used to what we'd call "semantic search". Sparse embeddings in combination with other techniques bridge that well.
4. Interpretability. It's nice to be able to peek at the embedding and be able to get something out of it at a glance.
So it's basically a combination of efficiency, performance, and meeting users where they are. Our Fermi model series is still v1 but we've found performance (in every sense of the word) to be very good based on benchmarking and initial user testing.
I should also add that some aspects of this (like pretrained BERT) are fairly compute-intense to train. Fortunately we work with the Department of Energy Oak Ridge National Laboratory and developed all of this on Frontier[1] (for free).
[0] - https://huggingface.co/datasets/atomic-canyon/FermiBench
[1] - https://en.wikipedia.org/wiki/Frontier_(supercomputer)
This is an excellent comment; can someone put it in the highlights?
Some of the best performing embedding models (https://huggingface.co/spaces/mteb/leaderboard) are LLMs. Have you tried them?
> We've hit on a mathematical construct that seems to be able to capture understanding
I’m admittedly unfamiliar with the space, but having just done some reading that doesn’t look to be true. Can you elaborate please and maybe point to some external support for such a bold claim?
> Can you elaborate please and maybe point to some external support for such a bold claim?
SOTA LLMs?
If you think about what, say, a chair or electricity or love are, or what it means for something to be something, etc., I believe you'll quickly realize that words and concepts don't have well-defined meanings. Rather, we define things in terms of other things, which themselves are defined in terms of other things, and so on. There's no atomic meaning, the meaning is in the relationships between the thought and other thoughts.
And that is exactly what those models capture. They're trained by consuming a large amount of text - but not random text, real text - and they end up positioning tokens as points in high-dimensional space. As you increase the number of dimensions, there's eventually enough of them that any relationship between any two tokens (or groups; grouping concepts out of tokens is just another relationship) can be encoded in the latent space as proximity along some vector.
You end up with a real computational artifact that implements the idea of defining concepts only in terms of other concepts. Now, between LLMs and the ability to identify and apply arbitrary concepts with vector math, I believe that's as close to the idea of "understanding" as we've ever come.
That does sound a bit like Peircian semiotic so I’m with you so far as the general concept of meaning being a sort of iterative construct.
Where I don’t follow is how a bitmap approximation captures that in a semiotic way. As far as I can tell the semiosis still is all occurring in the human observer of the machine’s output. The mathematics still hasn’t captured the interpretant so far as I can see.
Regardless of my possible incomprehension, I appreciate your elucidation. Thanks!
I feel like embeddings will be more powerful for understanding high-dimensional physics than language, because a chaotic system's predictability is limited by its compressibility. An embedding captures exactly how compressible the system is, and can therefore extend predictability as far as possible.
All modern AI technology can give more power to humans, you just have to use the right tools. Every AI tool I can think of has made me more productive.
LLMs help me write code faster and understand new libraries, image generation helps me build sites and emails faster, etc
I agree with this view. Generative AI robs us of something (thinking, practicing), namely the long-term ability to practice a skill and improve oneself, in exchange for an immediate (often crappy) result. Embeddings are a tech that can help us solve problems, but we still have to do most of the work.
I ask LLMs to give me exercises, tutorials then write up my experience into "course notes", along with flashcards. I ask it to simulate a teacher, I ask it to simulate students that I have to teach, etc...
I haven't found a tool that is more effective in helping me learn.
Great for learning for learning sake. Learning with the intention of pursuing a career requires the economic/job model too, which is the problem.
Does a player piano rob you of playing music yourself? A car from walking? A wheelbarrow from working out? It’s up to you if you want to stop practicing!
Chess has become even more popular despite computers that can “rob us” of the joy. They’re even better practice partners.
An individual car doesn't stop you from walking but a culture that centers cars leads to cities where walking is outright dangerous.
Most car owners would never say outright "I want a car-centric culture". But car manufacturers lobbied for it, and step by step, we got both the deployment of useful car infrastructure, and the destruction or ignoring of all amenities useful for people walking or cycling.
Now let's go back to the period where cars start to become enormously popular, and cities start to build neighborhoods without sidewalks. There was probably someone at the time complaining about the risk of cars overtaking walking and leading to stores being more far away etc. And in front of them was probably someone like you calling them a luddite and being oblivious of second order effects.
Land is a shared, zero-sum resource: a parking lot is not a park.
Your software development methodology is your own. Why does someone else’s use of a tool deprive you of doing things the way you want?
I’m not sure it robs us. It makes it possible, but many people including myself find the artistic products of AI to be utterly without value for the reasons you list. I will always cherish the product of lifelong dedication and human skill
It doesn't diminish - but I do find it interesting how it influences. Realism became less important, less interesting, though still valued to a lesser degree, with the ubiquity of photography. Where will human creativity move towards when certain task become trivially machine replicable? Where will human ingenuity _enabled_ by new technology make new art possible?
there is fzf, depending on your definition of "useful"
> Is it terrible for the environment?
> I don’t know. After the model has been created (trained), I’m pretty sure that generating embeddings is much less computationally intensive than generating text. But it also seems to be the case that embedding models are trained in similar ways as text generation models2, with all the energy usage that implies. I’ll update this section when I find out more.
Although I do care about the environment, this question is completely the wrong one if you ask me. There is in public opinion (mainstream media?) some kind of idea that we should use less AI and that this would somehow solve our climate problems.
As a counterexample, let's go to the extreme. Let's ban Google Maps because it takes computational resources from the phone. As a result more people will take wrong routes, and thus use more petrol. Say you use one gallon of petrol extra, that then wastes about 34 kWh. This is of course the equivalent of running 34 powerful vacuum cleaners on full power for an hour. In contrast, say you downloaded your map, then the total "cost" is only the power used by the phone. A mobile phone has a battery of about 4,000 mAh, so 4 Ah * 4.2 V is roughly 17 Wh, or about 0.017 kWh per full charge. This means that the phone is roughly 2,000 times as efficient! And then we didn't even consider the time saved for the human.
It's the same with running embeddings for doc generation. An Nvidia H100 consumes about 700 W, so say 1 kWh after an hour of running. 1 kWh should be enough to do a bunch of embedding runs. If this then saves, for example, one workday including the driving back and forth to the office, then again the tradeoff is highly in favor of the compute.
Long-term, it's not about barring progress, but about having both progress and more energy-efficient models. The zero-sum games we play in regard to energy usage don't necessarily stack up in dynamic systems; the increased energy usage of generative models may very well lead to fewer hours spent behind the desk drafting, revising, redrafting and doing it all over again once the next project comes around.
What remains, though, is that increased productivity has rarely led to a decrease in energy usage. Whether energy scarcity will drive model optimisation is anyone's guess, but it would be a differentiating feature in a market saturated with similarly capable offerings.
If people really cared about the environment then they would ban residential air conditioning, it’s a luxury.
I lived somewhere that would get to 40c in the summers and an oscillating fan was good enough to keep cool, the AC was nice to have but it wasn’t necessary.
I find it very hypocritical when people tell you to change your lifestyle for climate change but they have the A/C blasting all day long, everyday.
> I lived somewhere that would get to 40c in the summers and an oscillating fan was good enough to keep cool, the AC was nice to have but it wasn’t necessary.
As others have said, that works if you lived in a very dry area. And perhaps it was a house or a building optimized for airflow. And you didn't have much to do during the day. And I'm guessing you're young, and healthy.
Here in central Europe, sustained 40°C in the summer would rack up a significant body count. Anything above 30°C sucks really bad if you have any work to do. Or if, say, you're pregnant or have small children. And you live in a city.
Residential A/C isn't a luxury anymore, it's becoming a necessity very fast. Fortunately, heat pumps are one of the most efficient inventions of humankind. In particular, "A/C blasting all day long" beats anything else you could do to mitigate the heat if it involved getting into a car. And then it also beats whatever else you're doing to heat your place during the winter.
Prepare to have your mind blown: Heat pumps are very energy efficient at moving heat around. Significantly more so than the oil-fired boiler frequently found in the basement of a big-city apartment building.
> I lived somewhere that would get to 40c in the summers and an oscillating fan was good enough to keep cool
If it's not too humid.
I've lived in places with relatively high humidity and 35-40C, and have had the misfortune of not having an AC. Fans are not enough. I mean, sure, you can survive, but it really, really sucks.
“Reduce” was never going to work. Only the deep electrification of our economy will save us.
Indeed, and A/C is kind of a prime example of why it's beneficial, given how much more energy-efficient heat pumps are for cooling and heating than just about anything else.
> that would get to 40c in the summers and an oscillating fan was good enough to keep cool
This is entirely meaningless without providing the humidity. At higher than 70% relative humidity 40C is potentially fatal.
It was pretty humid too, a fan was ok. Americans are just very weak and are babies when it comes to temperature.
in a somewhat ironic twist, ac is crucial survival infrastructure in some parts of the world when heat comes during hot seasons. phoenix usa, parts of india, etc
AC is never crucial, you just want to be comfortable at the expense of the planet. How did people live there for centuries before AC? Magic?
By not working when it's too hot to work, or by simply not living there, both of which are strictly verboten solutions in hypercapitalism.
That was a good post. Vector embeddings are in some sense a unique summary of a doc, similar to a hashcode of a doc. It makes me think it would be cool if there were some universal standard for generating embeddings, but I guess they'll be different for each AI model, so they can't have the same kind of "permanence" hash codes have.
It definitely also seems like there should be lots of ways to utilize "Cosine Similarity" (or other closeness algos) in databases and other information processing apps that we haven't really exploited yet. For example you could almost build a new kind of Job Search Service that matches job descriptions to job candidates based on nothing but a vector similarity between resume and job description. That's probably so obvious it's being done, already.
”you could almost build a new kind of Job Search Service that matches job descriptions to job candidates”
The key word being ”almost”. Yes, you can get similarity matches between job requirements and candidate resumes, but those matches are not useful for the task of finding an optimal candidate for a job.
For example, say a job requires A and B.
Candidate 1 is a junior who has done some work with A, B and C.
Candidate 2 is a senior and knows A, B, C, D, E and F by heart. All are relevant to the job and would make 2 the optimal candidate, even though C–F are not explicitly stated in the job requirements.
Candidate 1 would seem a much better candidate than 2, because 1’s embedding vector is closer to the job embedding vector.
> The key word being ”almost”. Yes, you can get similarity matches between job requirements and candidate resumes, but those matches are not useful for the task of finding an optimal candidate for a job.
Text embeddings are not about matching, they are about extracting the semantic topics and the semantic context. Matching comes next, if required.
If a LLM is used to generate the text embeddings, it would «expand» the semantic context for each keyword. E.g. «GenAI» would make the LLM expand the term into directly and loosely related semantic topics, say, «LLM», «NLP» (with a lesser relevance though), «artificial intelligence», «statistics» (more distant) and so forth. The generated embeddings will result in a much richer semantic context that will allow for straightforward similarity search as well as for exploratory radial search with ease. It also works well across languages, provided the LLM had a linguistically and sufficiently diverse corpus it was trained on.
Fun fact: I have recently delivered a LLM assisted (to generate text embeddings) k-NN similarity search for a client of mine. For the hell of it, we searched for «the meaning of life» in Cantonese, English, Korean, Russian and Vietnamese.
It pulled up the same top search result across the entire dataset for the query in English, Korean and Russian. Effectively, it turned into a Babelfish of search.
Cantonese and Vietnamese versions diverged and were less relevant as the LLM did not have a substantial corpus in either language. This can be easily fixed in the future, once a new LLM version trained on a better corpus in both Cantonese and Vietnamese becomes available, by regenerating the text embeddings on the dataset. The implementation won't have to change.
Even that is just static information.
We don't know if Candidate 2 really "knows A, B, C, D, E and F by heart", just that they claim to. They could be adding whatever to their skill list, even though they hardly used it, just because it's a buzzword.
So Candidate 1 could still blow them out of the water in performance, and even be able to trivially learn D, and E in a short while on the job if needed.
The skill vector won't tell much by itself, and may even prevent finding the better candidate if it's used for screening.
> We don't know if Candidate 2 really "knows A, B, C, D, E and F by heart", just that they claim to. They could be adding whatever to their skill list, even though they hardly used it, just because it's a buzzword.
That is indeed a problem. I have been thinking about a possible solution to the very same problem for a while.
The fact: people lie on their resumes, and they do it for different reasons. There are white lies (e.g. pumping something up because they aspire to it but were never given an opportunity to do it, yet they are eager to skill themselves up, learn and do it if given the chance). Then there are other lies. Generally speaking, lies are never black or white, true or false; they are a shade of grey.
So the best idea I have been able to come up with so far is a hybrid solution that entails the text embeddings (the skills similarity match and search) coupled with the sentiment analysis (to score the sincerity of the information stated on a resume) to gain an extra insight into the candidate's intentions. Granted, the sentiment analysis is an ethically murky area…
Sincerity score on a resume? I can't tell if you're joking or not. I mean yeah, any sentence that ends in something like "...yeah, that's the ticket." would be detectable for sure, but I'm not sure everyone is as bad a liar as Jon Lovitz.
Are you speaking hypothetically or from your own experience? The sentiment analysis is a thing, and it mostly works – I have tested it with satisfactory results on sample datasets. It is relatively easy to extract the emotional context from a corpus of text, less so when it comes to resumes due to their inherently more condensed content. Which is precisely why I mentioned ethical considerations in my previous response. With the extra effort and fine tuning, it should be possible to overcome most of the false negatives though.
Sure AI can detect emotional tones (being positive, being negative, even sarcasm sometimes) in writing, so if you mean something like detecting negativity in a resume so it can be thrown immediately in the trash, then I agree that can work. Any negative emotionality is always a red-flag.
But insofar as detecting lies in sentences, that simply cannot be done, because even if it ever did work the failure rate would still be 99%, so you're better off flipping a coin.
So your point is that LLMs can't tell when job candidates are lying on their resume? Well that's true, but neither can humans. lol.
The trick is to evaluate a score for each skill, also weighing it by the years of experience with the skill, then sum the evaluations. This will address your problem 100%.
Also, what a candidate claims as a skill is totally irrelevant and can be a lie. It is the work experience that matters, and skills can be extracted from it.
That's not accurate. You can explicitly bake in these types of search behaviors with model training.
People do this in ecommerce with the concept of user embeddings and product embeddings, where the result of personalized recommendations is just a user embedding search.
> not useful for the task of finding an optimal candidate
That statement is just flat-out incorrect on its face. However, it did make me think of something I hadn't thought of before, which is this:
Embedding vectors can be made to have a "scale" (multiplier) on specific terms which represents the amount of "weight" to give that term. For example, if I have 10 years of experience in Java Web Development, then we can take the actual components of that vector embedding (i.e. for the string "Java Web Development") and multiply them by some proportion of 10, and that results in a vector that extends "further" in that direction. This represents an "amount" of pull in the Java Web direction.
So this means even with vector embeddings we can scale out to specific amounts of experience. Now here's the cool part. You can then take all THOSE scaled vectors (one for each individual job candidate skill) and average them to get a single point in space which CAN be compared as a single scalar distance from what the Job Requirements specify.
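Roughly, in numpy terms, a sketch of that scaling-and-averaging idea (embed() is just a deterministic stand-in so the snippet runs; swap in a real embedding model, and the skills and years are made up):

    import numpy as np

    def embed(text, dim=384):
        # Stand-in for a real embedding model returning unit-length vectors;
        # a deterministic pseudo-random vector just so the sketch is runnable.
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        v = rng.normal(size=dim)
        return v / np.linalg.norm(v)

    def profile_vector(skills_with_years):
        # Scale each skill's unit vector by years of experience, then average
        # the scaled vectors into a single point in the embedding space.
        scaled = [years * embed(skill) for skill, years in skills_with_years]
        return np.mean(scaled, axis=0)

    job = profile_vector([("Java web development", 10), ("SQL", 5)])
    candidate = profile_vector([("Java web development", 5), ("SQL", 5), ("React", 2)])

    # The deliberate scaling means the vectors are no longer unit-length,
    # so compare with plain Euclidean distance rather than cosine similarity.
    print(np.linalg.norm(job - candidate))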
Then you would have to renormalize the vectors. You really, really want to keep them unit-length, because that is the special case where cosine similarity equals the dot product and Euclidean distance becomes a monotonic function of both.
I meant the normalized hyperspace direction (unit vector) represents a particular "skill" and the distance into that direction (extending outside the unit hypersphere) is years of experience.
This is geometrically "meaningful", semantically. It would apply to not just a time vector (experience) but in other contexts it could mean other things. Like for example, money invested into a particular sector (Hedge fund apps).
This makes me realize we could design a new type of Perceptron (MLP) where specific scalars for particular things (money, time, etc.) could be wired into the actual NN architecture, in such a way that a specific input "neuron" would be fed a scalar for time, and a different neuron a scalar for money, etc. You'd have to "prefilter" each training input to generate the individual scalars, but then input them into the same "neuron" every time during training. This would have to improve overall "Intelligence" by a big amount.
It does exist! I built this for the monthly Who's Hiring threads: https://hnresumetojobs.com/
It just does cosine similarity with OpenAI embeddings + pgVector. It's not perfect by any means, but it's useful. It could probably stand to be improved with a re-ranker, but I just never got around to it.
Very cool. I knew it was too obvious an idea to be missed! Did you read my comments below about how you can maybe "scale up" a vector based on number of years of experience. I think that will work. It makes somebody with 10 yrs Java Experience closer to the target than someone with only 5yrs, if the target is 10 years! -- but the problem is someone with 20yrs looks even worse when they should look better! My problem in my life. hahaha. Too much experience.
I think the best "matching" factor is to minimize total distance where each distance is the time-multiplied vector for a specific skill.
> For example you could almost build a new kind of Job Search Service that matches job descriptions to job candidates based on nothing but a vector similarity between resume and job description. That's probably so obvious it's being done, already.
Literally the next item on my roadmap for employbl dot com lol. we're calling it a "personalized job board" and using PGVector for storing the embeddings. I've also heard good things about Typesense though.
One thing I've found to be important when creating the embeddings is to not embed the whole job description. Instead, use an LLM to make a concise summary of the job listing (location, skills, etc.) in a structured format, then store the embedding of that summary. It reduces noise and increases accuracy for vector search.
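For what it's worth, the indexing step can be sketched like this (the prompt, the llm callable and the store are placeholders; only the "summarize first, then embed the summary" shape is the point):

    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    SUMMARY_PROMPT = (
        "Summarize this job listing as terse plain text fields: "
        "title, location, seniority, top skills.\n\n"
    )

    def index_listing(listing, llm, store):
        # llm: any callable that takes a prompt string and returns text (placeholder).
        summary = llm(SUMMARY_PROMPT + listing)
        vector = embedder.encode(summary, normalize_embeddings=True)
        # store: e.g. a pgvector-backed table; keep the raw listing for display.
        store.add(vector=vector, summary=summary, document=listing)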
For one point of inspiration, see https://entropicthoughts.com/determining-tag-quality
I really like the picture you are drawing with "semantic hashes"!
Yeah for "Semantic Hashes" (that's a good word for them!) we'd need some sort of "Canonical LLM" model that isn't necessarily used for inference, nor does it need to even be all that smart, but it just needs to be public for the world. It would need to be updated like every 2 to 5 years tho to account for new words or words changing meaning? ...but maybe could be updated in such a way as to not "invalidate" prior vectors, if that makes sense? For example "ride a bicycle" would still point in the same direction even after a refresh of the canonical model? It seems like feeding the same training set could replicate the same model values, but there are nonlinear instabilities which could make it disintegrate.
Maybe the embedding could be paired up with a set of words that embed to somewhere close to the original embedding? Then the embedding can be updated for new models by re-embedding those words. (And it would be more interpretable by a human.)
I mean it was just a thought I had. May be a "solution in search of a problem". I generate those a lot! haha. But it seems to me like having some sort of canonical set of training data and a canonical LLM architecture, we'd end up able to generate consistent embeddings of course, but I'm just not sure what the use cases are.
I guess it might be possible to retroactively create an embeddings model which could take several different models' embeddings, and translate them into the same format.
This is done with two models in most standard biencoder approaches. This is how multimodal embedding search works. We want to train a model such that the location of the text embeddings that represent an item and the image embeddings for that item are colocated.
No. That’s like saying you can transplant a person’s neuronal action potentials into another person’s brain and have it make sense to them.
That metaphor is skipping the most important part in between! You wouldn't be transplanting anything directly, you'd have a separate step in between, which would attempt to translate these action potentials.
The point of the translating model in between would be that it would re-weight each and every one of the values of the embedding, after being trained on a massive dataset of original text -> vector embedding for model A + vector embedding for model B. If you have billions of parameters trained to do this translation between just two specific models to start with, wouldn't this be in the realm of the possible?
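The simplest possible version of such a translator, a single linear map fit on paired embeddings of the same texts, is easy to sketch (whether it preserves enough structure in practice is exactly the open question; A and B here are assumed matrices of embeddings of the same N texts from two different models):

    import numpy as np

    def fit_translator(A, B):
        # Least-squares linear map W such that A @ W approximates B,
        # where A is (N, d1) and B is (N, d2) for the same N texts.
        W, *_ = np.linalg.lstsq(A, B, rcond=None)
        return W

    def translate(vec_a, W):
        # Map a model-A embedding into model B's space and renormalize.
        v = vec_a @ W
        return v / np.linalg.norm(v)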
A translation between models doesn't seem possible because there are actually no "common dimensions" at all between models. That is, each dimension has a completely different semantic meaning, in different models, but also it's the combination of dimension values that begin to impart real "meaning".
For example, the number of different unit vector combinations in a 1500-dimensional space is like the number of different ways of "ordering" the components, which is 1500! (a number on the order of 10^4114).
EDIT: And the point of that factorial is that even if the dimensions were "identical" across two different LLMs but merely "scrambled" (in ordering) there would be that large number to contend with to "unscramble".
This is very similar to how LLMs are taught to understand images in llava style models (the image embeddings are encoded into the existing language token stream)
This is definitely possible. I made something like this. It worked pretty well for cosine similarity in my testing.
I tried doing something like that: https://gettjalerts.com/
I added semantic search, but I'm working on adding resume upload/parsing to do automatic matching.
This is a great post. I've also been having a lot of fun working with embeddings, with lots of those pages being documentation. We wrote up a quick post on how we are using them in prod, if you want to go from having an embedding to actually using them in a web app:
https://www.ethicalads.io/blog/2024/04/using-embeddings-in-p...
Thanks, Eric. So what you're really telling me is that you might make an exception to the "no tools talks" general policy for Write The Docs conference talks and let me nerd out on embeddings for 30 mins?? ;P
Haha. I think they are definitely relevant, and I’d call them a technology more than a tool.
That is mostly just that we don’t want folks going up and doing a 30 minute demo of Sphinx or something :-)
The thing that puzzles me about embeddings is that they're so untargeted, they represent everything about the input string.
Is there a method for dimensionality reduction of embeddings for different applications? Let's say I'm building a system to find similar tech support conversations and I am only interested in the content of the discussion, not the tone of it.
How could I derive an embedding that represents only content and not tone?
You can do math with word embeddings. A famous example (which I now see has also been mentioned in the article) is to compute the "woman vector" by subtracting "man" from "woman". You can then add the "woman vector" to e.g. the "king" vector to obtain a vector which is somewhat close to "queen".
To adapt this to your problem of ignoring writing style in queries, you could collect a few text samples with different writing styles but same content to compute a "style direction". Then when you do a query for some specific content, subtract the projection of your query embedding onto the style direction to eliminate the style:
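In numpy terms, something like this (embed() stands in for whatever embedding model you use; the two sample sets are texts you collected that say the same things in different tones):

    import numpy as np

    def style_direction(embed, casual_texts, formal_texts):
        # Average the embeddings of each style group and take the difference;
        # that difference is the "style direction".
        casual = np.mean([embed(t) for t in casual_texts], axis=0)
        formal = np.mean([embed(t) for t in formal_texts], axis=0)
        d = casual - formal
        return d / np.linalg.norm(d)

    def remove_style(vec, style_dir):
        # Subtract the component of vec that points along the style direction.
        return vec - np.dot(vec, style_dir) * style_dir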
I suspect this also works with text embeddings, but you might have to train the embedding network in some special way to maximize the effectiveness of embedding arithmetic. Vector normalization might also be important, or maybe not. Probably depends on the training.
Another approach would be to compute a "content direction" instead of a "style direction" and eliminate every aspect of a query that is not content. Depending on what kind of texts you are working with, data collection for one or the other direction might be easier or have more/fewer biases.
And if you feel especially lazy when collecting data to compute embedding directions, you can generate texts with different styles using e.g. ChatGPT. This will probably not work as well as carefully handpicked texts, but you can make up for it with volume to some degree.
Interesting, but your hypothesis assumes that 'tone' is one-dimensional, that there is a single axis you can remove. I think tone is very multidimensional, I'd expect to be removing multiple 'directions' from the embedding.
No, I don’t think the author is saying one dimensional - the vectors are represented by magnitudes in almost all of the embedding dimensions.
They are still a “direction” in the way that [0.5, 0.5] in x,y space is a 45 degree angle, and in that direction it has a magnitude of around 0.7
So of course you could probably define some other vector space where many of the different labeled vectors are translated to magnitudes in the original embedding space, letting you do things like have a “tone” slider.
I think GP is saying that GGP assumes "tone" is one direction, in the sense there exists a vector V representing "tone direction", and you can scale "tone" independently by multiplying that vector with a scalar - hence, 1 dimension.
I'd say this assumption is both right and wrong. Wrong, because it's unlikely there's a direction in embedding space corresponding to a platonic ideal of "tone". Right, because I suspect that, for sufficiently large embedding space (on the order of what goes into current LLMs), any continuous concept we can articulate will have a corresponding direction in the embedding space, that's roughly as sharp as our ability to precisely define the concept.
I would say rather that the "standard example" is simplified, but it does capture an essential truth about the vectors. The surprise is not that the real world is complicated and nothing is simply expressible as a vector and that treating it as such doesn't 100% work in every way in every circumstance all of the time. That's obvious. Everyone who might work with embeddings gets it, and if they don't, they soon will. The surprise is that it does work as well as it does and does seem to be capturing more than a naive skepticism would expect.
You could of course compute multiple "tone" directions for every "tone" you can identify and subtract all of them. It might work better, but it will definitely be more work.
Though not exactly what you are after, Contextual Document Embeddings (https://huggingface.co/jxm/cde-small-v1), which generate embeddings based on "surrounding context" might be of some interest.
With 281M params it's also relatively small (at least for an embedding model) so one can play with it relatively easily.
Depends on the nature of the content you’re working with, but I’ve had some good results using an LLM during indexing to generate a search document by rephrasing the original text in a standardized way. Then you can search against the embeddings of that document, and perhaps boost based on keyword similarity to the original text.
This is also often referred to as Hypothetical Document Embeddings (https://arxiv.org/abs/2212.10496).
Do you have examples of this? Please say more!
Nice workaround. I just wish there was a less 'lossy' way to go about it!
Could you explicitly train a set of embeddings that performed that step in the process? For example, when computing the loss, you compare the difference against the normalized text rather than the original. Or alternatively do this as a fine-tuning step. Then you would have embeddings optimized for the characteristics you care about.
Normal full-text search stuff helps reduce the search space, e.g. lemmatization, stemming, and query simplification were all around way before LLMs.
There are a few things you can do. If these access patterns are well known ahead of time, you can train subdomain behavior into the embedding models by using prefixing. E.g. content: fixing a broken printer, tone: frustration about broken printer, and "fixing a broken printer" can all be served by a single model.
We have customers doing this in production in other contexts.
If you have fundamentally different access patterns (e.g. doc -> doc retrieval instead of query -> doc retrieval) then it's often time to just maintain another embedding index with a different model.
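For the prefixing point, the e5 family is a public example of the same mechanism, where a prefix baked in at training time selects the behavior at query time (a custom fine-tune could use prefixes like "content:" or "tone:" in the same way; the error message below is made up):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/e5-small-v2")

    # The prefixes below are the ones this model family was trained with.
    doc = model.encode("passage: Printer shows error 0x81 after the firmware update")
    query = model.encode("query: fixing a broken printer")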
They don't represent everything. In theory they do but in reality the choice of dimensions is a function of the model itself. It's unique to each model.
Yeah, 'everything' as in 'everything that the model cares about' :)
I've just begun to dabble with embeddings and LLMs, but recently I've been thinking about trying to use principal component analysis[1] to either project onto desirable subspaces, or project out undesirable subspaces.
In your case it would be to take a bunch of texts which roughly mean the same thing but with variance in tone, compute the PCA of the normalized embeddings, take the top axis (or top few) and project it out (i.e. subtract the projection) of the embeddings for the documents you care about before doing the cosine similarity.
Something along those lines.
Could be it's a terrible idea, haven't had time to do much with it yet due to work.
[1]: https://en.wikipedia.org/wiki/Principal_component_analysis
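A rough sketch of that idea with scikit-learn (the tone-varied sample embeddings are something you would have to collect yourself; this is only the shape of the approach):

    import numpy as np
    from sklearn.decomposition import PCA

    def tone_axes(tone_varied_embeddings, n_axes=2):
        # Rows are embeddings of texts that say roughly the same thing in
        # different tones, so the dominant variance should be tone.
        X = np.asarray(tone_varied_embeddings)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        return PCA(n_components=n_axes).fit(X).components_

    def project_out(vec, axes):
        # Subtract the projection onto each tone axis, then renormalize.
        for axis in axes:
            vec = vec - np.dot(vec, axis) * axis
        return vec / np.linalg.norm(vec)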
Agreed.. biggest problem with off the shelf embeddings I hit. Need a way to decompose embeddings.
You could fine-tune the embedding model to reduce cosine distance on a more specific function.
https://technicalwriting.dev/data/embeddings.html#let-a-thou...
> As docs site owners, I wonder if we should start freely providing embeddings for our content to anyone who wants them, via REST APIs or well-known URIs. Who knows what kinds of cool stuff our communities can build with this extra type of data about our docs?
Interesting idea. You'd have to specify the exact embedding model used to generate an embedding, right? Is there a well understood convention for such identification like say model_name:model_version:model_hash or something? For technical docs, obviously very broad field, is there an embedding model (or small number) widely used or obviously highly suitable that a site ownwer could choose one and have some reasonable expectation that publishing embeddings for their docs generated using that model would be useful to others? (Naive questions, I am not embedded in the field.)
It seems like sharing the text itself would be a better API, since it lets API users calculate their own embeddings easily. This is what the crawlers for search engines do. If they use embeddings internally, that’s up to them, and it doesn’t need to be baked into the protocol.
Could we work toward standardization at some point? Obviously, there will always be a newer model. I just hate that all the embedding work I did was with a now-deprecated OpenAI model. At the least, individual providers should see an interest in ensuring this for their own model releases. Some trick like Matryoshka embeddings could ensure that embeddings from newer models nest within, or work within, the space of the older model, preserving some form of comparability or alignment.
Yeah, this is the main issue with the suggestion. Embeddings can only be compared to each other if they are in the same space (e.g., generated by the same model). Providing embeddings of a specific kind would require users to use the same model, which can quickly become problematic if you're using a closed-source embedding model (like OpenAI's or Cohere's).
Great post indeed! I totally agree that embeddings are underrated. I feel like the "information retrieval/discovery" world is stuck using spears (i.e., term/keyword-based discovery) instead of embracing the modern tools (i.e., semantic-based discovery).
The other day I found myself trying to figure out some common themes across a bunch of comments I was looking at. I felt lazy to go through all of them so I turned my attention to the "Sentence Transformers" lib. I converted each comment into a vector embedding, applied k-means clustering on these embeddings, then gave each cluster to ChatGPT to summarize the corresponding comments. I have to admit, it was fun doing this and saved me lots of time!
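For anyone curious, the whole pipeline is only a few lines; a sketch along the lines of what's described above (model choice and cluster count are arbitrary, and the toy comments are obviously made up):

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    comments = [
        "The docs were easy to follow, great job.",
        "Install instructions are missing a step on Windows.",
        "Loved the new search feature!",
        "Search results feel irrelevant half the time.",
        "Please add more examples to the API reference.",
        "The API reference examples really helped me.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(comments, normalize_embeddings=True)

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

    clusters = {}
    for comment, label in zip(comments, labels):
        clusters.setdefault(label, []).append(comment)

    # Each clusters[label] can then be pasted into a prompt like
    # "Summarize the common theme of these comments: ..."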
Interesting approach. Did you tell GPT to summarise the comments of each cluster after grouping them?
Cool, first time I've seen one of my posts trend without me submitting it myself. Hopefully it's clear from the domain name and intro that I'm suggesting technical writers are underrating how useful embeddings can be in our work. I know ML practitioners do not underrate them.
You might want to highlight chunking and how embeddings can/should represent subsections of your document as well. It seems relevant to me for cases like similarity or semantics search, getting the reader to the relevant portion of the document or page.
Theres probably some interesting ideas around tokenization and metadata as well. For example, if you’re processing the raw file I expect you want to strip out a lot of markup before tokenization of the content. Conversely, some markup like code blocks or examples would be meaningful for tokenization and embedding anyways.
I wonder if both of those ideas can be combined for something like automated footnotes and annotations. Linking or mouseover relevant content from elsewhere in the documentation.
Do you have any resources you recommend for representing sub sections? I'm currently prototyping a note/thoughts editor where one feature is suggesting related documents/thoughts (think linked notes in Obsidian) for which I would like to suggest sub sections and not only full documents.
Sorry, no good references off hand. I've had to help write & generate public docs in DocBook in the past, but I'm no expert on editors, NLP, or embeddings beyond hacking around some tools for my own note taking. My assumption is you'll want to use your existing markup structure, if you have it. Or naively split on paragraphs with a tool like spaCy. Or get real fancy and use dynamic ranges; something like an accumulation window that aggregates adjacent sentences based on individual similarity, breaks on total size or dissimilarity, and then treats that aggregate as the range to "chunk."
Thanks for the elaborate and helpful response. I'm also hacking on this as a personal note taking project and already started playing around with your ideas. Thanks!
Haha yeah I was about to comment that I recall a period just after Word2Vec came out where embeddings were most definitely not underrated but rather the most hyped ML thing out there!
Yeah embeddings are the unsung killer feature of LLMs
Author of txtai (https://github.com/neuml/txtai) here. I've been in the embeddings space since 2020 before the world of LLMs/GenAI.
In principle, I agree with much of the sentiment here. Embeddings can get you pretty far. If the goal is to find information and citations/links, you can accomplish most of that with a simple embeddings/vector search.
GenAI does have an upside in that it can distill and process those results into something more refined. One of the main production use cases is retrieval augmented generation (RAG). The "R" is usually a vector search but doesn't have to be.
As we see with things like ChatGPT search and Perplexity, there is a push towards using LLMs to summarize the results but also linking to the results to increase user confidence. Even Google Search now has that GenAI section at the top. In general, users just aren't going to accept LLM responses without source citations at this point. The question is whether the summary provides value or whether the citations really provide the most value. If it's the latter, then embeddings will get the job done.
Doesn’t OpenAI embedding model support 8191/8192 tokens? That aside, declaring a winner by token size is misleading. There are more important factors like cross language support and precision for example
Yep, voyage-3 is not even anywhere in the top of the MTEB leaderboard if you order by `retrieval score` desc.
stella_en_1.5B_v5 seems to be an unsung hero model in that regard
plus you may not even want such large token sizes if you just need accurate retrieval of snippets of text (like 1-2 sentences)
Thanks thund and jdthedisciple for these points and corrections. I'll update the section today.
Updated the section to refer to the "Retrieval Average" column of the MTEB leaderboard. Is that the right column to refer to? Can someone link me to an explanation of how that benchmark works? Couldn't find a good link on it
And that's not all because token encodings of different models can be very different.
If anything I would consider embeddings a bit overrated, or at least it is safer to underrate them.
They're not the silver bullet many initially hoped for, and they're not a complete replacement for simpler methods like BM25. They only have very limited "semantic understanding" (and as people throw increasingly large chunks into embedding models, the meanings can get even fuzzier).
Overly high expectations lead people to believe that embeddings will retrieve exactly what they mean, and with larger top-k values and LLMs that are exceptionally good at rationalizing responses, it can be difficult to notice mismatches unless you examine the results closely.
Absolutely. Embeddings have been around a while and most people don’t realize it wasn’t until the e5 series of models from Microsoft that they even benchmarked as well as BM25 in retrieval scores, while being significantly more costly to compute.
I think sparse retrieval with cross encoders doing reranking is still significantly better than embeddings. Embedding indexes are also difficult to scale since hnsw consumes too much memory above a few million vectors and ivfpq has issues with recall.
Off the shelf embedding models definitely underpromise and overdeliver. In ten years I'd be very surprised if companies weren't fine-tuning embedding models for search based on their data in any competitive domains.
My startup (Atomic Canyon) developed embedding models for the nuclear energy space[0].
Let's just say that if you think off-the-shelf embedding models are going to work well with this kind of highly specialized content you're going to have a rough time.
[0] - https://huggingface.co/atomic-canyon/fermi-1024
> they're not a complete replacement for simpler methods like BM25
There are embedding approaches that balance "semantic understanding" with BM25-ish.
They're still pretty obscure outside of the information retrieval space but sparse embeddings[0] are the "most" widely used.
[0] - https://zilliz.com/learn/sparse-and-dense-embeddings
Nice introduction, but I think that ranking the models purely by their input token limits is not a useful exercise. Looking at the MTEB leaderboard is better (although a lot of the models are probably overfitting to their test set).
This is a good time to shill my visualization of 5 million embeddings of HN posts, users and comments: https://tomthe.github.io/hackmap/
Thanks, a couple other people gave me this same feedback in another comment thread and it definitely makes sense not to overindex on input token size. Will update that section in a bit.
I was using embeddings to group articles by topic, and hit a specific issue. Say I had 10 articles about 3 topics, and articles are either dry or very casual in tone.
I found clustering by topic was hard, because tone dimensions ( whatever they were ) seemed to dominate.
How can you pull apart the embeddings? Maybe use an LLM to extract a topic, and then cluster by extracted topic?
In the end I found it easier to just ask an LLM to group articles by topic.
I agree, I tried several methods during my pet project [1], and all of them have their pros and cons. Looks like creating topics first and predicting them using LLM works the best
[1] https://eamag.me/2024/Automated-Paper-Classification
Allegedly, the new hotness in RAG is exactly that. Use a smaller LLM to summarize the article and include that summary alongside the article when generating the embedding.
Potentially solves your issue, but it is also handy when you have to chunk a larger document and would lose context from calculating the embedding just on the chunk.
Great post!
One quick minor note is that the resulting embeddings for the same text string could be different, depending on what you specify the input type as for retrieval tasks (i.e. query or document) -- check out the `input_type` parameter here: https://docs.voyageai.com/reference/embeddings-api.
Is there any benefit to fine-tuning a model on your corpus before using it to generate embeddings? Would that improve the quality of the matches?
Yes. Especially if you work in a not well supported language and/or have specific datapairs you want to match that might be out of ordinary text.
Training your own fine-tune takes very little time and GPU resources, and you can easily outperform even SOTA models on your specific problem with a smaller model/vector space.
Then again, on general English text doing a basic fuzzy search, I would not really expect big performance gains.
Is there some way to compare different embeddings for different use cases?
Search for MTEB Leaderboard on huggingface
Underrated by people who are unfamiliar with machine learning, maybe.
I actually tend to agree. In the article, I didn't see a strong argument highlighting what powerful feature exactly people are missing in relation to embeddings. Those who work in ML probably know these basics.
It is a nice read though - explaining the basics of vector spaces, similarity and how it is used in modern ML applications.
> Hopefully it's clear from the domain name and intro that I'm suggesting technical writers are underrating how useful embeddings can be in our work. I know ML practitioners do not underrate them.
https://news.ycombinator.com/item?id=42014036
> I didn't see the strong argument highlighting what powerful feature exactly people were missing in relation to embeddings
I had to leave out specific applications as "an exercise for the reader" for various reasons. Long story short, embeddings provide a path to make progress on some of the fundamental problems of technical writing.
Thank you for the explanation; yes, I later encountered your answer and upvoted it.
> I had to leave out specific applications as "an exercise for the reader"
This is very unfortunate. Would be very interesting to hear some intel :)
LLMs have nearly completely sucked the oxygen out of the room when it comes to machine learning or "AI".
I'm shocked at the number of startups, etc you see trying to do RAG, etc that basically have no idea what they are, how they actually work, etc.
The "R" in RAG stands for retrieval - as in the entire field of information retrieval. But let's ignore that and skip right to the "G" (generative)...
Garbage in, garbage out people!
Even by ML people from 25 years ago. It's a black box function that maps from a ~30k-dimensional space to a ~1k-dimensional space. It's a better function than things like PCA, but it does the same kind of thing.
What would be really cool if somebody figured out how to do embeddings -> text.
Hmm as a very stupid first pass...
0. Generate an embedding of some text, so that you have a known good embedding, this will be your target.
1. Generate an array of random tokens the length of the response you want.
2. Compute the embedding of this response.
3. Pick a random sub-section of the response and randomize the tokens in it again.
4. Compute the embedding of your new response.
5. If the embeddings are closer together, keep your random changes, otherwise discard them, go back to step 2.
6. Repeat this process until going back to step 2 stops improving your score. Also you'll probably want to shrink the size of the sub-section you're randomizing the closer your computed embedding is to your target embedding. Also you might be able to be cleverer by doing some kind of masking strategy? Like let's say the first half of your response text already was actually the true text of the target embedding. An ideal randomizer would see that randomizing the first half almost always makes the result worse, and so would target the 2nd half more often (I'm hoping that embeddings work like this?).
7. Do this N times and use an LLM to score and discard the worst N-1 results. I expect that 99.9% of the time you're basically producing adversarial examples w/ this strategy.
8. Feed this last result into an LLM and ask it to clean it up.
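A toy version of steps 1 to 6, with a fake stand-in embedder and a tiny vocabulary so it actually runs (a real attempt would call an actual embedding model and operate on its tokens, and would almost certainly need the LLM cleanup steps):

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran", "fast", "slow"]

    def embed(tokens, dim=64):
        # Stand-in: average of deterministic pseudo-random word vectors.
        vecs = []
        for t in tokens:
            r = np.random.default_rng(abs(hash(t)) % 2**32)
            vecs.append(r.normal(size=dim))
        v = np.mean(vecs, axis=0)
        return v / np.linalg.norm(v)

    target = embed("the cat sat on a mat".split())    # step 0: known-good target

    tokens = list(rng.choice(VOCAB, size=6))          # step 1: random start
    best = float(embed(tokens) @ target)

    for _ in range(5000):                             # steps 2-6: hill climbing
        i = int(rng.integers(len(tokens)))
        j = int(rng.integers(i, len(tokens))) + 1
        candidate = tokens.copy()
        candidate[i:j] = list(rng.choice(VOCAB, size=j - i))  # re-randomize a sub-span
        score = float(embed(candidate) @ target)
        if score > best:                              # keep only improvements
            tokens, best = candidate, score

    print(" ".join(tokens), best)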
We'd be happy to sponsor research on this topic. If interested, email me.
Is it not possible? I'm not that familiar with the topic. Doing some sort of averaging over a large corpus of separate texts could be interesting and probably would also have a lot of applications. Let's say that you are gathering feedback from a large group of people and want to summarize it in an anonymized way. I imagine you'd need embeddings with a somewhat large dimensionality though?
Reconstruct text from SONAR embeddings: https://github.com/facebookresearch/SONAR?tab=readme-ov-file...
I wonder if someone has already tried to do that. Though this might go in a similar direction: https://arxiv.org/abs/1711.00043
That's chatgpt
Embeddings from things like one-hot, count vectorization, tf-idf, etc into dimensionality reduction techniques like SVD and PCA have been around for a long time and also provided the ability to compare any two pieces of text to each other. Yes, neural networks and LLMs have provided the ability for the context of each word to affect the whole document's embedding and capture more meaning, potentially that pesky "semantic" sort even; but they still are fundamentally a dimensionality reduction technique.
This article really resonates with me - I've heard people (and vector database companies) describe transformer embeddings + vector databases as primarily a solution for "memory/context for your chatbot, to mitigate hallucinations", which seems like a really specific (and kinda dubious, in my experience) use case for a really general tool.
I've found all of the RAG applications I've tried to be pretty underwhelming, but semantic search itself (especially combined with full-text search) is very cool.
I dare say RAG with vector DBs is underwhelming because embeddings are not underrated but appropriately rated, and will not give you relevant info in every case. In fact, the way LLMs retrieve info internally [0] already works along the same principle and is a large factor in their unreliability.
[0] https://nonint.com/2023/10/18/is-the-reversal-curse-a-genera...
I wonder if this can be used to detect code similarity between e.g. function or files etc.? Or are the existing algorithms overly trained on written prose?
Yes, of course it can be used in that way, but the quality of the result depends on whether the model was also trained on such code or not.
My hot take: embeddings are overrated. They are overfitted on word overlap, leading to both many false positives and false negatives. If you identify a specific problem with them ("I really want to match items like these, but it does not work"), it is almost impossible to fix them. I often see them being used inappropriately, by people who read about their magical properties, but didn't really care about evaluating their results.
I think there is a deeper technical truth to this that hints at how much space there is to be gained in optimization.
1) that Matryoshka representations work so well, and that as few as 64 dimensions account for a large majority of the performance (see the sketch below this comment)
2) that dimensional collapse is observed. Look at your cosine similarity scores and be amazed that everything is pretty similar and despite being a -1 to 1 scale, almost nothing is ever less than 0.8 for most models
I think we’re at the infancy in this technology, even with all of the advances in recent years.
"I really want to match items like these, but it does not work" is just a fine tuning problem.
Yes, in a sense that if you have infinite appropriate dataset and compute. No, in a sense what is practically achievable.
You don't need infinite data. You need ~100k samples. It's also not particularly expensive.
Dude, you are talking total nonsense. What does "100k samples" even mean? We have not even established what task we are talking about and you already know that you need that many samples. Not to be offensive, but you seem like the type of guy who believes in these magical properties.
… you established the task earlier? Item X and Item Y should be colocated in embedding space.
This is what people using embedding models for recsys are doing. It’s not rocket science and it doesn’t require “infinite data”.
By 100k samples I mean 100k samples that provide relevance feedback. 100k positive pairs.
I’m working on these kinds of problems with actual customers. Not really sure where the hostility is coming from.
You can easily fix this using embedding arithmetic to build embedding classifiers.
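The simplest version of that is a centroid ("prototype") classifier built from a handful of labeled examples per class (embed() here is whatever embedding model you are already using, assumed to return unit-length vectors):

    import numpy as np

    def build_classifier(examples_by_class, embed):
        # Average the embeddings of the labeled examples for each class,
        # then renormalize each centroid to unit length.
        centroids = {
            label: np.mean([embed(t) for t in texts], axis=0)
            for label, texts in examples_by_class.items()
        }
        return {k: v / np.linalg.norm(v) for k, v in centroids.items()}

    def classify(text, centroids, embed):
        v = embed(text)
        # Pick the class whose centroid has the highest cosine similarity.
        return max(centroids, key=lambda label: float(v @ centroids[label]))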
Are there good examples of this working in the wild? Before I comb through all ten blue links... https://www.google.com/search?q=embedding%20arithmetic%20emb...
I'm not sure why the voyage-3 models aren't on the MTEB leaderboard. The code for the leaderboard suggests they should be there: https://huggingface.co/spaces/mteb/leaderboard/commit/b7faae...
But I don't see them when I filter the list for 'voyage'.
(I work at Voyage)
Many of the top-performing models that you see on the MTEB retrieval for English and Chinese tend to overfit to the benchmark nowadays. voyage-3 and voyage-3-lite are also pretty small in size compared to a lot of the 7B models that take the top spots, and we don't want to hurt performance on other real-world tasks just to do well on MTEB.
> we don't want to hurt performance on other real-world tasks just to do well on MTEB
Nice!
Fortunately MTEB lets you sort by model parameter size because using 7B parameter LLMs for embeddings is just... Yuck.
It would still be great to know how it compares?
Why should I pick voyage-3 if for all I know it sucks when it comes to retrieval accuracy (my personally most important metric)?
We provide retrieval metrics for a variety of datasets and languages: https://blog.voyageai.com/2024/09/18/voyage-3/. I also personally encourage folks to either test on their own data or to find an open source dataset that closely resembles the documents they are trying to search (we provide a ton of free tokens for the evaluating our models).
It is unclear whether this model should be on that leaderboard, because we don't know whether it has been trained on MTEB test data.
It is worth noting that their own published material [0] does not include any score from any dataset in the MTEB benchmark.
This may sound nitpicky, but considering transformers' parroting capabilities, having seen test data during training should be expected to completely invalidate those scores.
[0] see excel spreadsheet linked here https://blog.voyageai.com/2024/09/18/voyage-3/
I'm critical of the low number of embedding dims.
Could hurt performance in niche applications, in my estimation.
Looking forward to trying the announced large models, though.
Embeddings are indeed great. I have been using them a lot.
Even wrote about it at: https://blog.dobror.com/2024/08/30/how-embeddings-make-your-...
I have made several successful products in the past few years using primarily embeddings and cosine similarity. Can recommend. It’s amazingly effective (compared to what most people are using today anyway).
Embeddings are a new jump to universality, like the alphabet or numbers. https://thebeginningofinfinity.xyz/Jump%20to%20Universality
Mind-blowing. In effect, among humans, what separates the civilized from the crude is the quest for universality. To say it differently, thinking in terms of attaining universality is the mark of a civilized mind.
I made an episode to appreciate the book: https://podcasters.spotify.com/pod/show/podgenai/episodes/Th...
The title of the post says they are underrated, but it doesn't provide any real justification beyond saying that they are good for X.
I am not denying their usefulness, but the framing is misleading.
Are there any visualization libraries that visualize embeddings in a vector space?
UMAP: https://umap-learn.readthedocs.io/en/latest/
scikit-learn also has options: https://scikit-learn.org/stable/auto_examples/manifold/plot_...
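If it helps, here is a minimal UMAP sketch; the random data is a placeholder standing in for a real embedding matrix:

    # Minimal sketch: project high-dimensional embeddings to 2D with UMAP and
    # scatter-plot them. Random placeholder data stands in for real embeddings.
    import numpy as np
    import umap
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(500, 384))   # replace with your (N x D) embedding matrix
    labels = rng.integers(0, 5, size=500)      # replace with labels/clusters to color by

    reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric="cosine")
    coords = reducer.fit_transform(embeddings)

    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
    plt.title("Embeddings projected to 2D with UMAP")
    plt.show()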
My instinct would be principal component analysis (which someone has demonstrated here: https://www.youtube.com/watch?app=desktop&v=brt88wwoZtI). Not sure it would tell you much, but it looks nice.
There have been attempts, but you can only do so much with hundreds or thousands of dimensions. Most of the time the visualization doesn't really provide anything meaningful.
Assuming you have a dimension-reduction or manifold-learning tool of choice (UMAP, PaCMAP, t-SNE, PyMDE, etc.), DataMapPlot (https://datamapplot.readthedocs.io/en/latest/) is a library specifically designed to visualize the outputs of your dimension reduction.
If you need them visualized, you're already on the wrong track.
It's fun to try and guess what semantic concepts might be captured within individual dimensions / pairs of dimensions of the embeddings space.
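A toy version of that game, assuming any off-the-shelf sentence embedding model and a handful of made-up sentences:

    # Toy sketch: for a chosen dimension, list which sentences activate it most,
    # to guess what concept (if any) it captures. Model, sentences, and the
    # dimension index are arbitrary.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = ["a bowl of ramen", "quarterly earnings report", "a red fox",
                 "gradient descent", "a thunderstorm at sea", "the tax deadline"]
    emb = model.encode(sentences, normalize_embeddings=True)

    dim = 42  # arbitrary dimension to inspect
    for i in np.argsort(emb[:, dim])[::-1][:3]:
        print(f"{emb[i, dim]:+.3f}  {sentences[i]}")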
This reminds me: I gotta go back and reread Borges's short stories with ML theory in mind.
Is it accurate to say that any data that can be tokenized can be turned into embeddings?
This article shows an incorrect value for the OpenAI text-embedding-3-large input limit: 3072, which is actually its output limit [1]. The correct value is 8191 [2].
Edit: This value has now been fixed in the article.
[1] https://platform.openai.com/docs/models/embeddings#embedding...
[2] https://platform.openai.com/docs/guides/embeddings/#embeddin...
Also, what each model means by a token can be very different due to the use of different model-specific encodings, so ultimately one must compare the number of characters, not tokens.
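For example, with tiktoken the same string yields different counts under different encodings (the encodings named here are OpenAI's; other vendors' tokenizers differ again):

    # Sketch: the same text tokenizes to different counts under different
    # encodings, so a token limit corresponds to a different amount of text per model.
    import tiktoken

    text = "The quick brown fox jumps over the lazy dog. " * 10

    for name in ("cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(name)
        print(name, len(enc.encode(text)), "tokens for", len(text), "characters")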
"Reckless" seems a bit aggressive for what is likely an honest mistake in an otherwise very nice article.
Edited.
A couple other issues with that section surfaced here:
* https://news.ycombinator.com/item?id=42014683
* https://news.ycombinator.com/item?id=42015282
Updating that section now