Semantic Search With Vectors
If you’ve been following the latest news in search, you’ve probably heard about vector search.
And you may have even started to dig into the topic to try to learn more about it, only to come out the other end confused. Didn’t you leave that math back in college?
Building vector search is difficult. Understanding it doesn’t have to be.
Just as important is understanding that vector search alone isn’t the future. Hybrid search is.
What Are Vectors?
When we talk about vectors in the context of machine learning, we mean this: Vectors are groups of numbers that represent something.
That thing could be an image, a word, or nearly anything.
The questions, of course, are why those vectors are useful and how they are created.
Let’s look first at where those vectors come from. The short answer: Machine learning.
Jay Alammar has perhaps the best blog post ever written on what vectors are.
As a summary, though, machine learning models take items as input (let’s assume just words from here on out) and try to figure out the best formulas to predict something else.
For example, you may have a model that takes in the word “bee,” and it is trying to figure out the best formulas that will accurately predict that “bee” is seen in similar contexts as “insects” and “wasps.”
Once that model has that best formula, it can transform the word “bee” into a group of numbers that just so happen to be similar to the group of numbers for “insects” and “wasps.”
Why Vectors Are Powerful
Vectors are really powerful for this reason: Large language models like Generative Pre-trained Transformer 3 (GPT-3) or those from Google take into account billions of words and sentences, so they can start to make these connections and become really intelligent.
It’s easy to understand why people are so excited to apply that intelligence to search.
Some are even saying that vector search will replace the keyword search we’ve known and loved for decades.
The thing is, though, that vector search is not replacing keyword search wholesale. To think that keyword search won’t retain immense value is to place too much faith in the new and shiny.
Vector search and keyword search each have their own strengths, and they work best when they work together.
Vector Search For Long Tail Queries
If you work in search, you are likely intimately familiar with the long tail of queries.
This concept, popularized by Chris Anderson to describe digital content, says that a few items (in search, queries) are much more popular than everything else, but that there are lots of individual items that are still wanted by someone.
So it is with search.
A few queries (also called “head” queries) are each searched a lot, but the great majority of queries are searched very little – maybe even just a single time.
Numbers will differ from site to site, but on an average site, about a third of total searches may come from just a few dozen queries, while nearly half of search volume comes from queries that are outside the 1,000 most popular.
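To see where your own site falls, you can measure this split from a query log. Here is a minimal Python sketch, assuming the log is simply a list of query strings. The function name head_share and the sample log are made up for illustration.

```python
from collections import Counter

def head_share(query_log, head_size):
    """Return the fraction of all searches that come from the
    `head_size` most popular distinct queries."""
    if not query_log:
        return 0.0
    counts = Counter(query_log)
    head_total = sum(count for _, count in counts.most_common(head_size))
    return head_total / len(query_log)

# Hypothetical log: two popular queries and eight queries seen only once.
query_log = (
    ["dress"] * 5
    + ["shoes"] * 3
    + ["mauve dress", "wristlet", "red scarf", "linen belt",
       "beach wedding outfit", "wool socks", "silk tie", "rain hat"]
)

print(head_share(query_log, head_size=2))  # 0.5
```

In this toy log, the two head queries account for half of all searches, and the other half is spread across queries that each appear once.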
Long tail queries tend to be longer, and they might even be natural language queries.
Research from my company Algolia showed that 75% of queries are two or fewer words. 90% of queries are four or fewer words. Then, to get to 99% of queries, you need 13 words!
However, long tail queries aren’t always long. They could just be obscure. For a women’s fashion website, “mauve dress” could be a long tail query because people don’t ask for that color very often. “Wristlet” might likewise be a seldom-seen query, even if the website does have bracelets for sale.
Vector search generally works great for long tail queries. It can understand that wristlets are similar to bracelets, and surface the bracelets even without synonyms set up. It can show pink or purple dresses when someone searches for something in mauve.
Vector search can even work well for those long or natural language queries. “Something to keep my drinks cold” will bring up refrigerators in well-tuned vector search, whereas, with keyword search, you’d better hope that text is somewhere in a product description.
In other words, vector search increases the recall of search results, or how many of the relevant results are found.
How Vector Search Works
Vector search does this by taking those groups of numbers we described above and having the vector search engine ask, “If I were to graph these groups of numbers as lines, which would be closest together?”
An easy way to conceptualize this is to think of groups that have just two numbers. The group [1,2] is going to be closer to the group [2,2] than it would be to the group [2,500].
(Of course, since vectors have dozens or even hundreds of numbers within them, they are being “graphed” in that many dimensions, which isn’t so easy to visualize.)
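The two-number example can be checked with a few lines of Python. This is a sketch that assumes “closest” means straight-line (Euclidean) distance. Real engines may use a different measure, and the function name is only illustrative. The same function works for vectors of any length.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

near = euclidean_distance([1, 2], [2, 2])
far = euclidean_distance([1, 2], [2, 500])

print(near)  # 1.0
print(far)   # roughly 498
```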
This approach to determining similarity is powerful because the vectors representing words like “doctor” and “medicine” are going to be “graphed” much closer together than the vectors for “doctor” and “rock” would be.
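Here is a small sketch of that comparison. Everything in it is an assumption for illustration: the three-number vectors are invented rather than produced by a real model, and cosine similarity is just one common way to compare vectors. A real vector search engine would work with much longer vectors and far more records.

```python
import math

def cosine_similarity(a, b):
    """Similarity based on the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors, invented for this example only.
toy_vectors = {
    "doctor": [0.9, 0.8, 0.1],
    "medicine": [0.8, 0.9, 0.2],
    "rock": [0.1, 0.2, 0.9],
}

def rank_by_similarity(query_word, vectors):
    """Return (word, similarity) pairs for every other word, most similar first."""
    query = vectors[query_word]
    scored = [
        (word, cosine_similarity(query, vector))
        for word, vector in vectors.items()
        if word != query_word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_by_similarity("doctor", toy_vectors)
for word, score in ranking:
    print(word, round(score, 2))
```

With these made-up numbers, “medicine” ranks well ahead of “rock” for the query “doctor.”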
Downsides To Vector Search
However, there are downsides to vector search.
First is the cost. All of that machine learning that we discussed above? It has costs.
Storing the vectors is more expensive than storing a keyword-based search index, for one thing. Searching on those vectors is also slower than a keyword search in most cases.
Now, hashing can mitigate both of these problems.
Yes, we’re introducing more technical concepts, but this is another one whose basics are fairly simple to understand.
Hashing performs a series of steps to transform some piece of information (like a string or a number) into a number, which takes up less memory than the original information.
It turns out that we can also use hashing to reduce the sizes of vectors while still maintaining what makes vectors useful: their ability to match conceptually similar items.
Through using hashing, we can make vector searches much faster and have the vectors use less room overall.
The details are highly technical, but what’s important is understanding that it is possible.
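For the curious, here is a rough sketch of one family of techniques that fits this description: locality-sensitive hashing with random hyperplanes. This is an assumption about the kind of hashing involved, not a description of how any particular engine does it, and the function names and toy vectors are invented. Each vector is reduced to a handful of bits, and vectors that point in similar directions tend to end up with similar bits.

```python
import random

def make_hyperplanes(num_bits, dimensions, seed=0):
    """Create one random hyperplane (a list of numbers) per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dimensions)] for _ in range(num_bits)]

def hash_vector(vector, hyperplanes):
    """Pack one bit per hyperplane into an integer: the bit is 1 when the
    vector lies on the non-negative side of that hyperplane."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")

# Hypothetical vectors, invented for this example only.
hyperplanes = make_hyperplanes(num_bits=16, dimensions=3)
doctor = hash_vector([0.9, 0.8, 0.1], hyperplanes)
medicine = hash_vector([0.8, 0.9, 0.2], hyperplanes)
rock = hash_vector([0.1, 0.2, 0.9], hyperplanes)

print(hamming_distance(doctor, medicine))
print(hamming_distance(doctor, rock))
```

Comparing two small integers bit by bit is much cheaper than comparing two long lists of decimals, which is where the speed and storage savings come from. The tradeoff is that the hash is an approximation, so some precision in the comparison is lost.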
The Continued Usefulness Of Keyword Search
This doesn’t mean that keyword search isn’t still useful! Keyword search is generally faster than vector search.
Additionally, it is easier to understand why results are ranked the way they are.
Take the example of the query “texas,” with “tejano” and “state” as potential word matches. Clearly, “tejano” is closer if we look at the comparison from a pure keyword search perspective. It’s not so easy to tell, however, which would be closer from a vector search approach.
Keyword-based search understands “texas” as being more similar to “tejano” because it uses a text-based approach to finding records.
If records contain words that are exactly the same as what is in the query (or within a certain level of difference to account for typos), then the record is considered relevant and comes back in the result set.
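A minimal sketch of this kind of text-based matching is below. It assumes that “level of difference” is measured with edit distance (the number of single-character insertions, deletions, or substitutions needed to turn one word into another), which is a common choice but not the only one. The function names and the max_typos parameter are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]

def keyword_matches(query, record_text, max_typos=1):
    """True if any word in the record is within `max_typos` edits of the query."""
    query = query.lower()
    return any(
        edit_distance(query, word) <= max_typos
        for word in record_text.lower().split()
    )

print(edit_distance("texas", "tejano"))  # 3
print(edit_distance("texas", "state"))   # 5
print(keyword_matches("texas", "Cheap flights to Texaz"))  # True
```

By this measure, “tejano” is textually closer to “texas” than “state” is, even though a person would say Texas has more to do with states.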
In other words, keyword search focuses on the precision of search results, or ensuring that the records that come back are relevant, even if there are fewer of them.
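Precision and recall are both simple ratios. The sketch below computes them for two made-up result sets. The record IDs and the list of relevant records are invented for illustration, and in practice deciding which records are truly relevant is the hard part.

```python
def precision(returned, relevant):
    """Share of the returned records that are actually relevant."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0

def recall(returned, relevant):
    """Share of the relevant records that were actually returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0

# Hypothetical catalog: four bracelets are relevant to the query.
relevant = ["bracelet_1", "bracelet_2", "bracelet_3", "bracelet_4"]
keyword_results = ["bracelet_1", "bracelet_2"]
vector_results = ["bracelet_1", "bracelet_2", "bracelet_3", "bracelet_4", "necklace_1"]

print(precision(keyword_results, relevant), recall(keyword_results, relevant))  # 1.0 0.5
print(precision(vector_results, relevant), recall(vector_results, relevant))    # 0.8 1.0
```

In this toy case, the keyword results are all relevant but miss half of the bracelets, while the vector results find every bracelet at the cost of one off-target record.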
Keyword Search As Beneficial For Head Queries
For this reason, keyword search performs really well for head queries: those queries that are the most popular.
Head queries tend to be shorter, and they are also easier to optimize for. That means that if, for whatever reason, a keyword doesn’t match the right text inside a record, it’s often caught through analytics, and you can add a synonym.
Because keyword search works best for head queries and vector search works best for long tail queries, the two work best in concert.
This is known as hybrid search.
Hybrid search is when a search engine uses both keyword and vector search for a single query and ranks records correctly, no matter which search approach brought them about.
Ranking Records Across Search Sources
Ranking records that come from two different sources is not easy.
The two approaches have, by their very natures, different ways of scoring records.
Vector search will return a score, while some keyword-based engines won’t. Even if the keyword-based engines do return a score, there’s no guarantee that the two scores are equivalent.
If the scores aren’t equivalent, then you can’t say that a score of 0.8 from the keyword engine is more relevant than a score of 0.79 from the vector engine.
An alternative would be to run all of the results through the scoring of just one engine, either the vector engine or the keyword engine.
Scoring everything with the keyword engine has the benefit of getting the extra recall from the vector engine, but has some disadvantages as well. Those extra recalled results that come from the vector engine won’t be rated as relevant by a keyword score, or else they would have appeared in the result set already.
You could alternatively run all of the results – keyword or otherwise – through the vector scoring, but this is slow and expensive.
Vector Search As A Fallback
That’s why some search engines don’t even attempt to blend the two, but instead will always display keyword results first, and then vector results second.
The thinking here is that if a search returns zero or few results, then you can fall back to the vector results.
Remember, vector search is geared toward improving recall or finding more results, and so it may find relevant results that the keyword search did not.
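The fallback logic is simple enough to sketch in a few lines. This assumes each engine has already returned an ordered list of record IDs. The function name and the min_results threshold are hypothetical, not taken from any specific engine.

```python
def search_with_fallback(keyword_results, vector_results, min_results=3):
    """Return keyword results as-is when there are enough of them.
    Otherwise, append vector results that the keyword search did not find."""
    if len(keyword_results) >= min_results:
        return list(keyword_results)
    seen = set(keyword_results)
    extras = [record for record in vector_results if record not in seen]
    return list(keyword_results) + extras

# Hypothetical results for the query "wristlet".
print(search_with_fallback(["wristlet_1"], ["wristlet_1", "bracelet_1", "bracelet_2"]))
# ['wristlet_1', 'bracelet_1', 'bracelet_2']
```

Notice that the two lists are never truly blended: keyword results always come first, which is why this is only a stopgap.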
This is a decent stopgap but is not the future of true hybrid search.
True hybrid search will rank multiple different search sources in the same result set by creating a score that is comparable across different sources.
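One published technique for building such a score is reciprocal rank fusion. It sidesteps the problem of mismatched scores by ignoring each engine’s raw scores and using only each record’s position in each list. The sketch below is offered as an illustration of the idea, not as what any given engine does. The constant k=60 is a conventional default, and the record IDs are invented.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists into one. Each record earns 1 / (k + rank)
    from every list it appears in, and records are sorted by their total."""
    scores = {}
    for results in result_lists:
        for rank, record in enumerate(results, start=1):
            scores[record] = scores.get(record, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda record: scores[record], reverse=True)

# Hypothetical results for "something to keep my drinks cold".
keyword_results = ["fridge_a", "fridge_b"]
vector_results = ["cooler_c", "fridge_a", "fridge_b"]

print(reciprocal_rank_fusion([keyword_results, vector_results]))
# ['fridge_a', 'fridge_b', 'cooler_c']
```

Records that both engines agree on rise to the top, while records found by only one engine still make it into the same result set.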
There is much research into this approach today, but few are doing it well and making their engines publicly available.
So what does this mean for you?
Right now, the best thing you can do is probably to sit tight and stay up to date with what’s happening in the industry.
Vector and keyword-based hybrid search is coming in the next few years, and it will be available for people without data science teams.
In the meantime, keyword search is still valuable and will only be improved when vector search is brought in later.