Elasticsearch has become the de facto "drop-in" search solution in many places. Ruby on Rails gems like Searchkick make it very easy to integrate search capabilities into your application, but that same ease also makes it easy to shoot yourself in the foot.
The objective of this series is to help you improve search relevance. Sure, Elasticsearch works really well out of the box when you have a small, controlled corpus, but anything more requires you to actually learn its capabilities and apply them properly.
Information Retrieval, Search, and Corpus
A search engine and an information retrieval (IR) system are different concepts. Section 3.1 of The Anatomy of a Large-Scale Hypertextual Web Search Engine (the original Google paper) explains how a traditional IR system becomes ineffective when applied to the web.
An IR system is useful for searching a controlled corpus: a catalogue of books, a list of courses in your company, legal documents, and so on. It is tailored to enterprise search, where the corpus is smaller and controlled, and every piece of content is relevant. Searching for a specific name in legal documents, for example, works very well with an IR system.
A search engine faces a different problem. Not all documents on the web are relevant. There is no control over the corpus, since anyone can come along and build a website. Finally, there are a lot of duplicates: for any popular keyword, there could be thousands of articles and videos covering the same material, each slightly differently. While IR focuses on the corpus, search engines like Google had to leverage a social factor (PageRank) early on to distinguish between the quality of textually similar content. Far more advanced, and less publicly known, algorithms and factors are in use today.
Strategic ways to improve relevancy
There are two major ways of tuning relevancy. One is to tune the corpus by:
- Pruning the corpus of bad items (particularly relevant for web search); determining what counts as "bad" is itself a hard problem
- Enriching the corpus with useful metadata that can be later used while querying
Another is to tune the query:
- Adding boosts to certain fields
- Ignoring or embracing stopwords
- Using synonyms
- Blacklisting certain words
- Learning to rank: using machine learning to re-rank the top N documents
Usually, you'll find yourself doing both.
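As a concrete sketch of the query-tuning side, here is what field boosts and a stricter term operator look like in the Elasticsearch query DSL, built as a plain Ruby hash. The field names (`title`, `description`) and the 3x boost are assumptions for illustration, not a prescription:

```ruby
require "json"

# Build a multi_match query where title matches count 3x as much as
# description matches. Field names and boost values are hypothetical.
def boosted_query(term)
  {
    query: {
      multi_match: {
        query: term,
        # "^3" boosts the title field's contribution to the score 3x.
        fields: ["title^3", "description"],
        # "and" requires every term to match, narrowing the result set;
        # the default "or" would embrace partial matches instead.
        operator: "and"
      }
    }
  }
end

puts JSON.pretty_generate(boosted_query("brushless servo"))
```

With Searchkick you'd typically express the same intent through its options (e.g. `fields:` with boosts) rather than raw DSL, but seeing the generated query makes the trade-offs explicit.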
Adding metadata to corpus
Let's say you have a huge database of machine parts. Different departments search this database, but not all parts are relevant to each of them. To the electrical department, capacitors, resistors, etc., would be more relevant. To the mechanical department, it would be things like servos.
The strategic way to improve relevancy here is to add more metadata to your corpus. You could tag each item with the departments it is relevant to, and at query time every person gets results scoped to their department, which makes the results feel more relevant to them.
You could also auto-complete commonly used parts.
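One way to use that metadata at query time is a `bool` query that matches on the search term but filters on the department tag. This is a minimal sketch; the `department_tags` field and the index schema are assumptions:

```ruby
require "json"

# Scope a search to one department using metadata added at index time.
# "department_tags" is a hypothetical field tagged onto each part.
def department_scoped_query(term, department)
  {
    query: {
      bool: {
        # "must" clauses contribute to the relevance score...
        must: { match: { name: term } },
        # ...while "filter" clauses only restrict the result set,
        # so the department tag never skews scoring (and is cacheable).
        filter: { term: { department_tags: department } }
      }
    }
  }
end

puts JSON.pretty_generate(department_scoped_query("capacitor", "electrical"))
```

Putting the tag in a `filter` rather than a `must` is deliberate: filters don't affect scoring, so textual relevance within the department stays intact.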
Controlling the corpus
If your corpus comes from all over the web (as with a search engine), then you need a lot of signals (social or otherwise) to decide which content to admit into your corpus at all. People really don't want 100k results for their query; they just want the few most relevant to them.
A lot of times, search is less about search and more about improving or pruning the corpus through different pipelines.
Improving the Elasticsearch query
You'll need an understanding of the basic algorithms Elasticsearch uses to rank documents, not just retrieve them (by default, Lucene's BM25 scoring). This will allow you to understand why:
- Certain documents don't get returned
- Bad documents get ranked above good ones
- Search results are not specific enough
- Or search results are just too narrow
Here, we're not talking about relevance as a general concept. We're talking about specific implementation details within Elasticsearch/Lucene that cause results to appear in a certain order. Often, those defaults won't work in your favor. We'll see how you can tune them.
We'll go into detail on each of the techniques above. For now, I wanted to give you a brief overview of the topics we'll be touching. A few extra techniques may come up as I continue writing this series, and I'll update this page to reflect that for future readers.
We'll also be looking at repeatable ways to construct small experiments to understand what Elasticsearch is doing.
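One such experiment is asking Elasticsearch's explain API why a particular document scored the way it did for a query: it returns the full scoring breakdown (term frequencies, field norms, boosts). The sketch below only constructs the request; the index name, document id, and `title` field are hypothetical:

```ruby
require "json"

# Build a request against the _explain endpoint (GET /<index>/_explain/<id>),
# which reports exactly how a query scored one document.
# Index, id, and field names here are assumptions for illustration.
def explain_request(index, doc_id, term)
  {
    method: "GET",
    path: "/#{index}/_explain/#{doc_id}",
    body: { query: { match: { title: term } } }
  }
end

req = explain_request("parts", "42", "servo")
puts "#{req[:method]} #{req[:path]}"
puts JSON.generate(req[:body])
```

Running the same explain request before and after a mapping or boost change gives you a small, repeatable diff of what actually changed in the scoring.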