

In this article, I will talk about the search architecture of Ozon, a leading e-commerce marketplace. The key part of the story is O2 (oxygen), the search engine we built on top of Apache Lucene. We used some of the techniques from ElasticSearch and Solr but added a lot of things that make O2 the best match for our workloads. If you have ever thought of building a search engine, or your work is related to search technologies, then this is the right story to read. There is an original article, if you don’t mind reading the pre-history from 1998 to 2020. It covers our journey with MSSQL FTS, Sphinx and ElasticSearch. Here I start from 2020 and talk only about O2.

A brief intro on how search typically works

Again, it may be hidden inside a search engine (like ElasticSearch) or may be built as a top-level service that performs some complex business logic. The ranking is a very important part of search applications. It may not be a top priority if you run a local food delivery app and your typical search query returns just 5 items. But for some big players like search engines, job boards or e-commerce platforms, ranking the search results is a million-dollar problem. When your typical query returns 100K results or even more, you have a hard time trying to run complex sorting algorithms under tight time limits. Bad response time can hurt you even more than bad search relevancy, so be ready for a journey full of challenges and trade-offs.

The typical trick for the ranking problem is to use several ranking formulas. Let’s say your query returned 500K items. First, you pick your most naive but lightweight algorithm and sort all the results you found. Then you pick the best items from the first phase (as many as you can handle, let’s say 20K) and sort them with your next algorithm, which is more clever but more time-consuming. You can repeat this process as many times as you want, but in practice 2 or 3 sorting algorithms are enough. The last phase is usually a neural network that was trained to solve ranking problems.

That’s all you need to know for now, let’s move on!

Why do you even need to build your own search engine, man? Just use ElasticSearch and have fun

Ozon had different search architectures during its lifetime. The last one was based on an Elasticsearch cluster with dozens of nodes, shards, replicas and all of the stuff you’d expect from a high-load search engine. We were more or less happy with how things were going until we saw the growing backlog of tasks that required modifications to the internal parts of ES. These tasks took a lot of time, because you had to write business logic, then patch ES and sometimes even redeploy the cluster.

Eventually, it became clear that we were spending more time on infrastructure work than on writing business logic. We also saw some inefficiencies in how ranking, searching, and indexing were implemented, so at some point it was clear that we could do better. We set several goals before the development started.
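
The multi-phase ranking trick described above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the function name cascade_rank, the two toy scoring formulas and the document fields (sales, rating) are hypothetical and are not taken from O2 or ElasticSearch. A real last phase would typically call a trained model instead of a formula.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Tuple

Doc = Dict[str, float]           # hypothetical document: a bag of numeric features
Scorer = Callable[[Doc], float]  # higher score means more relevant


def cascade_rank(docs: Iterable[Doc], phases: List[Tuple[Scorer, int]]) -> List[Doc]:
    """Run the phases in order (cheapest first); each phase keeps its top `keep` docs."""
    candidates = list(docs)
    for scorer, keep in phases:
        # nlargest is equivalent to sorting by score (descending) and truncating to `keep`
        candidates = heapq.nlargest(keep, candidates, key=scorer)
    return candidates


def cheap_score(doc: Doc) -> float:
    # Phase 1 (assumed): a single precomputed feature, cheap enough for every match
    return doc["sales"]


def expensive_score(doc: Doc) -> float:
    # Phase 2 (assumed): stand-in for a costlier formula or a model call
    return doc["rating"] * 10 + doc["sales"] * 0.1


docs = [
    {"id": "a", "sales": 50, "rating": 2.0},
    {"id": "b", "sales": 40, "rating": 4.5},
    {"id": "c", "sales": 30, "rating": 5.0},
    {"id": "d", "sales": 5, "rating": 5.0},
]

# Keep 3 items after the cheap phase, then 2 after the expensive one
top = cascade_rank(docs, [(cheap_score, 3), (expensive_score, 2)])
print([doc["id"] for doc in top])  # prints ['c', 'b']
```

Note the trade-off this toy data exposes: item d would have come second under the expensive formula, but the cheap phase already cut it. The size of each phase (the 20K in the example from the text) is the knob that balances response time against the risk of losing good items early.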
