Full-Text Search Improvement

October 17, 2018

Early in development we adopted Elasticsearch as our NoSQL document store and full-text search engine. Elastic is backed by Apache Lucene, which provides Elasticsearch's indexing and search functionality. The Elastic ecosystem makes it easy for developers to ingest and query documents containing a variety of data types without having to understand the underlying Lucene syntax. Design Fast recently updated its Elasticsearch implementation so that queries return more accurate results more quickly.

Elastic allows developers to define a dynamic or semi-structured data mapping for its indices, which are where documents are stored. During our refactoring we decided to define our semi-structured data mappings more explicitly. Previously, we let Elastic create field types such as pricing or external links dynamically (i.e., Elastic would choose whatever data type it deemed correct for a document's fields, and that choice was not always right). We learned that we could use disk space better and get more accurate results by defining specific data types for a document. For example, storing a pricing field as a scaled_float with a scaling factor of 100 reduces disk usage, because the value is stored internally as a long, while the field still behaves as a double when queried.
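To make the scaled_float example concrete, here is a minimal Python sketch of what an explicit mapping and the scaling arithmetic might look like. The field names (`price`, `datasheet_url`) are hypothetical and not taken from our actual index, and the mapping body uses the typeless layout of Elasticsearch 7 and later; on 6.x the `properties` block sits under a mapping type name. The `encode` and `decode` helpers only imitate the arithmetic that a scaling factor of 100 implies; they are not Elasticsearch code.

```python
import math

# Hypothetical explicit mapping; field names are illustrative assumptions.
mapping = {
    "mappings": {
        "dynamic": "strict",  # reject fields that are not declared below
        "properties": {
            "price": {"type": "scaled_float", "scaling_factor": 100},
            "datasheet_url": {"type": "keyword"},
        },
    }
}

SCALING_FACTOR = mapping["mappings"]["properties"]["price"]["scaling_factor"]


def encode(value: float, factor: int = SCALING_FACTOR) -> int:
    """Imitate index-time scaling: multiply by the factor, round to nearest integer."""
    return int(math.floor(value * factor + 0.5))


def decode(stored: int, factor: int = SCALING_FACTOR) -> float:
    """Imitate read-time scaling: divide the stored integer by the factor."""
    return stored / factor


stored_price = encode(19.99)        # kept as a whole number of cents
queried_price = decode(stored_price)
```

With a factor of 100, any price with at most two decimal places round-trips unchanged, which is why the field can be stored compactly without losing the precision we care about.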

As we defined more explicit fields, we realized that we could concatenate all of these fields together and search on that single field. This field is normalized (characters are lowercased and special characters are removed) and then tokenized with the ngram tokenizer. We chose the ngram tokenizer over the edge ngram tokenizer because our search statistics showed that most users search for keywords that include different segments of a part number, not just the beginning of the part number that identifies the part family. This was the single largest change in improving query time and accuracy. We were no longer building large query models with tens of different fields trying to match different search types. It also allowed Design Fast to turn off indexing for almost all other fields, which cut disk space usage and, because Elastic no longer had to build inverted indices for those fields, reduced ingestion time.
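The difference between the two tokenizers can be illustrated with a small pure-Python sketch. It imitates the behavior of the `ngram` and `edge_ngram` tokenizers rather than calling Elasticsearch, and the part number, the gram sizes, and the helper names are all assumptions chosen for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters, digits, and spaces."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())


def ngrams(token: str, n: int) -> set:
    """All substrings of length n, taken from every position in the token."""
    return {token[i:i + n] for i in range(len(token) - n + 1)}


def edge_ngrams(token: str, min_n: int, max_n: int) -> set:
    """Prefixes of the token only, from min_n up to max_n characters long."""
    return {token[:k] for k in range(min_n, min(max_n, len(token)) + 1)}


def matches(query: str, indexed: set, n: int) -> bool:
    """True when every n-gram of the normalized query is present in the index."""
    grams = ngrams(normalize(query), n)
    return bool(grams) and grams <= indexed


part = normalize("LM317-T/NOPB")          # hypothetical part number
ngram_index = ngrams(part, 3)
edge_index = edge_ngrams(part, 3, 10)

mid_segment_hit = matches("317-T", ngram_index, 3)   # segment from the middle
prefix_hit = "lm317" in edge_index                   # beginning of the part number
mid_segment_in_edge = "317" in edge_index            # middle segment as a prefix
```

In this sketch a query for a middle segment such as "317-T" is found in the ngram index, while the edge ngram index only contains prefixes such as "lm317", which mirrors why prefix-only tokenization did not fit how our users search.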

Design Fast has also moved from Amazon Web Services' managed Elasticsearch service to Elastic Cloud. The move lets us run the latest Elastic versions with X-Pack, which has given us a much more detailed view of our Elastic infrastructure. We now use it for security, monitoring, logging, profiling, and much more.

We are excited to start using the Elastic Cloud service for new tasks such as machine learning, and to keep profiling our search queries so that we can give our users the most accurate results quickly.

Posted in Code & Programming, Updates and tagged edge ngram, elasticsearch, fulltextsearch, ngram, tokenization, tokenizer