Elasticsearch – Storage Architecture Using Inverted Indexes

Written in Java, Elasticsearch is a distributed search server built on the Apache Lucene library, and it was released as open source under the Apache license. It has ranked second in popularity to Apache Solr. It stores and searches data very differently from conventional relational database systems: storage relies on a process called analysis and on a data structure called the inverted index.

The Analysis Phase in Elasticsearch

The analysis process consists of tokenizing the document and filtering out unneeded formatting and words. Tokenizing splits the document text into small strings called tokens, which in most cases are individual words. Filtering strips out formatting such as HTML tags, capitalization, extra white space and special characters, and removes very common words such as articles, prepositions and pronouns (often called stop words). If indexed as-is, this material would add heavy overhead to storage, search and subsequent retrieval.
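The two steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Elasticsearch's own implementation: the function names, the regular expressions and the very short stop-word list are all assumptions made for the example.

```python
import re

# Assumed, deliberately tiny stop-word list; real analyzers ship longer ones.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "to", "and"}

def strip_html(text):
    """Filtering of formats: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizing: extract runs of letters and digits, ignoring white space and punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def analyze(text):
    """Run the whole chain: strip markup, tokenize, lowercase, drop stop words."""
    tokens = tokenize(strip_html(text))
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOP_WORDS]

print(analyze("<p>The Quick  brown fox is in the <b>Garden</b></p>"))
# prints ['quick', 'brown', 'fox', 'garden']
```

Only four of the eight words in the input survive, which is the saving in storage and search effort that the analysis phase is meant to deliver.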

When Elasticsearch receives a request to index a new document, it starts by converting the document into tokens. The tokens are then passed through the filters configured for that purpose. Every document that is to be indexed goes through this analysis process. When the document is indexed, a mapping (the default one, unless another has been defined) and the analyzer configured for the purpose are applied to it.
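As a rough illustration of how an analyzer and a mapping might be configured together, the Python dictionary below mirrors the kind of JSON body that can be sent when an index is created. The analyzer name "article_analyzer" and the field name "body" are hypothetical; "html_strip", "standard", "lowercase" and "stop" are names of built-in Elasticsearch components. Treat this as a sketch to check against the documentation for your version, not a definitive configuration.

```python
import json

# Hypothetical index definition: one custom analyzer and one text field that uses it.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "article_analyzer": {               # hypothetical analyzer name
                    "type": "custom",
                    "char_filter": ["html_strip"],  # remove HTML markup first
                    "tokenizer": "standard",        # split the text into tokens
                    "filter": ["lowercase", "stop"] # normalize case, drop stop words
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Hypothetical field; its text is analyzed with the analyzer above.
            "body": {"type": "text", "analyzer": "article_analyzer"}
        }
    },
}

# The JSON text below is what would travel in the body of the request.
print(json.dumps(index_body, indent=2))
```

Every document indexed into such an index would have its "body" field passed through the same chain of character filter, tokenizer and token filters.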

Storage and Search Using Inverted Indexes

The data structure used for storage and search depends on the application. For example, SQL Server uses a B-tree to store and search its indexes. A B-tree is a balanced tree in which each node can hold several keys and point to several child nodes; it should not be confused with a binary tree, where each node has at most two children, commonly called the left child and the right child. Elasticsearch instead uses an inverted index.

Once the analysis phase is complete, the document's data has been converted into tokens. Each unique term is given its own entry in the inverted index. Each entry holds the attributes associated with that term, such as the documents in which it occurs, the number of occurrences, and the position of each occurrence. This structure makes it possible to store the data efficiently, search it quickly and generate text analytics easily.
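A minimal in-memory sketch of such a structure follows. It assumes the documents have already been analyzed into token lists, and it keeps only document IDs and positions for each term; the function names and the sample documents are made up for the example, and Lucene's real on-disk format is far more elaborate.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for already-analyzed documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search(index, terms):
    """Return the sorted IDs of documents that contain every query term."""
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical documents, shown as the token lists the analysis phase would produce.
docs = {
    1: ["quick", "brown", "fox"],
    2: ["lazy", "brown", "dog"],
    3: ["quick", "dog", "quick"],
}

index = build_inverted_index(docs)
print(index["quick"])                   # prints {1: [0], 3: [0, 2]}
print(search(index, ["brown", "dog"]))  # prints [2]
```

Looking up a term is a single dictionary access, and the length of each position list gives the term's frequency in that document, which is the kind of text analytics the paragraph above refers to.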

After the tokens have been mapped into the index, the document is ready to be stored on disk. If you want, you may store both the original and the analyzed versions of the document. The original document is stored under the “_source” field. Depending on the user's choice, the input may also be stored without any analysis at all. The resulting data structure depends entirely on the analyzer chosen for indexing.
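The difference between keeping the original, indexing an analyzed field and indexing a field without analysis can be sketched as follows. The store layout, the function name and the sample document are hypothetical, and the "analysis" here is just lowercasing and splitting on white space; the sketch shows the idea, not how Elasticsearch actually lays data out.

```python
def index_document(store, doc_id, doc, analyzed_fields):
    """Keep the original under "_source" and index each field's terms."""
    store["_source"][doc_id] = dict(doc)      # untouched copy of the input
    for field, value in doc.items():
        if field in analyzed_fields:
            terms = value.lower().split()     # stand-in for a real analyzer
        else:
            terms = [value]                   # not analyzed: one exact term
        for term in terms:
            store["index"].setdefault((field, term), set()).add(doc_id)

store = {"_source": {}, "index": {}}
index_document(
    store,
    1,
    {"title": "Inverted Index Basics", "status": "Draft Copy"},
    analyzed_fields={"title"},
)

print(store["index"][("title", "inverted")])         # prints {1}
print(("status", "draft") in store["index"])         # prints False
print(store["index"][("status", "Draft Copy")])      # prints {1}
print(store["_source"][1]["title"])                  # prints Inverted Index Basics
```

The analyzed "title" field can be found by any of its lowercased words, while the unanalyzed "status" field matches only its exact original value; the copy under "_source" lets the original document be returned unchanged.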

The best method for storing, searching and retrieving data therefore depends on the analysis method chosen for indexing, which in turn depends on the needs of the application.