Choosing the right indexing strategy for your Amazon OpenSearch Service clusters helps you deliver low-latency, accurate results while maintaining efficiency. If your access patterns require complex queries, it's best to re-evaluate your indexing strategy.
In this post, we demonstrate how to create a custom index analyzer in OpenSearch to implement autocomplete functionality efficiently by using the Edge n-gram tokenizer to match prefix queries without using wildcards.
What’s an index analyzer?
Index analyzers are used to analyze text fields during ingestion of a document. The analyzer outputs the terms you can use to match queries.
By default, OpenSearch indexes your data using the standard index analyzer. The standard index analyzer splits tokens on spaces, converts tokens to lowercase, and removes most punctuation. For some use cases (like log analytics), the standard index analyzer may be all you need.
Standard index analyzer
Let’s take a look at what the standard index analyzer does. We’ll use the _analyze API to test how the standard index analyzer tokenizes the sentence “Standard Index Analyzer.”
Note: You can run all of the commands in this post using OpenSearch Dev Tools in OpenSearch Dashboards.
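The exact request from the original isn't reproduced here; a minimal sketch of the _analyze call looks like this:

```
GET _analyze
{
  "analyzer": "standard",
  "text": "Standard Index Analyzer."
}
```

The response contains the tokens standard, index, and analyzer.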
Notice how each word was lowercased and the period (punctuation) was removed.
Creating your own index analyzer
OpenSearch offers a number of built-in analyzers that you can use for different access patterns. It also lets you build your own custom analyzer, configured for your specific search needs. In the following example, we're going to configure a custom analyzer that returns partial word matches for a list of addresses. The analyzer is specifically designed for autocomplete functionality, enabling end users to quickly find addresses without having to type out (or remember) an entire address. Autocomplete allows OpenSearch to effectively complete the search term based off matched prefixes.
First, create an index called standard_index_test:
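The original mapping isn't shown here; a minimal sketch, assuming a single text field named address, could look like the following:

```
PUT standard_index_test
{
  "mappings": {
    "properties": {
      "address": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}
```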
Specifying the analyzer as standard is not required because the standard analyzer is the default analyzer.
To test, bulk upload some data to the standard_index_test index that we created.
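For example, a _bulk request along these lines; the sample addresses other than 456 OpenSearch Drive are placeholders:

```
POST _bulk
{ "index": { "_index": "standard_index_test", "_id": "1" } }
{ "address": "123 Main Street Anytown, NY 12345" }
{ "index": { "_index": "standard_index_test", "_id": "2" } }
{ "address": "456 OpenSearch Drive Anytown, NY 78910" }
```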
Query this data using the text "ope".
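For example, with a match query against the assumed address field:

```
GET standard_index_test/_search
{
  "query": {
    "match": {
      "address": "ope"
    }
  }
}
```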
When searching for the term "ope", we don't get any matches. To see why, we can dive a little deeper into the standard index analyzer and see how our text is being tokenized. Test the standard index analyzer with the address "456 OpenSearch Drive Anytown, NY 78910".
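A sketch of the _analyze request:

```
GET _analyze
{
  "analyzer": "standard",
  "text": "456 OpenSearch Drive Anytown, NY 78910"
}
```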
The standard index analyzer has tokenized the address into individual words: 456, opensearch, drive, and so on. This means that unless you search for an individual token (like 456 or opensearch), o, op, ope, or even open won't yield any results. One option is to use wildcards while still using the standard index analyzer for indexing:
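For example, a wildcard query for ope* (again assuming the address field):

```
GET standard_index_test/_search
{
  "query": {
    "wildcard": {
      "address": {
        "value": "ope*"
      }
    }
  }
}
```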
The wildcard query would match "456 OpenSearch Drive Anytown, NY 78910", but wildcard queries can be resource intensive and slow. Querying for ope* in OpenSearch results in iterating over every term in the index, bypassing the optimizations of inverted index lookups. This results in higher memory usage and slower performance. To improve the performance of our query execution and search experience, we can use an index analyzer that better suits our access patterns.
Edge n-gram
The Edge n-gram tokenizer helps you find partial matches and avoids the use of wildcards by tokenizing prefixes of a single word. For example, the input word coffee is expanded into all its prefixes: c, co, cof, and so on. It can limit the prefixes to those between a minimum (min_gram) and maximum (max_gram) length. So with min_gram=3 and max_gram=5, it expands "coffee" to cof, coff, and coffe.
Create a new index called custom_index with our own custom index analyzer that uses Edge n-grams. Set the minimum token length (min_gram) to 3 characters, and the maximum token length (max_gram) to 20 characters. The min_gram and max_gram set the minimum and maximum returned token length, respectively. You should pick the min_gram and max_gram based off your access patterns. In this example, we're searching for the term "ope", so we don't need to set the minimum length to anything less than 3 because we're not searching for terms like o or op. Setting the min_gram too low can lead to high latency. Likewise, we don't need to set the maximum length to anything greater than 20 because no individual token will exceed a length of 20. Setting the maximum length to 20 gives us room to spare in case we do eventually ingest an address with a longer token. Note, the index we're creating here is specifically for autocomplete functionality and is likely unnecessary for a standard search index.
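The original index definition isn't reproduced here; a sketch that matches the description that follows, assuming the address field and a filter named autocomplete_filter, could look like this:

```
PUT custom_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "address": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
```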
In the preceding code, we created an index called custom_index with a custom analyzer named autocomplete. The analyzer performs the following:
- It uses the standard tokenizer to split text into tokens
- A lowercase filter is applied to lowercase all of the tokens
- The tokens are then further broken into smaller chunks based off the minimum and maximum values of the edge_ngram filter
The search analyzer is configured to use the standard analyzer to reduce the query processing required at search time. We have already applied our custom analyzer to split the text for us upon ingestion, and we don't need to repeat this process when searching. Test how the custom analyzer analyzes the text Lexington Avenue:
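For example:

```
GET custom_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "Lexington Avenue"
}
```

The response includes prefix tokens such as lex, lexi, and so on up to lexington, and ave through avenue.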
Notice how the tokens are lowercased and now support partial matches. Now that we've seen how our analyzer tokenizes our text, bulk upload some data:
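For example, reusing the sample addresses and adding a Lexington Avenue entry (the specific values are placeholders):

```
POST _bulk
{ "index": { "_index": "custom_index", "_id": "1" } }
{ "address": "123 Main Street Anytown, NY 12345" }
{ "index": { "_index": "custom_index", "_id": "2" } }
{ "address": "456 OpenSearch Drive Anytown, NY 78910" }
{ "index": { "_index": "custom_index", "_id": "3" } }
{ "address": "789 Lexington Avenue Anytown, NY 45678" }
```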
And test!
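For example, the same match query as before, now against custom_index:

```
GET custom_index/_search
{
  "query": {
    "match": {
      "address": "ope"
    }
  }
}
```

This time the search for ope returns the 456 OpenSearch Drive document, because the indexed edge n-grams include the token ope.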
You have now configured a custom n-gram analyzer to find partial word matches within our list of addresses.
Note, there is a tradeoff between using non-standard index analyzers and writing compute-intensive queries. Analyzers can affect indexing throughput and increase the overall index size, especially if used inefficiently. For example, when creating the custom_index, the search analyzer was set to use the standard analyzer. Using n-grams for analysis upon both ingestion and search would have impacted cluster performance unnecessarily. Additionally, we set the min_gram and max_gram to values that matched our access patterns, ensuring we didn't create more n-grams than we needed for our search use case. This allowed us to gain the benefits of optimizing search without impacting our ingestion throughput.
Conclusion
In this post, we modified how OpenSearch indexed our data to simplify and speed up autocomplete queries. In our case, using Edge n-grams allowed OpenSearch to match parts of an address and yield precise results without compromising cluster performance with a wildcard query.
It's always important to test your cluster before deploying to a production environment. Understanding your access patterns is essential to optimizing your cluster from both an indexing and a searching perspective. Use the guidelines in this post as a starting point. Verify your access patterns before creating an index, then begin experimenting with different index analyzers in a test environment to see how they can simplify your queries and improve overall cluster performance. For more reading on general OpenSearch cluster optimization strategies, refer to the Get started with Amazon OpenSearch Service: T-shirt-size your domain post.
About the authors

