How to Use the Synonyms Feature Correctly in Elasticsearch | by Lynn Kwong | Jan, 2023


Image by Tumisu on Pixabay

Synonyms are used to improve search quality and broaden the scope of what’s considered a match. For example, a user searching for “England” might expect to find documents that contain “British” or “UK” as well, even though these three words are completely different.

The synonyms feature in Elasticsearch is very powerful and can make your search engine more robust if implemented correctly. In this post, we will introduce the essentials of implementing the synonyms feature in practice with simple code snippets. In particular, we will introduce how to update synonyms for existing indexes, which is a relatively advanced topic.

Preparation

We will start an Elasticsearch server locally with Docker and use Kibana to manage the indexes and run the commands. If you have never worked with Elasticsearch before or want a quick refresher, this post may be helpful. And if you encounter issues running Elasticsearch in Docker, this post will very likely help you out.

When you are ready, let’s start our journey to explore the synonyms feature in Elasticsearch.

The docker-compose.yaml file we will use in this post has the following content, to which we will add more features later:

version: "3.9"
services:
  elasticsearch:
    image: elasticsearch:8.5.3
    environment:
      - discovery.type=single-node
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
      - xpack.security.enabled=false
    volumes:
      - type: volume
        source: es_data
        target: /usr/share/elasticsearch/data
    ports:
      - target: 9200
        published: 9200
    networks:
      - elastic

  kibana:
    image: kibana:8.5.3
    ports:
      - target: 5601
        published: 5601
    depends_on:
      - elasticsearch
    networks:
      - elastic

volumes:
  es_data:
    driver: local

networks:
  elastic:
    name: elastic
    driver: bridge

Download this file or create a new one named docker-compose.yaml and paste the content above into it. Then you can start Elasticsearch and Kibana with one of the following commands:

# In the same folder where docker-compose.yaml is located (Recommended).
docker-compose up -d

# If you are in a different folder or name the YAML file differently,
# you would need to specify the path or the name, for example:
docker-compose -f ~/Downloads/docker-compose.yaml up -d
docker-compose -f docker-compose-elasticsearch up -d

Use the standard synonym token filter with a list of synonyms

Let’s first create an index using the standard synonym token filter with a list of synonyms. Run the following command in Kibana, and we will explain the details shortly:

PUT /inventory_synonym
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "index_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonym_filter"
            ]
          }
        },
        "filter": {
          "synonym_filter": {
            "type": "synonym",
            "synonyms": [
              "PS => PlayStation",
              "Play Station => PlayStation"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "index_analyzer"
      }
    }
  }
}

Key points here:

  1. Note the nested levels of the keys for the settings. settings => index => analysis => analyzer/filter are all built-in keywords. However, index_analyzer and synonym_filter are custom names for the custom analyzer and filter, respectively.
  2. We need to create a custom filter with the type set to synonym. A list of synonyms is provided explicitly with the synonyms option. This should normally be used for testing only, because it’s not convenient to update the synonym list, as we will see later.
  3. Solr synonyms are used in this post. For this example, explicit mappings are used, which means the token on the left-hand side of => is replaced with the one on the right-hand side. We will use equivalent synonyms later, which means the tokens provided are treated as equivalent.
  4. The synonym_filter is added to the filter list of a new custom analyzer named index_analyzer. Normally the order of the filters matters. The synonym filter, however, is a bit special and may surprise many of us. In this example, even though synonym_filter is put after the lowercase filter, the synonym rules themselves are also passed through the lowercase filter, so the tokens it returns are lowercased as well. Therefore, you don’t need to provide lowercase tokens in the synonym list or in the synonym file.
  5. Finally, in the mappings for the document, the custom analyzer is specified for the title field.
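
To build some intuition for explicit mappings, here is a toy model in Python. It is only a sketch under simplifying assumptions and not how Elasticsearch implements the filter: the names toy_analyze and parse_explicit_rules are made up for illustration, tokenization is a plain whitespace split, and the rules are normalized with the same lowercase step as the text.

```python
def parse_explicit_rules(rules):
    """Turn 'A B => C' rules into {('a', 'b'): ['c']}, lowercasing both sides."""
    mapping = {}
    for rule in rules:
        left, right = rule.split("=>")
        mapping[tuple(left.lower().split())] = right.lower().split()
    return mapping


def toy_analyze(text, rules):
    """Whitespace tokenizer -> lowercase -> explicit synonym replacement."""
    mapping = parse_explicit_rules(rules)
    tokens = text.lower().split()
    longest = max((len(key) for key in mapping), default=1)
    output, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so multi-word rules win.
        for size in range(min(longest, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in mapping:
                output.extend(mapping[candidate])
                i += size
                break
        else:
            # No rule matched at this position: keep the token unchanged.
            output.append(tokens[i])
            i += 1
    return output


RULES = ["PS => PlayStation", "Play Station => PlayStation"]
for title in ["PS 3", "PlayStation 4", "Play Station 5"]:
    print(title, "->", toy_analyze(title, RULES))
```

In this toy model all three titles end up with the same first token, which is why a single analyzer with explicit mappings can match them all.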

To test the analyzer created in the index, we can call the _analyze endpoint:

GET /inventory_synonym/_analyze
{
  "analyzer": "index_analyzer",
  "text": "PS 3"
}

We can see that the token for “PS” is replaced with the synonym specified, and in lowercase:

{
  "tokens": [
    {
      "token": "playstation",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "3",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<NUM>",
      "position": 1
    }
  ]
}

Let’s add some documents to the index and test if it works properly in searching:

PUT /inventory_synonym/_doc/1
{
  "title": "PS 3"
}

PUT /inventory_synonym/_doc/2
{
  "title": "PlayStation 4"
}

PUT /inventory_synonym/_doc/3
{
  "title": "Play Station 5"
}

We can perform a simple search with the match query:

GET /inventory_synonym/_search
{
  "query": {
    "match": {
      "title": "PS"
    }
  }
}

If nothing goes wrong, all three documents should be returned with the same score.

Index-time vs search-time synonyms

As you can see, in the above example only one analyzer is created, and it’s used for both indexing and searching.

Applying synonyms to all documents during the indexing step is discouraged because it has some major disadvantages:

  • The synonym list can’t be updated without reindexing everything, which is very inefficient in practice.
  • The search score can be impacted because synonym tokens are counted as well.
  • The indexing process becomes more time-consuming and the indexes get bigger. This is negligible for small data sets but can be very significant for big ones.

Therefore, it’s better to apply synonyms only in the search step, which overcomes all three disadvantages. To do this, we need to create a new analyzer for searching.

Use search_analyzer and apply search-time synonyms

Run the following command in Kibana to create a new index with search-time synonyms:

PUT /inventory_synonym_graph
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "index_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonym_filter"
            ]
          }
        },
        "filter": {
          "synonym_filter": {
            "type": "synonym_graph",
            "synonyms": [
              "PS => PlayStation",
              "Play Station => PlayStation"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "index_analyzer",
        "search_analyzer": "search_analyzer"
      }
    }
  }
}

Key points:

  • The type is now changed to synonym_graph, which is a more sophisticated synonym filter and is designed to be used as part of a search analyzer only. It handles multi-word synonyms more properly and is recommended for search-time analysis. However, you can continue to use the original synonym type and it’ll behave the same in this post.
  • The synonym filter is removed from the index-time analyzer and added to the search-time one.
  • The search_analyzer is specified for the title field explicitly. If it’s not specified, the same analyzer (index_analyzer) will be used for both indexing and searching.

The analyzer should return the same tokens as before. However, after you have indexed the three documents with these commands and performed the same search again, the results will be different:

GET /inventory_synonym_graph/_search
{
  "query": {
    "match": {
      "title": "PS"
    }
  }
}

This time only “PlayStation 4” is returned. Even “PS 3” is not returned!

The reason is that the synonym filter is only applied at search time. The search query “PS” is replaced with the synonym token “playstation”. However, the documents in the index were not processed by the synonym filter, so “PS” was just tokenized as “ps” and not replaced with “playstation”. The same goes for “Play Station”. As a result, only “PlayStation 4” can be matched.

To make it work properly as in the previous example, we need to change the synonym rule from explicit mappings to equivalent synonyms. Let’s update the synonym filter as follows:

......
"filter": {
  "synonym_filter": {
    "type": "synonym_graph",
    "synonyms": [
      "PS, PlayStation, Play Station"
    ]
  }
}
......

To change the synonyms of an existing index, we can recreate the index and reindex all the documents, which is silly and inefficient.

A better way is to update the settings of the index. However, we need to close the index before the settings can be updated, and then re-open it so it can be accessed:


POST /inventory_synonym_graph/_close

PUT /inventory_synonym_graph/_settings
{
  "settings": {
    "index.analysis.filter.synonym_filter.synonyms": [
      "PS, PlayStation, Play Station"
    ]
  }
}

POST /inventory_synonym_graph/_open

Note the special syntax for updating the settings of an index.
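
If you want to script the close, update, and open sequence, one possible shape is sketched below in Python. It is an illustration rather than an official client: update_synonyms is a made-up helper, and send(method, path, body) is a hypothetical callable that you would back with your own HTTP client. The sketch reopens the index even if the settings update fails, so the index is not left closed.

```python
import json


def update_synonyms(send, index, filter_name, synonyms):
    """Close the index, replace the filter's synonyms, then reopen it.

    `send(method, path, body)` is a caller-supplied function that performs
    the HTTP request (hypothetical; plug in your own client).
    """
    setting_key = f"index.analysis.filter.{filter_name}.synonyms"
    body = json.dumps({"settings": {setting_key: synonyms}})
    send("POST", f"/{index}/_close", None)
    try:
        send("PUT", f"/{index}/_settings", body)
    finally:
        # Reopen even if the update failed, so the index is not left closed.
        send("POST", f"/{index}/_open", None)


# Dry run: record the requests instead of sending them.
calls = []
update_synonyms(
    lambda method, path, body: calls.append((method, path, body)),
    "inventory_synonym_graph",
    "synonym_filter",
    ["PS, PlayStation, Play Station"],
)
for call in calls:
    print(call)
```

The dry run only prints the three requests in order, which makes it easy to compare them with the Kibana commands above before wiring in a real client.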

After the above commands are run, let’s test the search_analyzer with the _analyze endpoint and see the tokens generated:

GET /inventory_synonym_graph/_analyze
{
  "analyzer": "search_analyzer",
  "text": "PS 3"
}

And this is the result:

{
  "tokens": [
    {
      "token": "playstation",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "play",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "ps",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "station",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "3",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<NUM>",
      "position": 2
    }
  ]
}

It shows that the “PS” search query is replaced and expanded with the tokens of the three synonyms (which is controlled by the expand option). It also shows that if equivalent synonyms are applied at index time, the size of the resulting index can increase quite significantly.

Then when we perform the same search again:

GET /inventory_synonym_graph/_search
{
  "query": {
    "match": {
      "title": "PS"
    }
  }
}

All three documents will be returned.
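
The difference between the two rule styles at search time can be illustrated with a small Python toy model. This is a sketch for intuition only and does not reproduce Elasticsearch’s query execution or scoring: the indexed tokens are written out by hand (lowercase only, no synonyms), and the helper names contains_phrase and search are made up.

```python
# Tokens stored in the index: lowercase only, no synonyms at index time.
INDEXED = {
    "PS 3": ["ps", "3"],
    "PlayStation 4": ["playstation", "4"],
    "Play Station 5": ["play", "station", "5"],
}


def contains_phrase(tokens, phrase):
    """True if `phrase` occurs as consecutive tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def search(alternatives):
    """Return titles matching any alternative (each a list of consecutive tokens)."""
    return [
        title
        for title, tokens in INDEXED.items()
        if any(contains_phrase(tokens, alt) for alt in alternatives)
    ]


# "PS => PlayStation": the query term is replaced by a single token.
explicit = search([["playstation"]])
# "PS, PlayStation, Play Station": the query term expands to all three forms.
equivalent = search([["ps"], ["playstation"], ["play", "station"]])

print(explicit)
print(equivalent)
```

With the explicit mapping only the title that literally contains “playstation” is found, while the equivalent rule lets the query reach every spelling that was indexed.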

Use a synonym file

Above we have been specifying the synonym list directly when the index is created. However, when you have a large number of synonyms, it becomes cumbersome to add all of them to the index. A better way is to store them in a file and load them into the index dynamically. There are many benefits of using a synonym file, which include:

  • It is convenient to maintain a large number of synonyms.
  • It can be used by different indexes.
  • It can be reloaded dynamically without closing the index.

To get started, we need to first put the synonyms in a file. Each line is a synonym rule, in the same format as demonstrated above. More details can be found in the official documentation.

The synonym file we will create is called synonyms.txt, but it can be called anything. It has the following content:

# This is a comment! The file is named synonyms.txt.
PS, PlayStation, Play Station
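
To make the file format concrete, here is a simplified Python sketch that reads such lines and classifies each rule. It is not the parser Elasticsearch uses, and it ignores edge cases such as escaped commas. The function name parse_synonym_lines is made up, and the sample adds one explicit rule (JS => JavaScript) so that both rule styles appear.

```python
def parse_synonym_lines(lines):
    """Parse Solr-style rules into (kind, left, right) tuples.

    kind is "explicit" for 'a => b' rules and "equivalent" for 'a, b, c'.
    Simplified: does not handle escaped commas or other edge cases.
    """
    rules = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            left, right = line.split("=>", 1)
            rules.append((
                "explicit",
                [term.strip() for term in left.split(",")],
                [term.strip() for term in right.split(",")],
            ))
        else:
            terms = [term.strip() for term in line.split(",")]
            rules.append(("equivalent", terms, terms))
    return rules


sample = """\
# This is a comment! The file is named synonyms.txt.
PS, PlayStation, Play Station
JS => JavaScript
"""
for rule in parse_synonym_lines(sample.splitlines()):
    print(rule)
```

A check like this can be handy for catching malformed lines before the file is mounted into the container.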

Then we need to bind the synonym file to the Docker container. Update docker-compose.yaml as follows:

......
    volumes:
      - type: volume
        source: es_data
        target: /usr/share/elasticsearch/data
      - type: bind
        source: ./synonyms.txt
        target: /usr/share/elasticsearch/config/synonyms.txt
......

Note that the synonym file is loaded into the config folder in the container. You can get into the container and check it with one of these two commands:

# Use docker (the container name depends on your project folder name)
docker exec -it synonyms-elasticsearch-1 bash

# Use docker-compose
docker-compose exec elasticsearch bash

Now we need to stop and start the service again to make the changes take effect. Note that just restarting the service with docker-compose restart won’t work, because it does not pick up changes made to docker-compose.yaml.

docker-compose stop elasticsearch
docker-compose up -d elasticsearch

We can then create a new index using the synonym file:

PUT /inventory_synonym_graph_file
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "index_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonym_filter"
            ]
          }
        },
        "filter": {
          "synonym_filter": {
            "type": "synonym_graph",
            "synonyms_path": "synonyms.txt",
            "updateable": true
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "index_analyzer",
        "search_analyzer": "search_analyzer"
      }
    }
  }
}

Key points:

  • synonyms_path is the path of the synonyms file relative to the config folder on the Elasticsearch server.
  • A new updateable field is added, which specifies whether the corresponding filter is updateable. We will soon see how to reload a search analyzer without closing and opening an index.

The behavior of this new index, inventory_synonym_graph_file, should be the same as that of the previous one, inventory_synonym_graph.

Now let’s add more synonyms to the synonym file, which will then have the following content:

# This is a comment! The file is named synonyms.txt.
PS, Play Station, PlayStation
JS => JavaScript
TS => TypeScript
Py => Python

When the synonyms have been added, we can close and open the index to make them effective. However, since we marked the synonym filter as updateable, we can reload the search analyzer to make the changes effective immediately without closing the index, and thus with no downtime.

To reload the search analyzers of an index, we need to call the _reload_search_analyzers endpoint:

POST /inventory_synonym_graph_file/_reload_search_analyzers

Now when we analyze the “JS” string, we will see the “javascript” token returned:

GET /inventory_synonym_graph_file/_analyze
{
  "analyzer": "search_analyzer",
  "text": "JS"
}

// You will see the "javascript" token returned.
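
The reload and the follow-up check can also be scripted outside Kibana. The Python sketch below uses only the standard library and makes a few assumptions: Elasticsearch is reachable at http://localhost:9200 (the port published in docker-compose.yaml) with security disabled, and the helper names are made up. The _analyze request is sent with POST, which the endpoint accepts as well as GET. Nothing is sent until you call main() yourself.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: port published in docker-compose.yaml


def reload_request(index):
    """Build the POST request that reloads an index's search analyzers."""
    return urllib.request.Request(
        f"{BASE_URL}/{index}/_reload_search_analyzers", method="POST"
    )


def analyze_request(index, analyzer, text):
    """Build the _analyze request for the given analyzer and text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(analyze_response):
    """Extract the token strings from a parsed _analyze response."""
    return [token["token"] for token in analyze_response["tokens"]]


def main():
    """Reload the analyzers, then print the tokens produced for "JS"."""
    index = "inventory_synonym_graph_file"
    urllib.request.urlopen(reload_request(index)).close()
    with urllib.request.urlopen(analyze_request(index, "search_analyzer", "JS")) as resp:
        print(token_texts(json.load(resp)))

# Call main() once the Docker services are up and synonyms.txt has been edited.
```

Keeping request construction separate from sending makes it possible to inspect the URLs and bodies without a running cluster.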

Two important things should be noted here:

  • If updateable is set to true for a synonym filter, then the corresponding analyzer can only be used as a search_analyzer and cannot be used for indexing, even if the type is synonym.
  • The updateable option can only be used when a synonym file is used with the synonyms_path option, and not when the synonyms are provided directly with the synonyms option.
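
These two rules can be checked before an index is created. The following Python sketch inspects an index body written as a dictionary and reports violations. It is a hypothetical helper (check_updateable_filters is not part of any Elasticsearch client) and it only covers the two rules above, assuming the settings => index => analysis layout used in this post.

```python
def check_updateable_filters(index_body):
    """Return a list of problems with updateable synonym filters.

    Checks the two rules above: an updateable filter must use synonyms_path,
    and analyzers that contain it must not be used as an index-time analyzer.
    """
    analysis = index_body["settings"]["index"]["analysis"]
    problems = []
    updateable = set()
    for name, spec in analysis.get("filter", {}).items():
        if spec.get("updateable"):
            updateable.add(name)
            if "synonyms_path" not in spec:
                problems.append(f"filter '{name}' is updateable but has no synonyms_path")

    # Analyzers that include an updateable filter may only be used for searching.
    search_only = {
        name
        for name, spec in analysis.get("analyzer", {}).items()
        if updateable & set(spec.get("filter", []))
    }
    for field, mapping in index_body.get("mappings", {}).get("properties", {}).items():
        if mapping.get("analyzer") in search_only:
            problems.append(
                f"field '{field}' indexes with search-only analyzer '{mapping['analyzer']}'"
            )
    return problems


body = {
    "settings": {"index": {"analysis": {
        "analyzer": {
            "index_analyzer": {"tokenizer": "standard", "filter": ["lowercase"]},
            "search_analyzer": {"tokenizer": "standard", "filter": ["lowercase", "synonym_filter"]},
        },
        "filter": {
            "synonym_filter": {
                "type": "synonym_graph",
                "synonyms_path": "synonyms.txt",
                "updateable": True,
            },
        },
    }}},
    "mappings": {"properties": {
        "title": {
            "type": "text",
            "analyzer": "index_analyzer",
            "search_analyzer": "search_analyzer",
        },
    }},
}
print(check_updateable_filters(body))
```

For the index definition used in this post the list of problems is empty. Pointing the title field’s analyzer at search_analyzer, or dropping synonyms_path, would each add an entry.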

Congratulations if you have reached here! We have covered all the essentials for using the synonyms feature in Elasticsearch.

We have introduced how to use synonyms in the index-time and search-time analysis steps, respectively. We have also shown how to provide synonym lists directly and how to provide them through a file. Last but not least, we have covered different ways to update the synonym list of an existing index. It’s recommended to reload the search analyzers of an index, as this brings no downtime to the service.
