
Measuring string similarity in BigQuery using SQL | by Romain Granger | Jun, 2022


Use the Levenshtein distance to find similar or duplicated values, clean your data, and more!

Photo by Suraj Kardile on Unsplash

Using the Levenshtein distance method

This method can be used, among others (Soundex, LIKE statements, regular expressions), to perform string similarity or string matching in order to identify two elements (text, strings, inputs) that are similar but not identical.

This method can be used for a variety of applications, including identifying duplicates, handling misspelled user input, cleaning customer data, and so on.

The idea of the Levenshtein method is to compute the minimum edit distance between two strings.

But what is an “edit” distance? It is composed of three operations: insertions, substitutions, and deletions. These three operations are given equal weight by the algorithm, basically a cost of 1.

To help understand the mechanics, here are a few examples:

  • Bigquer → Bigquery: the letter y is added at the end, so the distance will be 1 (we made 1 insertion).
  • music → mujic: the letter j is a substitute for the letter s in music, so the distance is also 1 (we made 1 substitution).
  • french fries → frij: the letters french and the space are deleted, the letter e is replaced by j, and the letter s is deleted (we made 8 deletions and 1 substitution, for a distance of 9).

This method is particularly effective when comparing a full string to another full string (and performs less well when comparing keywords within a sentence or comparing a sentence against another sentence).

You can find further information about the algorithm on Wikipedia.

Now that we’ve covered the theory and applications, let’s look at how we’d apply it in SQL.

Creating a persistent function

You can define a UDF as either persistent or temporary. The main difference is that persistent functions can be reused and accessed by multiple queries (it’s like creating a dataset in your project), whereas temporary functions only work for the query you’re currently running in your editor.

Personally, I enjoy working with a dataset that I call utility, as it allows me to store tables and functions that I can reuse across other datasets and queries.
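As a quick illustration of the difference (the function, project, and dataset names below are placeholders, not from the original article):

```sql
-- Temporary UDF: only visible to the query or script that defines it
CREATE TEMP FUNCTION add_one(x INT64) AS (x + 1);
SELECT add_one(41) AS result;  -- 42

-- Persistent UDF: stored under a dataset and callable from any query
CREATE OR REPLACE FUNCTION `my_project.utility.add_one`(x INT64) AS (x + 1);
```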

Here is the JavaScript UDF we’ll create in BigQuery:

Function for the Levenshtein distance as a JavaScript UDF
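The embedded code snippet isn’t reproduced in this copy of the article; a minimal sketch of such a UDF, assuming the datastic project and utility dataset referenced later, could look like the following (a standard dynamic-programming Levenshtein distance, normalized to a 0 to 1 similarity as explained below):

```sql
CREATE OR REPLACE FUNCTION `datastic.utility.levenshtein`(source STRING, target STRING)
RETURNS FLOAT64
LANGUAGE js AS r"""
  if (source === null || target === null) return null;
  var m = source.length, n = target.length;
  if (m === 0 && n === 0) return 1.0;

  // Dynamic-programming edit-distance matrix
  var d = [];
  for (var i = 0; i <= m; i++) d[i] = [i];
  for (var j = 0; j <= n; j++) d[0][j] = j;

  for (var i = 1; i <= m; i++) {
    for (var j = 1; j <= n; j++) {
      var cost = source[i - 1] === target[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution
      );
    }
  }

  // Normalize: 1 = identical, 0 = completely different
  return 1 - d[m][n] / Math.max(m, n);
""";
```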

When executing this query, it will then appear in your project structure as follows:

Functions appear under Routines in the dataset you specified (Image by Author)

Note that this function, instead of returning the actual difference as a number of edits (returning 1 if we made 1 insertion), returns a value on a 0 to 1 scale, where 1 means completely similar and 0 means not similar at all.

In our first theoretical example, Bigquer → Bigquery, the letter y was added, meaning a distance of 1. Our function first computes the number of edits divided by the length of the longest input string, in this case Bigquery with 8 characters (1 insertion / 8 characters), which is the normalized distance.

To get the similarity, we simply reverse it: 1 - (1/8) = 0.875.

But first, let’s test it in a real SQL query!

Using the function in a SQL query

Now that the function has been saved as a persistent UDF, you can simply call it by writing its location. In our example, it will be called using the following:

`datastic.utility.levenshtein`(source, target)

We prepared several cases to test the similarity score produced by our method.

Test cases to evaluate the Levenshtein distance function
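The original query isn’t embedded in this copy of the article; a reconstruction consistent with the test cases discussed below (the exact pairs in the original gist may differ) could look like this:

```sql
SELECT
  source,
  target,
  `datastic.utility.levenshtein`(source, target) AS similarity
FROM UNNEST(ARRAY<STRUCT<source STRING, target STRING>>[
  ('Bigquer', 'Bigquery'),  -- one insertion
  ('music',   'mujic'),     -- one substitution
  ('medium',  'medium'),    -- identical strings
  ('france',  'PIzz')       -- nothing in common
]);
```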

The query will output the following results:

Results of the Levenshtein distance for each of our test cases (Image by Author)

In the first case, there is an insertion of 1 letter, which results in a similarity between the two words of 0.88!

The two extremes are also demonstrated: with the identical combination of words medium | medium, the result is a similarity of 1, and with the completely different combination of words france | PIzz, the result is a similarity of 0.

A real-world example with publicly available data

The use case we want to solve is correcting user country data from our CRM. We noticed that there was a lot of misspelled data, and we’ll use BigQuery public data to help us.

This table contains the data provided by BigQuery and includes all country names in the following format:

BigQuery public country data table (Image by Author)

Now, our CRM data table contains the following information: an email and a country (apparently with a lot of misspelled words).

Our customer data table (Image by Author)

Let’s try to solve our real-world use case. For that, we’ll perform the following query, which consists of four steps (sketched after the list):

  • Loading country data from BigQuery
  • Loading our CRM data
  • Applying a CROSS JOIN statement and computing our Levenshtein function
  • Ranking the results to find the most similar country
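The original query isn’t embedded in this copy of the article; a sketch of those four steps, assuming the public table `bigquery-public-data.utility_us.country_code_iso` (with a country_name column) and a hypothetical CRM table `your_project.crm.customers` with email and country columns, might look like this:

```sql
WITH countries AS (
  -- 1. Load country reference data from a BigQuery public table
  --    (table and column names are assumptions; adjust to the table you use)
  SELECT country_name
  FROM `bigquery-public-data.utility_us.country_code_iso`
),
crm AS (
  -- 2. Load our CRM data: an email and a possibly misspelled country
  SELECT email, country
  FROM `your_project.crm.customers`
)
-- 3. CROSS JOIN every CRM row with every country and compute the similarity
SELECT
  crm.email,
  crm.country AS crm_country,
  countries.country_name AS matched_country,
  `datastic.utility.levenshtein`(crm.country, countries.country_name) AS similarity
FROM crm
CROSS JOIN countries
WHERE TRUE  -- BigQuery may require a WHERE/GROUP BY/HAVING alongside QUALIFY
-- 4. Keep only the most similar country for each customer
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY crm.email
  ORDER BY `datastic.utility.levenshtein`(crm.country, countries.country_name) DESC
) = 1
ORDER BY email;
```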

And, tadam 🎉, the results are as follows:

Results of the Levenshtein function on our CRM dataset (Image by Author)

As you can see, this method has some caveats, as we sometimes have more than one row returned because two sets of strings are very similar (this is the case for spainiain, which is equally close to Spain, Bahrain, and Taiwan).

We decided to use a CROSS JOIN statement to join our two data tables. This means that we compare each row of our CRM table with all of the possible countries present in the country table. In our example, our CRM data has 7 rows and our country dataset has 250, which produces a result of 7×250=1,750 rows.

This result is reduced by using the QUALIFY clause to keep only the most similar country. But depending on your use case, the CROSS JOIN solution may require a lot of computing.

Going further with bigger datasets

For our example, we used a small CRM dataset (7 rows). In comparison, I reused the same country table (250 rows) on a dataset of 8 million customers, and it took around 33 seconds with BigQuery, which is still quite fast. If you were to use even bigger datasets (maybe billions of rows), then I would suggest using ARRAYs rather than JOINs, which in BigQuery would require fewer resources.
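As an illustration only (not the author’s original code, and reusing the same assumed table names as above), the CROSS JOIN against the country table could be replaced by scanning a single aggregated ARRAY per CRM row:

```sql
WITH countries AS (
  -- Collect all country names once into a single array
  SELECT ARRAY_AGG(country_name) AS names
  FROM `bigquery-public-data.utility_us.country_code_iso`
)
SELECT
  crm.email,
  crm.country AS crm_country,
  -- For each CRM row, keep the country name with the highest similarity
  (SELECT name
   FROM UNNEST(countries.names) AS name
   ORDER BY `datastic.utility.levenshtein`(crm.country, name) DESC
   LIMIT 1) AS matched_country
FROM `your_project.crm.customers` AS crm
CROSS JOIN countries;
```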

The second problem to solve is choosing the right match when our Levenshtein distance function returns more than one result. My recommendation would be to combine the Levenshtein method with additional ones, like regular expressions or Soundex, which could give more weight to the similarity score and probably allow a better selection.
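For instance (a toy sketch, not from the original article, using BigQuery’s built-in SOUNDEX function and the spainiain case above), the two signals could be combined like this:

```sql
SELECT
  candidate,
  `datastic.utility.levenshtein`('spainiain', candidate) AS levenshtein_similarity,
  SOUNDEX('spainiain') = SOUNDEX(candidate) AS soundex_match
FROM UNNEST(['Spain', 'Bahrain', 'Taiwan']) AS candidate
ORDER BY soundex_match DESC, levenshtein_similarity DESC;
```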
