Friday, September 23, 2022

BigQuery SQL Functions For Data Cleaning | by Vicky Yu | Sep, 2022


Use cases and functions to apply

Image by Rosy - The world is worth thousands of pictures from Pixabay

Data cleaning is an essential part of any data-related position, whether you're a data engineer, data scientist, or data analyst. Today I want to share a few BigQuery SQL functions for data cleaning and a use case I could've used each of them for.

Invisible Special Characters In String Values

String values can contain special characters that don't display on the screen but are stored in the database. I learned this the hard way when I applied a WHERE clause on a string field and got 0 records found. This was extremely frustrating, and I had to find functions that exposed the Unicode values so I could remove them from the string before applying the WHERE clause to return the records I knew existed.

BigQuery has a NORMALIZE function that handles this exact situation. Below are 3 records with Unicode values between Jane and Smith that are invisible in the query results.

Screenshot example created by author

If I use a WHERE clause for Jane Smith, no records are returned.

Screenshot example created by author

However, if I use the NORMALIZE function on the name field, the Unicode values are normalized and the three Jane Smith records are returned in the query results.

Screenshot example using NORMALIZE function created by author
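
Since the screenshots are not reproduced here, the following is a minimal sketch of the same idea. The `customers` data and the specific characters (the Unicode spaces U+2004, U+2005, and U+2006) are assumptions for illustration, as is the choice of the NFKC normalization mode; the original queries may have differed.

```sql
-- Hypothetical sample data: the three "Jane Smith" rows are separated by
-- different Unicode space characters that look like a normal space on screen.
WITH customers AS (
  SELECT name
  FROM UNNEST([
    'Jane\u2004Smith',
    'Jane\u2005Smith',
    'Jane\u2006Smith',
    'John Smith'
  ]) AS name
)
SELECT
  name,
  -- Plain comparison: FALSE for the Jane rows because the separator is not U+0020.
  name = 'Jane Smith' AS matches_raw,
  -- NFKC maps these compatibility space characters to a regular space.
  NORMALIZE(name, NFKC) = 'Jane Smith' AS matches_normalized
FROM customers;
```

In a filter this becomes `WHERE NORMALIZE(name, NFKC) = 'Jane Smith'`. Note that the default mode, NFC, leaves these space variants unchanged, and that normalization does not strip every invisible character (a zero-width space, for example, would likely need something like REGEXP_REPLACE instead).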

Special Mention: BigQuery also has a NORMALIZE_AND_CASEFOLD function if you want string comparison to be case-insensitive, i.e., records containing Jane Smith or jane smith will both be returned in the query results.
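
A short sketch of the case-insensitive variant, again using hypothetical sample names and assuming NFKC mode:

```sql
-- Hypothetical sample data mixing letter case and a Unicode space variant.
WITH customers AS (
  SELECT name
  FROM UNNEST([
    'Jane\u2004Smith',
    'jane smith',
    'JANE SMITH',
    'John Smith'
  ]) AS name
)
SELECT name
FROM customers
-- Normalize and case-fold both sides so case and space variants compare equal.
WHERE NORMALIZE_AND_CASEFOLD(name, NFKC)
    = NORMALIZE_AND_CASEFOLD('Jane Smith', NFKC);
```

With this data, the three Jane Smith variants should match and John Smith should not.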

Pattern Matching

I've always used the LIKE operator for pattern matching in a string field. Recently I had to categorize referring URLs of website visitors to match Google Analytics channel reporting. Since I didn't know whether the URLs were upper or lowercase, I had to use the LOWER function to convert the field to all lowercase before checking for a pattern match.

BigQuery has a CONTAINS_SUBSTR function for this case. Not only does CONTAINS_SUBSTR perform case-insensitive pattern checking, it can also check for pattern values in numeric fields, timestamps, and arrays.

In the example below, I check if the breakfast field contains the string pancakes in all lowercase. Both rows are returned in the query results even though each record has a capitalized letter in pancakes.

Screenshot example using CONTAINS_SUBSTR function created by author
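
As a rough text equivalent of the screenshot, here is a sketch with a hypothetical `meals` table; the row values are invented for illustration.

```sql
-- Hypothetical sample data: two rows mention pancakes with different casing.
WITH meals AS (
  SELECT 'Blueberry Pancakes' AS breakfast UNION ALL
  SELECT 'PANCAKES and eggs' UNION ALL
  SELECT 'Oatmeal'
)
SELECT breakfast
FROM meals
-- Case-insensitive substring check; the search value must be a string literal.
-- Roughly comparable to: WHERE LOWER(breakfast) LIKE '%pancakes%'
WHERE CONTAINS_SUBSTR(breakfast, 'pancakes');
```

Only the two pancake rows should come back, without any explicit LOWER call.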

Special Mention: BigQuery also has an ENDS_WITH function to check if a string ends with a pattern. A common use case I could've used this for was checking if an email ended with .edu to confirm a user was a student.
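
The .edu check could look something like the sketch below. The `users` data is hypothetical, and the LOWER call is an addition here because ENDS_WITH compares case-sensitively.

```sql
-- Hypothetical sample data with mixed-case email addresses.
WITH users AS (
  SELECT email
  FROM UNNEST([
    'ana@university.edu',
    'bob@example.com',
    'CHRIS@COLLEGE.EDU'
  ]) AS email
)
SELECT
  email,
  -- Lowercase first so '.EDU' also counts as a match.
  ENDS_WITH(LOWER(email), '.edu') AS is_edu_email
FROM users;
```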

Date Formatting

In the past, I always downloaded query results from SQL into Excel to format dates for reporting purposes because I wasn't able to format the dates the way I needed with SQL. This was time-consuming when I had a large amount of data to format.

BigQuery has a FORMAT_DATE function to handle date formatting. In the example below, Sept. 30, 2022 is formatted in three different ways based on the format string.

Screenshot example using FORMAT_DATE function created by author
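
The three format strings used in the screenshot are not recoverable, so the ones below are assumed examples that show how the format string drives the output.

```sql
SELECT
  -- Numeric month/day/year: 09/30/2022
  FORMAT_DATE('%m/%d/%Y', DATE '2022-09-30') AS us_style,
  -- Abbreviated month name: Sep 30, 2022
  FORMAT_DATE('%b %d, %Y', DATE '2022-09-30') AS short_month,
  -- Full weekday and month names: Friday, September 30, 2022
  FORMAT_DATE('%A, %B %d, %Y', DATE '2022-09-30') AS long_form;
```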

Special Mention: Besides FORMAT_DATE, you can also use FORMAT_DATETIME to format datetime values. There's also a FORMAT function to format a field as a string value. One use case for this function is to format large numbers with comma separators. Instead of 1000000, you can use the FORMAT function to display 1,000,000 in the query results.
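
A brief sketch of both functions; the datetime value and format strings are illustrative choices, not taken from the original.

```sql
SELECT
  -- The ' flag in the format specifier adds grouping separators: 1,000,000
  FORMAT("%'d", 1000000) AS with_commas,
  -- FORMAT_DATETIME takes a format string like FORMAT_DATE, plus time elements.
  FORMAT_DATETIME('%Y-%m-%d %H:%M', DATETIME '2022-09-30 14:05:00') AS formatted_datetime;
```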

Dividing With A Zero Denominator

I often had to calculate percentages where the denominator could be 0, which would return a SQL error when dividing by 0. One option was to use a CASE statement to check if the denominator was 0 before dividing to avoid an error, but most databases have a function to handle this case.

In BigQuery's case, the function is called SAFE_DIVIDE. In the example below, I divide 10 by 0 and get a division by zero error.

Screenshot of division error example created by author

When I use SAFE_DIVIDE, the result is a null value instead of an error.

Screenshot of SAFE_DIVIDE example created by author
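
To make the comparison concrete, here is a sketch that puts SAFE_DIVIDE next to the CASE approach mentioned earlier. The `campaign_stats` table and its columns are hypothetical.

```sql
-- Hypothetical sample data: campaign B has a zero denominator.
WITH campaign_stats AS (
  SELECT 'A' AS campaign, 10 AS clicks, 200 AS impressions UNION ALL
  SELECT 'B', 10, 0
)
SELECT
  campaign,
  -- Returns NULL instead of raising a division by zero error.
  SAFE_DIVIDE(clicks, impressions) AS click_rate,
  -- The manual guard that SAFE_DIVIDE replaces.
  CASE WHEN impressions = 0 THEN NULL ELSE clicks / impressions END AS click_rate_case,
  -- Optional: report 0 rather than NULL when the denominator is 0.
  IFNULL(SAFE_DIVIDE(clicks, impressions), 0) AS click_rate_or_zero
FROM campaign_stats;
```

Whether a zero denominator should show as NULL or 0 depends on the report, so the IFNULL wrapper is a design choice rather than a requirement.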

Special Mention: BigQuery also has SAFE_ADD, SAFE_SUBTRACT, SAFE_MULTIPLY, and SAFE_NEGATE functions that will return a null value if an overflow occurs.
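
A small sketch of the overflow behavior, using literals near the INT64 limits chosen only for illustration:

```sql
SELECT
  -- Each of the first three results would exceed the INT64 range, so NULL is returned.
  SAFE_ADD(9223372036854775807, 1) AS add_overflow,
  SAFE_SUBTRACT(-9223372036854775807, 2) AS subtract_overflow,
  SAFE_MULTIPLY(9223372036854775807, 2) AS multiply_overflow,
  -- No overflow here, so the normal result (-5) is returned.
  SAFE_NEGATE(5) AS negate_ok;
```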

Final Thoughts

While we can never get away from data cleaning, there are SQL functions that can help. I hope you learned a new function or two that'll be useful in the future. While the functions I mentioned are in BigQuery, they may be available in your database too.

Note: All queries above were run on the BigQuery sandbox, which is free to anyone with a Google account.
