Friday, October 14, 2022
HomeData ScienceWITH vs. Temp Tables - BigQuery SQL Optimization Half 2 | by...

WITH vs. Temp Tables – BigQuery SQL Optimization Half 2 | by Martin Weitzmann


When to make use of Short-term Tables as a substitute of WITH

One of many largest efficiency killers for queries is to make use of WITH as a substitute of CREATE TEMP TABLE in conditions while you actually shouldn’t! After clarifying that we must always Filter as Early as Potential partially certainly one of this introductory collection to SQL optimization, let’s proceed with when to make use of or keep away from WITH.

(Picture by Creator)

WITH statements are additionally referred to as Frequent Desk Expression (CTE). They assist to de-clutter queries and make them extra readable as a result of they pull sub-queries out of their context and provides them a reputation.

I.e.

SELECT a, b, c
FROM (SELECT x, y, z FROM ...)

turns into

WITH my_CTE AS (SELECT x, y, z FROM ...)SELECT a, b, c
FROM my_CTE

my_CTE seems to be like a desk as a result of it’s behind a FROM — however it’s not a desk. It’s extra like stay directions to create a consequence desk on the fly at runtime each time it’s referred to as.

That’s wild, proper? CTEs don’t memorize the outcomes of any former execution! So in case you do issues like this …

WITH a AS (...),a1 AS (SELECT aggr_1 FROM a),a2 AS (SELECT aggr_2 FROM a)
SELECT ... FROM a1 LEFT JOIN a2

… then that you must change this question as quickly as you completed studying this text, since you put a number of pointless load in your question engine and computation slots by calculating a twice!

So what ought to we do? a needs to be a brief desk as a substitute, as a result of these memorize their outcomes — at the least throughout the question operating.

The question earlier ought to seem like this:

CREATE TEMP TABLE a AS (...)WITH a1 AS (SELECT aggr_1 FROM a),a2 AS (SELECT aggr_2 FROM a)
SELECT ... FROM a1 LEFT JOIN a2

We compute a solely as soon as and use it for the 2 completely different aggregations in a1 and a2.

It’s not that completely different from the sooner instance, proper? However it’ll carry out higher as a result of we save the second computation of a.

And also you don’t want to fret about creating the desk in a particular dataset or deleting the desk both — it will likely be taken care of by BigQuery and vanish after your SQL assertion is finished operating.

I’ve seen queries the place a CTE bought referenced greater than 5 instances. Refactoring that one into a brief desk helped fairly a bit, to say the least. To show that it helped, we are able to pattern a few runs:

  • have each queries prepared, optimized and un-optimized
  • de-activate caching
  • maintain whole workload cheap by decreasing quantity of knowledge queried
  • run each queries round 5 instances, kind of on the similar time and evaluate the imply of their common slot time

In addition to the modifications proven above you may run into situations the place you wish to combine CTEs with non permanent tables. Should you actually solely have to run a CTE one time then it will likely be barely sooner than operating after which storing it briefly. So if we are able to skip the non permanent storage step we must always do this. So how can we combine CTEs and non permanent tables?

You’ll be able to consider CREATE TEMP TABLE because the extra basic operation. With a purpose to use them each collectively, it’ll merely include CTE definitions, as a result of they’re simply making sub-queries extra readable however are primarily the identical factor as a sub-query:

CREATE TEMP TABLE a AS (  WITH x AS (...),
y as (...)
SELECT ... FROM x LEFT JOIN y ON ...)
SELECT ... FROM a ...

This question will use CTE x (as outlined inside the definition of a) to create the non permanent desk a.

To summarize: Use CTEs to tidy up your SQL statements and make them extra readable. However don’t reference a CTE extra then as soon as as a result of the question engine will recalculate the outcomes once more each time. For that case use non permanent tables as a substitute — they may add an additional storage step to the processing value, however that (+ studying from the temp desk) might be cheaper than recalculating the entire question once more.

Don’t overlook to mix this greatest observe with Filtering as early as potential!

Pleased refactoring!

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
Google search engine

Most Popular

Recent Comments