I’ve discovered an attention-grabbing query on Twitter, lately. Is there any efficiency impression of utilizing FILTER
in SQL (PostgreSQL, particularly), or is it simply syntax sugar for a CASE
expression in an mixture operate?
As a fast reminder, FILTER
is an superior normal SQL extension to filter out values earlier than aggregating them in SQL. That is very helpful when aggregating a number of issues in a single question.
These two are the identical:
SELECT
fa.actor_id,
-- These:
SUM(size) FILTER (WHERE ranking = 'R'),
SUM(size) FILTER (WHERE ranking = 'PG'),
-- Are the identical as these:
SUM(CASE WHEN ranking = 'R' THEN size END),
SUM(CASE WHEN ranking = 'PG' THEN size END)
FROM film_actor AS fa
LEFT JOIN movie AS f
ON f.film_id = fa.film_id
GROUP BY fa.actor_id
As of jOOQ 3.17, these SQL dialects are recognized to assist FILTER
natively:
- CockroachDB
- Firebird
- H2
- HSQLDB
- PostgreSQL
- SQLite
- YugabyteDB
Ought to it matter?
However again to the query. Does it actually matter by way of efficiency? Ought to it? Clearly, it shouldn’t matter. The 2 sorts of mixture operate expressions will be confirmed to imply precisely the identical factor. And in reality, that’s what jOOQ does if you happen to’re utilizing FILTER
on another SQL dialect. Put the above question in our SQL translation software, translate to Oracle, for instance, and also you’ll be getting:
SELECT
fa.actor_id,
sum(CASE WHEN ranking = 'R' THEN size END),
sum(CASE WHEN ranking = 'PG' THEN size END),
sum(CASE WHEN ranking = 'R' THEN size END),
sum(CASE WHEN ranking = 'PG' THEN size END)
FROM film_actor fa
LEFT JOIN movie f
ON f.film_id = fa.film_id
GROUP BY fa.actor_id
The opposite manner needs to be attainable as nicely in an optimiser.
Does it matter?
However is that this being performed? Let’s strive evaluating the next 2 queries on PostgreSQL, in opposition to the sakila database:
Question 1:
SELECT
fa.actor_id,
SUM(size) FILTER (WHERE ranking = 'R'),
SUM(size) FILTER (WHERE ranking = 'PG')
FROM film_actor AS fa
LEFT JOIN movie AS f
ON f.film_id = fa.film_id
GROUP BY fa.actor_id
Question 2:
SELECT
fa.actor_id,
SUM(CASE WHEN ranking = 'R' THEN size END),
SUM(CASE WHEN ranking = 'PG' THEN size END)
FROM film_actor AS fa
LEFT JOIN movie AS f
ON f.film_id = fa.film_id
GROUP BY fa.actor_id
I can be utilizing this benchmark approach, and can publish the benchmark code on the finish of this weblog publish. The outcomes of working every question 500x are clear (much less time is healthier):
Run 1, Assertion 1: 00:00:00.786621
Run 1, Assertion 2: 00:00:00.839966
Run 2, Assertion 1: 00:00:00.775477
Run 2, Assertion 2: 00:00:00.829746
Run 3, Assertion 1: 00:00:00.774942
Run 3, Assertion 2: 00:00:00.834745
Run 4, Assertion 1: 00:00:00.776973
Run 4, Assertion 2: 00:00:00.836655
Run 5, Assertion 1: 00:00:00.775871
Run 5, Assertion 2: 00:00:00.845209
There’s a constant 8% efficiency penalty for utilizing the CASE
syntax, in comparison with the FILTER
syntax on my machine, working PostgreSQL 15 in docker. The precise distinction in a non-benchmark question will not be as spectacular, or extra spectacular, relying on {hardware} and knowledge units. However clearly, one factor appears to be a bit higher on this case than the opposite.
Since all these syntaxes are usually utilized in a reporting context, the variations can positively matter.
Including an auxiliary predicate
You would possibly suppose there’s further optimisation potential, if we make the predicates on the RATING
column redundant, like this:
Question 1:
SELECT
fa.actor_id,
SUM(size) FILTER (WHERE ranking = 'R'),
SUM(size) FILTER (WHERE ranking = 'PG')
FROM film_actor AS fa
LEFT JOIN movie AS f
ON f.film_id = fa.film_id
AND ranking IN ('R', 'PG') -- Redundant predicate right here
GROUP BY fa.actor_id
Question 2:
SELECT
fa.actor_id,
SUM(CASE WHEN ranking = 'R' THEN size END),
SUM(CASE WHEN ranking = 'PG' THEN size END)
FROM film_actor AS fa
LEFT JOIN movie AS f
ON f.film_id = fa.film_id
AND ranking IN ('R', 'PG')
GROUP BY fa.actor_id
Word it needs to be positioned within the LEFT JOIN
‘s ON
clause, so as to not tamper with the outcomes. It could’t be positioned within the question’s WHERE
clause. A proof for this distinction is right here.
What’s going to the benchmark yield now?
Run 1, Assertion 1: 00:00:00.701943
Run 1, Assertion 2: 00:00:00.747103
Run 2, Assertion 1: 00:00:00.69377
Run 2, Assertion 2: 00:00:00.746252
Run 3, Assertion 1: 00:00:00.684777
Run 3, Assertion 2: 00:00:00.745419
Run 4, Assertion 1: 00:00:00.688584
Run 4, Assertion 2: 00:00:00.740979
Run 5, Assertion 1: 00:00:00.688878
Run 5, Assertion 2: 00:00:00.742864
So, certainly, the redundant predicate improved issues (in an ideal world, it shouldn’t, however right here we’re. The optimiser doesn’t optimise this in addition to it might). However nonetheless, the FILTER
clause outperforms CASE
clause utilization.
Conclusion
In an ideal world, two provably equal SQL syntaxes additionally carry out the identical manner. However this isn’t at all times the case in the true world, the place optimisers make tradeoffs between:
- Time spent optimising uncommon syntaxes
- Time spent executing queries
In a earlier weblog publish (which might be outdated by now), I’ve proven quite a lot of these circumstances, the place the optimisation resolution doesn’t rely upon any value mannequin and knowledge units and may at all times be performed, ideally. There was an inclination of such optimisations being favoured by RDBMS which have an execution plan cache (e.g. Db2, Oracle, SQL Server), in case of which the optimisation must be performed solely as soon as per cached plan, after which the plan will be reused. In RDBMS that don’t have such a cache, optimisation time is extra expensive per question, so much less will be anticipated.
I believe this can be a case the place it’s value wanting into easy patterns of expressions in mixture features. AGG(CASE ..)
is such a well-liked idiom, and eight% is sort of the numerous enchancment, that I believe PostgreSQL ought to repair this. We’ll see. In any case, since FILTER
is already:
- Higher performing
- Higher wanting
You possibly can safely change to this good normal SQL syntax already now.
Benchmark code
As promised, this was the benchmark code used for this weblog publish:
DO $$
DECLARE
v_ts TIMESTAMP;
v_repeat CONSTANT INT := 500;
rec RECORD;
BEGIN
-- Repeat the entire benchmark a number of occasions to keep away from warmup penalty
FOR r IN 1..5 LOOP
v_ts := clock_timestamp();
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT
fa.actor_id,
SUM(size) FILTER (WHERE ranking = 'R'),
SUM(size) FILTER (WHERE ranking = 'PG')
FROM film_actor AS fa
LEFT JOIN movie AS f
ON f.film_id = fa.film_id
AND ranking IN ('R', 'PG')
GROUP BY fa.actor_id
) LOOP
NULL;
END LOOP;
END LOOP;
RAISE INFO 'Run %, Assertion 1: %', r, (clock_timestamp() - v_ts);
v_ts := clock_timestamp();
FOR i IN 1..v_repeat LOOP
FOR rec IN (
SELECT
fa.actor_id,
SUM(CASE WHEN ranking = 'R' THEN size END),
SUM(CASE WHEN ranking = 'PG' THEN size END)
FROM film_actor AS fa
LEFT JOIN movie AS f
ON f.film_id = fa.film_id
AND ranking IN ('R', 'PG')
GROUP BY fa.actor_id
) LOOP
NULL;
END LOOP;
END LOOP;
RAISE INFO 'Run %, Assertion 2: %', r, (clock_timestamp() - v_ts);
RAISE INFO '';
END LOOP;
END$$;