In a earlier weblog submit, we’ve marketed the use of SQL EXISTS
fairly than COUNT(*)
to verify for existence of a worth in SQL.
I.e. to verify if within the Sakila database, actors referred to as WAHLBERG have performed in any movies, as a substitute of:
SELECT rely(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name="WAHLBERG"
Do that:
SELECT EXISTS (
SELECT 1 FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name="WAHLBERG"
)
(Relying in your dialect you could require a FROM DUAL
clause, or a CASE
expression if BOOLEAN
sorts aren’t supported).
Verify for a number of rows
However what if you wish to verify if there are a minimum of 2
(or N
) rows? In that case, you can not use EXISTS
, however must revert to utilizing COUNT(*)
. Nevertheless, as a substitute of simply counting all matches, why not add a LIMIT
clause as properly? So, if you wish to verify if actors referred to as WAHLBERG have performed in a minimum of 2 movies, as a substitute of this:
SELECT (
SELECT rely(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name="WAHLBERG"
) >= 2
Write this:
SELECT (
SELECT rely(*)
FROM (
SELECT *
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name="WAHLBERG"
LIMIT 2
) t
) >= 2
In different phrases:
- Run the be a part of question with a
LIMIT 2
in a derived desk - Then
COUNT(*)
the rows (at most 2) from that derived desk - Lastly, verify if the rely is excessive sufficient
Does it matter?
In precept, the optimiser might have figured this out itself, particularly as a result of we used a continuing to match the COUNT(*)
worth with. However did it actually apply the transformation?
Let’s verify execution plans and benchmark the question on numerous RDBMS.
PostgreSQL 15
No LIMIT
Outcome (value=14.70..14.71 rows=1 width=1) (precise time=0.039..0.039 rows=1 loops=1)
InitPlan 1 (returns $1)
-> Mixture (value=14.69..14.70 rows=1 width=8) (precise time=0.037..0.037 rows=1 loops=1)
-> Nested Loop (value=0.28..14.55 rows=55 width=0) (precise time=0.009..0.032 rows=56 loops=1)
-> Seq Scan on actor a (value=0.00..4.50 rows=2 width=4) (precise time=0.006..0.018 rows=2 loops=1)
Filter: ((last_name)::textual content="WAHLBERG"::textual content)
Rows Eliminated by Filter: 198
-> Index Solely Scan utilizing film_actor_pkey on film_actor fa (value=0.28..4.75 rows=27 width=4) (precise time=0.003..0.005 rows=28 loops=2)
Index Cond: (actor_id = a.actor_id)
Heap Fetches: 0
With LIMIT
Outcome (value=0.84..0.85 rows=1 width=1) (precise time=0.023..0.024 rows=1 loops=1)
InitPlan 1 (returns $1)
-> Mixture (value=0.83..0.84 rows=1 width=8) (precise time=0.021..0.022 rows=1 loops=1)
-> Restrict (value=0.28..0.80 rows=2 width=240) (precise time=0.016..0.018 rows=2 loops=1)
-> Nested Loop (value=0.28..14.55 rows=55 width=240) (precise time=0.015..0.016 rows=2 loops=1)
-> Seq Scan on actor a (value=0.00..4.50 rows=2 width=4) (precise time=0.008..0.008 rows=1 loops=1)
Filter: ((last_name)::textual content="WAHLBERG"::textual content)
Rows Eliminated by Filter: 1
-> Index Solely Scan utilizing film_actor_pkey on film_actor fa (value=0.28..4.75 rows=27 width=4) (precise time=0.005..0.005 rows=2 loops=1)
Index Cond: (actor_id = a.actor_id)
Heap Fetches: 0
To know the distinction, deal with these rows:
Earlier than:
Nested Loop (value=0.28..14.55 rows=55 width=0) (precise time=0.009..0.032 rows=56 loops=1)
After:
Nested Loop (value=0.28..14.55 rows=55 width=240) (precise time=0.015..0.016 rows=2 loops=1)
In each circumstances, the estimated variety of rows produced by the be a part of is 55 (i.e. all WAHLBERGs are anticipated to have performed in a complete of 55 movies in keeping with statistics).
However int he second execution the precise rows worth is far decrease, as a result of we solely wanted 2 rows earlier than we might cease execution of the operation, due to the LIMIT
above.
Benchmark outcomes:
Utilizing our advisable SQL benchmarking method that compares working two queries many instances (5 runs x 2000 executions on this case) on the identical occasion straight from throughout the RDBMS utilizing procedural languages (to keep away from community latency, and so on.), we get these outcomes:
RUN 1, Assertion 1: 2.61927
RUN 1, Assertion 2: 1.01506
RUN 2, Assertion 1: 2.47193
RUN 2, Assertion 2: 1.00614
RUN 3, Assertion 1: 2.63533
RUN 3, Assertion 2: 1.14282
RUN 4, Assertion 1: 2.55228
RUN 4, Assertion 2: 1.00000 -- Quickest run is 1
RUN 5, Assertion 1: 2.53801
RUN 5, Assertion 2: 1.02363
The quickest run is 1
models of time, slower runs run in multiples of that point. The entire COUNT(*)
question is persistently and considerably slower than the LIMIT
question.
Each the plans and benchmark outcomes converse for themselves.
Oracle 23c
With Oracle 23c, we will lastly use BOOLEAN
sorts and omit DUAL
, yay!
No FETCH FIRST
:
SQL_ID 40yy0tskvs1zw, baby quantity 0
-------------------------------------
SELECT /*+GATHER_PLAN_STATISTICS*/ ( SELECT rely(*)
FROM actor a JOIN film_actor fa USING (actor_id)
WHERE a.last_name="WAHLBERG" ) >= 2Plan hash worth: 2539243977
---------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Identify | Begins | E-Rows | A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 0 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 6 |
| 2 | NESTED LOOPS | | 1 | 55 | 56 |00:00:00.01 | 6 |
| 3 | TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR | 1 | 2 | 2 |00:00:00.01 | 2 |
|* 4 | INDEX RANGE SCAN | IDX_ACTOR_LAST_NAME | 1 | 2 | 2 |00:00:00.01 | 1 |
|* 5 | INDEX RANGE SCAN | IDX_FK_FILM_ACTOR_ACTOR | 2 | 27 | 56 |00:00:00.01 | 4 |
| 6 | FAST DUAL | | 1 | 1 | 1 |00:00:00.01 | 0 |
---------------------------------------------------------------------------------------------------------------------------Predicate Info (recognized by operation id):
---------------------------------------------------4 - entry("A"."LAST_NAME"='WAHLBERG')
5 - entry("A"."ACTOR_ID"="FA"."ACTOR_ID")
With FETCH FIRST
:
SQL_ID f88t1r0avnr7b, baby quantity 0
-------------------------------------
SELECT /*+GATHER_PLAN_STATISTICS*/( SELECT rely(*)
from ( choose * FROM actor a JOIN
film_actor fa USING (actor_id) WHERE a.last_name =
'WAHLBERG' FETCH FIRST 2 ROWS ONLY ) t )
>= 2Plan hash worth: 4019277616
------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Identify | Begins | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 0 | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 6 | | | |
|* 2 | VIEW | | 1 | 2 | 2 |00:00:00.01 | 6 | | | |
|* 3 | WINDOW BUFFER PUSHED RANK | | 1 | 55 | 2 |00:00:00.01 | 6 | 2048 | 2048 | 2048 (0)|
| 4 | NESTED LOOPS | | 1 | 55 | 56 |00:00:00.01 | 6 | | | |
| 5 | TABLE ACCESS BY INDEX ROWID| ACTOR | 1 | 2 | 2 |00:00:00.01 | 2 | | | |
|* 6 | INDEX RANGE SCAN | IDX_ACTOR_LAST_NAME | 1 | 2 | 2 |00:00:00.01 | 1 | | | |
|* 7 | INDEX RANGE SCAN | IDX_FK_FILM_ACTOR_ACTOR | 2 | 27 | 56 |00:00:00.01 | 4 | | | |
| 8 | FAST DUAL | | 1 | 1 | 1 |00:00:00.01 | 0 | | | |
------------------------------------------------------------------------------------------------------------------------------------------------Predicate Info (recognized by operation id):
---------------------------------------------------2 - filter("from$_subquery$_005"."rowlimit_$$_rownumber"<=2)
3 - filter(ROW_NUMBER() OVER ( ORDER BY NULL )<=2)
6 - entry("A"."LAST_NAME"='WAHLBERG')
7 - entry("A"."ACTOR_ID"="FA"."ACTOR_ID")
Uh oh, this doesn’t look higher. The NESTED LOOPS
operation doesn’t appear to have gotten the memo from the WINDOW BUFFER PUSHED RANK
operation in regards to the question being aborted. The E-Rows
(estimated) and A-Rows
(precise) values nonetheless match, so the JOIN
appears to be executed fully.
For good measure, let’s additionally attempt:
With ROWNUM
:
I had hoped that this undead syntax belongs solely to distant reminiscences after Oracle 12c launched the usual SQL FETCH
syntax, however let’s attempt what occurs with this various:
SELECT (
SELECT rely(*)
FROM (
SELECT *
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name="WAHLBERG"
AND ROWNUM <= 2 -- Yuck, however it works
) t
) >= 2
The plan is now:
SQL_ID 6r7w9d0425j6c, baby quantity 0
-------------------------------------
SELECT /*+GATHER_PLAN_STATISTICS*/( SELECT rely(*)
from ( choose * FROM actor a JOIN
film_actor fa USING (actor_id) WHERE a.last_name =
'WAHLBERG' AND ROWNUM <= 2 ) t ) >= 2Plan hash worth: 1271700124
-----------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Identify | Begins | E-Rows | A-Rows | A-Time | Buffers |
-----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 0 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 4 |
| 2 | VIEW | | 1 | 2 | 2 |00:00:00.01 | 4 |
|* 3 | COUNT STOPKEY | | 1 | | 2 |00:00:00.01 | 4 |
| 4 | NESTED LOOPS | | 1 | 55 | 2 |00:00:00.01 | 4 |
| 5 | TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR | 1 | 2 | 1 |00:00:00.01 | 2 |
|* 6 | INDEX RANGE SCAN | IDX_ACTOR_LAST_NAME | 1 | 2 | 1 |00:00:00.01 | 1 |
|* 7 | INDEX RANGE SCAN | IDX_FK_FILM_ACTOR_ACTOR | 1 | 27 | 2 |00:00:00.01 | 2 |
| 8 | FAST DUAL | | 1 | 1 | 1 |00:00:00.01 | 0 |
-----------------------------------------------------------------------------------------------------------------------------Predicate Info (recognized by operation id):
---------------------------------------------------3 - filter(ROWNUM<=2)
6 - entry("A"."LAST_NAME"='WAHLBERG')
7 - entry("A"."ACTOR_ID"="FA"."ACTOR_ID")
Now, that’s what I’m speaking about. The NESTED LOOPS
operation has a A-Rows
worth of 2
, because it ought to have. The COUNT STOPKEY
operation is aware of how you can inform its successors to behave.
Benchmark outcomes:
Run 1, Assertion 1 : 1.9564
Run 1, Assertion 2 : 2.98499
Run 1, Assertion 3 : 1.07291
Run 2, Assertion 1 : 1.69192
Run 2, Assertion 2 : 2.66905
Run 2, Assertion 3 : 1.01144
Run 3, Assertion 1 : 1.71051
Run 3, Assertion 2 : 2.63831
Run 3, Assertion 3 : 1 -- Quickest run is 1
Run 4, Assertion 1 : 1.61544
Run 4, Assertion 2 : 2.67334
Run 4, Assertion 3 : 1.00786
Run 5, Assertion 1 : 1.72981
Run 5, Assertion 2 : 2.77913
Run 5, Assertion 3 : 1.02716
Whoopsies. Certainly, it seems that the FETCH FIRST 2 ROWS ONLY
clause is dangerous on this case. It even made efficiency worse than if we omit it and rely the entire end result. Nevertheless, the ROWNUM
filter helped tremendously, similar to earlier than with PostgreSQL’s LIMIT
. I’d contemplate this an optimiser bug in Oracle. FETCH FIRST
needs to be an operation that may be pushed down to varied different operations
MySQL
No LIMIT
:
-> Rows fetched earlier than execution (value=0.00..0.00 rows=1) (precise time=0.000..0.000 rows=1 loops=1)
-> Choose #2 (subquery in projection; run solely as soon as)
-> Mixture: rely(0) (value=1.35 rows=1) (precise time=0.479..0.479 rows=1 loops=1)
-> Nested loop internal be a part of (value=1.15 rows=2) (precise time=0.077..0.110 rows=56 loops=1)
-> Masking index lookup on a utilizing idx_actor_last_name (last_name="WAHLBERG") (value=0.45 rows=2) (precise time=0.059..0.061 rows=2 loops=1)
-> Masking index lookup on fa utilizing PRIMARY (actor_id=a.actor_id) (value=0.30 rows=1) (precise time=0.011..0.021 rows=28 loops=2)
With LIMIT
:
-> Rows fetched earlier than execution (value=0.00..0.00 rows=1) (precise time=0.000..0.000 rows=1 loops=1)
-> Choose #2 (subquery in projection; run solely as soon as)
-> Mixture: rely(0) (value=4.08..4.08 rows=1) (precise time=0.399..0.400 rows=1 loops=1)
-> Desk scan on t (value=2.62..3.88 rows=2) (precise time=0.394..0.394 rows=2 loops=1)
-> Materialize (value=1.35..1.35 rows=2) (precise time=0.033..0.033 rows=2 loops=1)
-> Restrict: 2 row(s) (value=1.15 rows=2) (precise time=0.024..0.025 rows=2 loops=1)
-> Nested loop internal be a part of (value=1.15 rows=2) (precise time=0.024..0.024 rows=2 loops=1)
-> Masking index lookup on a utilizing idx_actor_last_name (last_name="WAHLBERG") (value=0.45 rows=2) (precise time=0.014..0.014 rows=1 loops=1)
-> Masking index lookup on fa utilizing PRIMARY (actor_id=a.actor_id) (value=0.30 rows=1) (precise time=0.008..0.008 rows=2 loops=1)
We once more get the Nested loop internal be a part of
row with the wished distinction:
Earlier than:
Nested loop internal be a part of (value=1.15 rows=2) (precise time=0.077..0.110 rows=56 loops=1)
After:
Nested loop internal be a part of (value=1.15 rows=2) (precise time=0.024..0.024 rows=2 loops=1)
Benchmark outcomes:
Once more, the LIMIT
is useful, although the distinction is much less spectacular:
0 1 1.2933
0 2 1.0089
1 1 1.2489
1 2 1.0000 -- Quickest run is 1
2 1 1.2444
2 2 1.0933
3 1 1.2133
3 2 1.0178
4 1 1.2267
4 2 1.0178
SQL Server
No LIMIT
:
|--Compute Scalar(DEFINE:([Expr1006]=CASE WHEN [Expr1004]>=(2) THEN (1) ELSE (0) END))
|--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1010],0)))
|--Stream Mixture(DEFINE:([Expr1010]=Rely(*)))
|--Nested Loops(Inside Be part of, OUTER REFERENCES:([a].[actor_id]))
|--Desk Scan(OBJECT:([sakila].[dbo].[actor] AS [a]), WHERE:([sakila].[dbo].[actor].[last_name] as [a].[last_name]='WAHLBERG'))
|--Index Search(OBJECT:([sakila].[dbo].[film_actor].[PK__film_act__086D31FF6BE587FC] AS [fa]), SEEK:([fa].[actor_id]=[sakila].[dbo].[actor].[actor_id] as [a].[actor_id]) ORDERED FORWARD)
With LIMIT
:
|--Compute Scalar(DEFINE:([Expr1007]=CASE WHEN [Expr1005]>=(2) THEN (1) ELSE (0) END))
|--Compute Scalar(DEFINE:([Expr1005]=CONVERT_IMPLICIT(int,[Expr1011],0)))
|--Stream Mixture(DEFINE:([Expr1011]=Rely(*)))
|--Prime(TOP EXPRESSION:((2)))
|--Nested Loops(Inside Be part of, OUTER REFERENCES:([a].[actor_id]))
|--Desk Scan(OBJECT:([sakila].[dbo].[actor] AS [a]), WHERE:([sakila].[dbo].[actor].[last_name] as [a].[last_name]='WAHLBERG'))
|--Index Search(OBJECT:([sakila].[dbo].[film_actor].[PK__film_act__086D31FF6BE587FC] AS [fa]), SEEK:([fa].[actor_id]=[sakila].[dbo].[actor].[actor_id] as [a].[actor_id]) ORDERED FORWARD)
The textual content model doesn’t point out precise rows, even with SHOWPLAN_ALL
, so let’s simply take a look at what occurs within the benchmark:
Benchmark outcomes:
Run 1, Assertion 1: 1.92118
Run 1, Assertion 2: 1.00000 -- Quickest run is 1
Run 2, Assertion 1: 1.95567
Run 2, Assertion 2: 1.01724
Run 3, Assertion 1: 1.91379
Run 3, Assertion 2: 1.01724
Run 4, Assertion 1: 1.93842
Run 4, Assertion 2: 1.04926
Run 5, Assertion 1: 1.95567
Run 5, Assertion 2: 1.03448
And once more, a formidable 2x enchancment for this explicit question
Conclusion
Simply as with our earlier weblog submit about COUNT(*)
vs EXISTS
, the seemingly apparent is true once more on this case the place we need to verify if N
or extra rows exist in a question. If we blindly rely all of the rows, then we’ve seen a lot worse efficiency than if we helped the optimiser with a LIMIT
or TOP
clause, or ROWNUM
in Oracle.
Technically, an optimiser might have detected this optimisation itself, however as our earlier article about optimisations that don’t depend upon the fee mannequin has proven, optimisers don’t at all times do all the things they will.
Sadly, in Oracle’s case, the usual SQL syntax made issues slower (on this benchmark). This doesn’t imply it’s usually slower for all circumstances, however it’s one thing value searching for. There are nonetheless circumstances the place historical ROWNUM
clause is best optimised. That is a type of circumstances.
Whether or not syntax X is quicker than syntax Y could be proven by finding out execution plans (not simply with estimates, however with precise values), or by working a easy SQL benchmark. As at all times with benchmarks, watch out when decoding outcomes, double verify, attempt extra alternate options.