Sunday, January 1, 2023
HomeData ScienceMost Subarray Sum Utilizing SQL. Methods to implement an answer in a…...

Most Subarray Sum Utilizing SQL. Methods to implement an answer in a… | by Dhruv Matani | Dec, 2022


The Most Sum Subarray Sum downside asks you to seek out that subarray of an array of integers that has the best sum of parts. It’s a in style pc science programming downside normally solved in crucial programming languages equivalent to C++, Java, Python, and many others… Nonetheless, it’s attention-grabbing to take a look at how it may be implement in a declarative question language equivalent to SQL.

Photograph by Caspar Camille Rubin on Unsplash

I hope to make this the primary in a sequence of many posts that attempt to remedy conventional programming issues utilizing declarative SQL syntax. Why? As a result of it’s enjoyable!

Given an array of integers, discover the contiguous sub-array of integers (with not less than 1 factor from the unique array) that has the utmost sum of parts.

This downside’s dialogue might be discovered at quite a few locations together with:

  1. geeksforgeeks
  2. guru99
  3. scaler

All of the hyperlinks above talk about an O(n) time implementation in a standard programming language. Nonetheless, we have a look at constructing an O(n) time resolution fully in SQL (with none of the procedural extensions).

CREATE TABLE array_data(index SERIAL PRIMARY KEY, worth INTEGER NOT NULL);

INSERT INTO array_data(worth)
VALUES(2), (4), (-3), (0), (-7), (4), (1), (-2), (7), (3), (1), (-6), (3);

SELECT * FROM array_data;
Contents of the desk array_data (Picture by writer)

The plain first resolution is to generate each pair of legitimate indexes {that a} subarray can have (which is O(n²) since each index is paired with an index after it) after which for every such pair, compute the sum of values inside that vary.

Computing the sum takes linear time over the enter (which is each pair of indices), and therefore the general price of this resolution is O(n³)

WITH all_pair_sums AS (
SELECT
lhs.index AS begin_idx,
rhs.index AS end_idx,
SUM(for_values.worth) AS array_sum
FROM
array_data lhs INNER JOIN array_data rhs
ON lhs.index <= rhs.index
INNER JOIN array_data for_values
ON for_values.index BETWEEN lhs.index AND rhs.index
GROUP BY 1, 2
ORDER BY lhs.index ASC, rhs.index ASC
)

SELECT * FROM all_pair_sums
WHERE array_sum = (SELECT MAX(array_sum) FROM all_pair_sums);

Results of the O(n³) resolution in SQL (Picture by writer)

Estimated Price: The estimated price (based on EXPLAIN) for this question on an enter desk with 13 rows is 108M.

The primary optimization we will carry out is keep away from repeatedly computing the of each array beginning at a given array index (begin_idx). As an alternative, we will keep a running-sum for each subarray that begins at a given index.

This requires the usage of Window Capabilities, and reduces the associated fee to O(n²), since we’re processing each index for each array index earlier than it.

WITH all_pair_sums AS (
SELECT
lhs.index AS begin_idx,
rhs.index AS end_idx,
SUM(rhs.worth) OVER (PARTITION BY lhs.index ORDER BY rhs.index ASC) AS array_sum
FROM
array_data lhs INNER JOIN array_data rhs
ON lhs.index <= rhs.index
ORDER BY lhs.index ASC, rhs.index ASC
),

with_max AS (
SELECT
begin_idx,
end_idx,
array_sum,
MAX(array_sum) OVER() AS max_subarray_sum
FROM all_pair_sums
)

SELECT begin_idx, end_idx, array_sum
FROM with_max
WHERE array_sum = max_subarray_sum;

Right here, the array_sum column stroes the operating sum of an array beginning at index begin_idx and ending at end_idx.

Results of the O(n²) resolution in SQL (Picture by writer)

Estimated Price: The estimated price (based on EXPLAIN) for this question on an enter desk with 13 rows is 371k. That’s a 99.65% discount in comparison with the earlier method.

Statement: To seek out the sum of a subarray beginning at index i and ending at index j, we will discover the sum of the subarray [0..j] and from it subtract the worth of the sum of the subarray [0..i-1]. In different phrases, sum[i..j] = sum[0..j] — sum[0..i-1].

Sum of the subarray parts listed [2..9] computed by performing sum([0..9]) — sum([0..1]) (Picture by writer)

The column array_sum above shops the worth of the sum of parts of the subarray [0..begin_idx].

The ultimate code for this resolution is proven under.

WITH running_sum AS (
SELECT
index AS begin_idx,
worth,
SUM(worth) OVER (ORDER BY index ASC) AS array_sum
FROM
(SELECT 0 AS index, 0 AS worth
UNION ALL
SELECT * FROM array_data
) AS temp
ORDER BY index ASC
),

running_sum_with_min AS (
SELECT
begin_idx,
worth,
array_sum,
MIN(array_sum) OVER(ORDER BY begin_idx ASC) AS min_sum_so_far
FROM running_sum
),

sum_of_subarray AS (
SELECT
begin_idx,
worth,
array_sum,
min_sum_so_far,
array_sum - LAG(min_sum_so_far, 1) OVER(ORDER BY begin_idx ASC) AS subarray_sum
FROM running_sum_with_min
),

max_sum_of_subarray AS (
SELECT
begin_idx,
worth,
array_sum,
min_sum_so_far,
subarray_sum,
MAX(subarray_sum) OVER() AS max_subarray_sum
FROM sum_of_subarray
)

SELECT *
FROM max_sum_of_subarray
WHERE subarray_sum = max_subarray_sum;

In case you print the intermediate desk max_sum_of_subarray, that is what it seems like. The row highlighted in purple is the one which holds the worth of the contiguous subarray with the biggest sum.

The max_sum_of_subarray desk (Picture by writer)

One disadvantage of this resolution is that whereas we’ve the top index, we don’t have the beginning index. This may be trivially fetched by operating one other question on the intermediate desk to seek for the primary incidence of the worth -4.

This subarray is the one between indexes [6..11]. The worth of the sum of parts on this subarray is 10 — (-4) =14.

Word that we insert a dummy row with (index, worth) = (0, 0) to make sure that the LAG operation for the primary legitimate row (i.e. row with index = 1) doesn’t find yourself producing a NULL worth, and therefore leading to an incorrect consequence.

Estimated Price: The estimated price (based on EXPLAIN) for this question on an enter desk with 13 rows is 695. That’s a 99.99% discount in comparison with the primary method, and a 99.81% discount in comparison with the 2nd method.

Why is that this resolution mentioned to be O(n) although there’s an ORDER BY clause in a number of CTEs within the SQL for this resolution?

The unique enter desk’s index column is a main key column (therefore distinctive). PostgreSQL’s default index kind is B-Tree, permits doing a sorted full desk scan, which is what we’re mainly doing in all of the CTEs on this resolution. Therefore, we anticipate that the optimizer will discover this (coupled with the truth that intermediate tables additionally come out sorted) and keep away from the type operation fully. Even whether it is unable to infer this, the operating time regresses to O(n log n), however not worse than that.

The SQL Fiddle hyperlink to all of the options on this submit might be discovered right here.

We noticed quite a few methods of fixing the identical downside in SQL, with various ranges of effectivity. Though the final resolution has probably the most quantity of code, it’s probably the most environment friendly when it comes to operating time and CPU utilization. Breaking down the options into smaller CTEs helps code readability and maintainability.

In case you’d prefer to see extra standard programming issues solved utilizing declarative SQL syntax, please add a remark and I’ll attempt to get to it if potential or write why it might be exhausting to do.

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
Google search engine

Most Popular

Recent Comments