-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Description
What happens?
I have two subqueries joined with a full outer join which returns more rows than it should. When I swapped the order of the two subqueries being joined the results are correct even though the results for each subquery are the same. The same query against Postgres returns the expected result (and is not affected by the order of the subqueries).
To Reproduce
Note: The queries are more convoluted than it should because they are being generated dynamically as part of a POC I’ve been working on.
The data
CREATE TABLE df1(day DATE, value INTEGER, organization VARCHAR);
INSERT INTO df1 VALUES
('2022-01-01', 10, 'org1'),
('2022-01-05', 20, 'org2'),
('2022-01-10', 30, 'org3');
CREATE TABLE df2(day DATE, value INTEGER, organization VARCHAR);
INSERT INTO df2 VALUES
('2022-01-01', 100, 'org1'),
('2022-09-01', 200, 'org2'),
('2022-03-01', 300, 'org3');
CREATE TABLE df3(day DATE, value INTEGER, organization VARCHAR);
INSERT INTO df3 VALUES
('2022-01-02', 1000, 'org1'),
('2022-02-03', 2000, 'org2'),
('2022-04-01', 3000, 'org3');
The bad query
- Expected results:
[(2022-01-01, 1100.0)]
- Actual results:
[(2022-01-01, 1000.0), (2022-01-01, 1100.0)]
SELECT
coalesce(anon_1.month, anon_2.month) AS month,
coalesce(coalesce(CAST(anon_1.value AS REAL), 0.0) + coalesce(CAST(anon_2.value AS REAL), 0.0), 0.0) AS value
FROM (
SELECT coalesce(anon_3.month, anon_4.month) AS month,
coalesce(coalesce(CAST(anon_3.value AS REAL), 0.0) + coalesce(CAST(anon_4.value AS REAL), 0.0), 0.0) AS value
FROM (
SELECT month AS month, sum(anon_5.value) AS value
FROM (
SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
FROM df1
WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00'
AND CAST(day AS DATE) <= '2022-01-31 00:00:00'
AND (organization ILIKE 'org4')
GROUP BY date_trunc('month', day)
) AS anon_5
GROUP BY GROUPING SETS((month))
) AS anon_3
FULL OUTER JOIN (
SELECT month AS month, sum(anon_6.value) AS value
FROM (
SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
FROM df2
WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00'
AND CAST(day AS DATE) <= '2022-01-31 00:00:00'
GROUP BY date_trunc('month', day)
) AS anon_6
GROUP BY GROUPING SETS((month))
) AS anon_4 ON anon_3.month = anon_4.month
) AS anon_1
FULL OUTER JOIN (
SELECT month AS month, sum(anon_7.value) AS value
FROM (
SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
FROM df3
WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00'
AND CAST(day AS DATE) <= '2022-01-31 00:00:00'
GROUP BY date_trunc('month', day)
) AS anon_7
GROUP BY GROUPING SETS((month))
) AS anon_2 ON anon_1.month = anon_2.month
I've been playing around with the query trying to understand what is going on and found out that if I change the (organization ILIKE 'org4')
filter in anon_5
for one that returns at least one result (e.g. org1
) the final result is correct. I include the query here for convenience:
- Expected results:
[(2022-01-01, 1110.0)]
- Actual results:
[(2022-01-01, 1110.0)]
(correct)
SELECT
coalesce(anon_1.month, anon_2.month) AS month,
coalesce(coalesce(CAST(anon_1.value AS REAL), 0.0) + coalesce(CAST(anon_2.value AS REAL), 0.0), 0.0) AS value
FROM (
SELECT coalesce(anon_3.month, anon_4.month) AS month,
coalesce(coalesce(CAST(anon_3.value AS REAL), 0.0) + coalesce(CAST(anon_4.value AS REAL), 0.0), 0.0) AS value
FROM (
SELECT month AS month, sum(anon_5.value) AS value
FROM (
SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
FROM df1
WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00'
AND CAST(day AS DATE) <= '2022-01-31 00:00:00'
AND (organization ILIKE 'org1')
GROUP BY date_trunc('month', day)
) AS anon_5
GROUP BY GROUPING SETS((month))
) AS anon_3
FULL OUTER JOIN (
SELECT month AS month, sum(anon_6.value) AS value
FROM (
SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
FROM df2
WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00'
AND CAST(day AS DATE) <= '2022-01-31 00:00:00'
GROUP BY date_trunc('month', day)
) AS anon_6
GROUP BY GROUPING SETS((month))
) AS anon_4 ON anon_3.month = anon_4.month
) AS anon_1
FULL OUTER JOIN (
SELECT month AS month, sum(anon_7.value) AS value
FROM (
SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
FROM df3
WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00'
AND CAST(day AS DATE) <= '2022-01-31 00:00:00'
GROUP BY date_trunc('month', day)
) AS anon_7
GROUP BY GROUPING SETS((month))
) AS anon_2 ON anon_1.month = anon_2.month
Also, if I swap the order in which anon_3
and anon_4
subqueries are joined (keeping the original org4
filter) the results are also correct.
- Expected results:
[(2022-01-01, 1100.0)]
- Actual results:
[(2022-01-01, 1100.0)]
(correct)
SELECT
coalesce(anon_1.month, anon_2.month) AS month,
coalesce(coalesce(CAST(anon_1.value AS REAL), 0.0) + coalesce(CAST(anon_2.value AS REAL), 0.0), 0.0) AS value
FROM (
SELECT coalesce(anon_3.month, anon_4.month) AS month,
coalesce(coalesce(CAST(anon_3.value AS REAL), 0.0) + coalesce(CAST(anon_4.value AS REAL), 0.0), 0.0) AS value
FROM (
SELECT month AS month, sum(anon_6.value) AS value
FROM (
SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
FROM df2
WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00'
AND CAST(day AS DATE) <= '2022-01-31 00:00:00'
GROUP BY date_trunc('month', day)
) AS anon_6
GROUP BY GROUPING SETS((month))
) AS anon_3
FULL OUTER JOIN (
SELECT month AS month, sum(anon_5.value) AS value
FROM (
SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
FROM df1
WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00'
AND CAST(day AS DATE) <= '2022-01-31 00:00:00'
AND (organization ILIKE 'org4')
GROUP BY date_trunc('month', day)
) AS anon_5
GROUP BY GROUPING SETS((month))
) AS anon_4 ON anon_3.month = anon_4.month
) AS anon_1
FULL OUTER JOIN (
SELECT month AS month, sum(anon_7.value) AS value
FROM (
SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
FROM df3
WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00'
AND CAST(day AS DATE) <= '2022-01-31 00:00:00'
GROUP BY date_trunc('month', day)
) AS anon_7
GROUP BY GROUPING SETS((month))
) AS anon_2 ON anon_1.month = anon_2.month
So, I think that having no results from anon_5 at the left side of the outer join is messing with the final result somehow. Also, the EXPLAIN ANALYZE
diagram is not coherent with the results I am getting from the first (bad) query. According to it I should be getting 1 resulting row not 2.
OS:
Docker: python:3.9-slim-bullseye - Host: macOS 11.5.2 10. (Tried also directly in the host with same results)
DuckDB Version:
0.4.0
DuckDB Client:
Python
Full Name:
Francisco Espino Romero
Affiliation:
Cofi
Have you tried this on the latest master
branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree