Skip to content

Full outer join returning more rows than expected #4252

@fespino

Description

@fespino

What happens?

I have two subqueries joined with a full outer join which returns more rows than it should. When I swapped the order of the two subqueries being joined the results are correct even though the results for each subquery are the same. The same query against Postgres returns the expected result (and is not affected by the order of the subqueries).

To Reproduce

Note: The queries are more convoluted than it should because they are being generated dynamically as part of a POC I’ve been working on.

The data

CREATE TABLE df1(day DATE, value INTEGER, organization VARCHAR);
INSERT INTO df1 VALUES 
    ('2022-01-01', 10, 'org1'),
    ('2022-01-05', 20, 'org2'),
    ('2022-01-10', 30, 'org3');

CREATE TABLE df2(day DATE, value INTEGER, organization VARCHAR);
INSERT INTO df2 VALUES 
    ('2022-01-01', 100, 'org1'),
    ('2022-09-01', 200, 'org2'),
    ('2022-03-01', 300, 'org3');

CREATE TABLE df3(day DATE, value INTEGER, organization VARCHAR);
INSERT INTO df3 VALUES
    ('2022-01-02',  1000, 'org1'),
    ('2022-02-03',  2000, 'org2'),
    ('2022-04-01',  3000, 'org3');

The bad query

  • Expected results: [(2022-01-01, 1100.0)]
  • Actual results: [(2022-01-01, 1000.0), (2022-01-01, 1100.0)]
SELECT 
    coalesce(anon_1.month, anon_2.month) AS month, 
    coalesce(coalesce(CAST(anon_1.value AS REAL), 0.0) + coalesce(CAST(anon_2.value AS REAL), 0.0), 0.0) AS value
FROM (
    SELECT coalesce(anon_3.month, anon_4.month) AS month, 
    coalesce(coalesce(CAST(anon_3.value AS REAL), 0.0) + coalesce(CAST(anon_4.value AS REAL), 0.0), 0.0) AS value
    FROM (
        SELECT month AS month, sum(anon_5.value) AS value
        FROM (
            SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
            FROM df1
            WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00' 
            AND CAST(day AS DATE) <= '2022-01-31 00:00:00' 
            AND (organization ILIKE 'org4') 
            GROUP BY date_trunc('month', day)
        ) AS anon_5 
        GROUP BY GROUPING SETS((month))
    ) AS anon_3 
    FULL OUTER JOIN (
        SELECT month AS month, sum(anon_6.value) AS value
        FROM (
            SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
            FROM df2
            WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00' 
            AND CAST(day AS DATE) <= '2022-01-31 00:00:00' 
            GROUP BY date_trunc('month', day)
        ) AS anon_6 
        GROUP BY GROUPING SETS((month))
    ) AS anon_4 ON anon_3.month = anon_4.month
) AS anon_1 
FULL OUTER JOIN (
    SELECT month AS month, sum(anon_7.value) AS value
    FROM (
        SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
        FROM df3
        WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00' 
        AND CAST(day AS DATE) <= '2022-01-31 00:00:00' 
        GROUP BY date_trunc('month', day)
    ) AS anon_7 
    GROUP BY GROUPING SETS((month))
) AS anon_2 ON anon_1.month = anon_2.month

I've been playing around with the query trying to understand what is going on and found out that if I change the (organization ILIKE 'org4') filter in anon_5 for one that returns at least one result (e.g. org1) the final result is correct. I include the query here for convenience:

  • Expected results: [(2022-01-01, 1110.0)]
  • Actual results: [(2022-01-01, 1110.0)] (correct)
SELECT 
    coalesce(anon_1.month, anon_2.month) AS month, 
    coalesce(coalesce(CAST(anon_1.value AS REAL), 0.0) + coalesce(CAST(anon_2.value AS REAL), 0.0), 0.0) AS value
FROM (
    SELECT coalesce(anon_3.month, anon_4.month) AS month, 
    coalesce(coalesce(CAST(anon_3.value AS REAL), 0.0) + coalesce(CAST(anon_4.value AS REAL), 0.0), 0.0) AS value
    FROM (
        SELECT month AS month, sum(anon_5.value) AS value
        FROM (
            SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
            FROM df1
            WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00' 
            AND CAST(day AS DATE) <= '2022-01-31 00:00:00' 
            AND (organization ILIKE 'org1') 
            GROUP BY date_trunc('month', day)
        ) AS anon_5 
        GROUP BY GROUPING SETS((month))
    ) AS anon_3 
    FULL OUTER JOIN (
        SELECT month AS month, sum(anon_6.value) AS value
        FROM (
            SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
            FROM df2
            WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00' 
            AND CAST(day AS DATE) <= '2022-01-31 00:00:00' 
            GROUP BY date_trunc('month', day)
        ) AS anon_6 
        GROUP BY GROUPING SETS((month))
    ) AS anon_4 ON anon_3.month = anon_4.month
) AS anon_1 
FULL OUTER JOIN (
    SELECT month AS month, sum(anon_7.value) AS value
    FROM (
        SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
        FROM df3
        WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00' 
        AND CAST(day AS DATE) <= '2022-01-31 00:00:00' 
        GROUP BY date_trunc('month', day)
    ) AS anon_7 
    GROUP BY GROUPING SETS((month))
) AS anon_2 ON anon_1.month = anon_2.month

Also, if I swap the order in which anon_3 and anon_4 subqueries are joined (keeping the original org4 filter) the results are also correct.

  • Expected results: [(2022-01-01, 1100.0)]
  • Actual results: [(2022-01-01, 1100.0)] (correct)
SELECT 
    coalesce(anon_1.month, anon_2.month) AS month, 
    coalesce(coalesce(CAST(anon_1.value AS REAL), 0.0) + coalesce(CAST(anon_2.value AS REAL), 0.0), 0.0) AS value
FROM (
    SELECT coalesce(anon_3.month, anon_4.month) AS month, 
    coalesce(coalesce(CAST(anon_3.value AS REAL), 0.0) + coalesce(CAST(anon_4.value AS REAL), 0.0), 0.0) AS value
    FROM (
        SELECT month AS month, sum(anon_6.value) AS value
        FROM (
            SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
            FROM df2
            WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00' 
            AND CAST(day AS DATE) <= '2022-01-31 00:00:00' 
            GROUP BY date_trunc('month', day)
        ) AS anon_6 
        GROUP BY GROUPING SETS((month))
    ) AS anon_3 
    FULL OUTER JOIN (
        SELECT month AS month, sum(anon_5.value) AS value
        FROM (
            SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
            FROM df1
            WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00' 
            AND CAST(day AS DATE) <= '2022-01-31 00:00:00' 
            AND (organization ILIKE 'org4') 
            GROUP BY date_trunc('month', day)
        ) AS anon_5 
        GROUP BY GROUPING SETS((month))
    ) AS anon_4 ON anon_3.month = anon_4.month
) AS anon_1 
FULL OUTER JOIN (
    SELECT month AS month, sum(anon_7.value) AS value
    FROM (
        SELECT date_trunc('month', day) AS month, coalesce(sum(value), 0.0) AS value
        FROM df3
        WHERE CAST(day AS DATE) >= '2022-01-01 00:00:00' 
        AND CAST(day AS DATE) <= '2022-01-31 00:00:00' 
        GROUP BY date_trunc('month', day)
    ) AS anon_7 
    GROUP BY GROUPING SETS((month))
) AS anon_2 ON anon_1.month = anon_2.month

So, I think that having no results from anon_5 at the left side of the outer join is messing with the final result somehow. Also, the EXPLAIN ANALYZE diagram is not coherent with the results I am getting from the first (bad) query. According to it I should be getting 1 resulting row not 2.

OS:

Docker: python:3.9-slim-bullseye - Host: macOS 11.5.2 10. (Tried also directly in the host with same results)

DuckDB Version:

0.4.0

DuckDB Client:

Python

Full Name:

Francisco Espino Romero

Affiliation:

Cofi

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions