Skip to content

Query with syntax error in CTE causes cartesian product and runs with high memory consumption #17456

@mikegarciam

Description

@mikegarciam

What happens?

Issue

A query that should return a syntax error is run when written in a particular way, using the using an extremely high memory usage (in relation to the input data), and starts spilling to disk.

Detail

The following aggregation query is used in a CTE, and the field name is mis-spelled (it doesn't exist in the FROM clause):

payments_made AS (
  SELECT 
    invoice_id,
    SUM(amount  -- this field does not exist in `payments`, but exists in `invoices`
      * COALESCE(exchange_rate, 1)) AS total_paid_amount
  FROM payments
  GROUP BY invoice_id
),

when joining this CTE to another table, which contains the mis-spelled field, no syntax error is thrown:

FROM invoices
  LEFT JOIN payments_made
    ON payments_made.invoice_id = invoices.invoice_id

instead, the query starts running and using an extremely high memory usage (in relation to the input data), and starts spilling to disk.

As an example, an input table with around 100,000 records (less than 10 MB on disk) was using ~60GB RAM (all available RAM in the system) and started spilling to disk using more than 500GB before it was manually interrupted.

Error conditions

This issue only occurs when the aggregation function includes * COALESCE(exchange_rate, 1) inside the sum function. Changing it to * 1, or removing it, causes the engine to return the following error (which is expected):

Binder Error: LATERAL join cannot contain aggregates!

Also, changing the order of the join causes the engine to identify the syntax error (expected behavior):

FROM payments_made
  RIGHT JOIN invoices
    ON payments_made.invoice_id = invoices.invoice_id

To Reproduce

Reproducible example

The following create a reproducible example with synthetic data.

**Be careful when running it as it can crash the system or fill up the entire disk. **

For my tests, I kept the Activity Monitor window up to see the amount of RAM being used and interrupted it with Ctrl+C when it started using all the available RAM.

Setting up the input data

-- Create test tables
CREATE TABLE invoices (
  invoice_id VARCHAR,
  amount DECIMAL(10,2)
);

CREATE TABLE payments (
  invoice_id VARCHAR,
  exchange_rate DECIMAL(10,2),
  paid_amount DECIMAL(10,2)  -- This is our amt_impuesto_trasladado_realizado
);

-- Generate invoices with sequential IDs
INSERT INTO invoices 
SELECT 
  'INV' || LPAD(i::VARCHAR, 6, '0'),
  RANDOM() * 10000
FROM RANGE(1, 100001) t(i);

-- Generate 3 payment relations per invoice on average (approx. 3,000 rows)
-- This will make the cartesian product somewhat expensive
INSERT INTO payments
SELECT
  'INV' || LPAD((i/3 + 1)::VARCHAR, 6, '0'),  -- Maps roughly 3 payments to each invoice
  1 + (RANDOM() * 0.2),  -- Exchange rate between 1.0 and 1.2
  (RANDOM() * 500)  -- Random paid amount
FROM RANGE(1, 30001) t(i);

Problematic query

Here is a sample query plan of how the query is interpreted:
https://db.cs.uni-tuebingen.de/explain/plan/3c787165b895a1d4

-- this runs but uses a lot of memory and spills to disk.
WITH 
payments_made AS (
  SELECT 
    invoice_id,
    SUM(amount  -- this field does not exist in `payments`, but exists in `invoices`
      * COALESCE(exchange_rate, 1)) AS total_paid_amount
  FROM payments
  GROUP BY invoice_id
),
final_invoices AS (
  SELECT
    invoices.invoice_id
  FROM invoices
  LEFT JOIN payments_made
    ON payments_made.invoice_id = invoices.invoice_id
)
SELECT * FROM final_invoices;

This throws an expected error

The only change in this query vs the previous one is that * COALESCE(exchange_rate, 1) is not used

--this throws an error
-- Binder Error: LATERAL join cannot contain aggregates!
WITH 
payments_made AS (
  SELECT 
    invoice_id,
    SUM(amount) AS total_paid_amount  -- amount doesn't exist in payments
  FROM payments
  GROUP BY invoice_id
),
final_invoices AS (
  SELECT
    invoices.invoice_id
  FROM invoices
  LEFT JOIN payments_made
    ON payments_made.invoice_id = invoices.invoice_id
)
SELECT * FROM final_invoices;

This also throws an error

-- Changing the order of the join triggers the expected error
-- Binder Error: Referenced column "amount" not found in FROM clause!
WITH 
payments_made AS (
  SELECT 
    invoice_id,
    SUM(amount  -- this field does not exist in `payments`, but exists in `invoices`
      * COALESCE(exchange_rate, 1)) AS total_paid_amount
  FROM payments
  GROUP BY invoice_id
),
final_invoices AS (
  SELECT
    invoices.invoice_id
  FROM payments_made
  RIGHT JOIN invoices
    ON payments_made.invoice_id = invoices.invoice_id
)
SELECT * FROM final_invoices;

OS:

MacOS aarch64

DuckDB Version:

v1.1.0

DuckDB Client:

CLI

Hardware:

No response

Full Name:

Miguel Garcia

Affiliation:

AfterSync

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have not tested with any build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions