-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Description
What happens?
Issue
A query that should return a syntax error is run when written in a particular way, using the using an extremely high memory usage (in relation to the input data), and starts spilling to disk.
Detail
The following aggregation query is used in a CTE, and the field name is mis-spelled (it doesn't exist in the FROM clause):
payments_made AS (
SELECT
invoice_id,
SUM(amount -- this field does not exist in `payments`, but exists in `invoices`
* COALESCE(exchange_rate, 1)) AS total_paid_amount
FROM payments
GROUP BY invoice_id
),
when joining this CTE to another table, which contains the mis-spelled field, no syntax error is thrown:
FROM invoices
LEFT JOIN payments_made
ON payments_made.invoice_id = invoices.invoice_id
instead, the query starts running and using an extremely high memory usage (in relation to the input data), and starts spilling to disk.
As an example, an input table with around 100,000 records (less than 10 MB on disk) was using ~60GB RAM (all available RAM in the system) and started spilling to disk using more than 500GB before it was manually interrupted.
Error conditions
This issue only occurs when the aggregation function includes * COALESCE(exchange_rate, 1)
inside the sum function. Changing it to * 1
, or removing it, causes the engine to return the following error (which is expected):
Binder Error: LATERAL join cannot contain aggregates!
Also, changing the order of the join causes the engine to identify the syntax error (expected behavior):
FROM payments_made
RIGHT JOIN invoices
ON payments_made.invoice_id = invoices.invoice_id
To Reproduce
Reproducible example
The following create a reproducible example with synthetic data.
**Be careful when running it as it can crash the system or fill up the entire disk. **
For my tests, I kept the Activity Monitor window up to see the amount of RAM being used and interrupted it with Ctrl+C
when it started using all the available RAM.
Setting up the input data
-- Create test tables
CREATE TABLE invoices (
invoice_id VARCHAR,
amount DECIMAL(10,2)
);
CREATE TABLE payments (
invoice_id VARCHAR,
exchange_rate DECIMAL(10,2),
paid_amount DECIMAL(10,2) -- This is our amt_impuesto_trasladado_realizado
);
-- Generate invoices with sequential IDs
INSERT INTO invoices
SELECT
'INV' || LPAD(i::VARCHAR, 6, '0'),
RANDOM() * 10000
FROM RANGE(1, 100001) t(i);
-- Generate 3 payment relations per invoice on average (approx. 3,000 rows)
-- This will make the cartesian product somewhat expensive
INSERT INTO payments
SELECT
'INV' || LPAD((i/3 + 1)::VARCHAR, 6, '0'), -- Maps roughly 3 payments to each invoice
1 + (RANDOM() * 0.2), -- Exchange rate between 1.0 and 1.2
(RANDOM() * 500) -- Random paid amount
FROM RANGE(1, 30001) t(i);
Problematic query
Here is a sample query plan of how the query is interpreted:
https://db.cs.uni-tuebingen.de/explain/plan/3c787165b895a1d4
-- this runs but uses a lot of memory and spills to disk.
WITH
payments_made AS (
SELECT
invoice_id,
SUM(amount -- this field does not exist in `payments`, but exists in `invoices`
* COALESCE(exchange_rate, 1)) AS total_paid_amount
FROM payments
GROUP BY invoice_id
),
final_invoices AS (
SELECT
invoices.invoice_id
FROM invoices
LEFT JOIN payments_made
ON payments_made.invoice_id = invoices.invoice_id
)
SELECT * FROM final_invoices;
This throws an expected error
The only change in this query vs the previous one is that * COALESCE(exchange_rate, 1)
is not used
--this throws an error
-- Binder Error: LATERAL join cannot contain aggregates!
WITH
payments_made AS (
SELECT
invoice_id,
SUM(amount) AS total_paid_amount -- amount doesn't exist in payments
FROM payments
GROUP BY invoice_id
),
final_invoices AS (
SELECT
invoices.invoice_id
FROM invoices
LEFT JOIN payments_made
ON payments_made.invoice_id = invoices.invoice_id
)
SELECT * FROM final_invoices;
This also throws an error
-- Changing the order of the join triggers the expected error
-- Binder Error: Referenced column "amount" not found in FROM clause!
WITH
payments_made AS (
SELECT
invoice_id,
SUM(amount -- this field does not exist in `payments`, but exists in `invoices`
* COALESCE(exchange_rate, 1)) AS total_paid_amount
FROM payments
GROUP BY invoice_id
),
final_invoices AS (
SELECT
invoices.invoice_id
FROM payments_made
RIGHT JOIN invoices
ON payments_made.invoice_id = invoices.invoice_id
)
SELECT * FROM final_invoices;
OS:
MacOS aarch64
DuckDB Version:
v1.1.0
DuckDB Client:
CLI
Hardware:
No response
Full Name:
Miguel Garcia
Affiliation:
AfterSync
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have not tested with any build
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have