Query with syntax error in CTE causes cartesian product and runs with high memory consumption

### What happens?

## Issue
A query that should return a syntax error is run when written in a particular way, using the using an extremely high memory usage (in relation to the input data), and starts spilling to disk.

## Detail

The following aggregation query is used in a CTE, and the field name is mis-spelled (it doesn't exist in the FROM clause):
```sql
payments_made AS (
  SELECT 
    invoice_id,
    SUM(amount  -- this field does not exist in `payments`, but exists in `invoices`
      * COALESCE(exchange_rate, 1)) AS total_paid_amount
  FROM payments
  GROUP BY invoice_id
),
```

when joining this CTE to another table, which contains the mis-spelled field, no syntax error is thrown:

```sql
FROM invoices
  LEFT JOIN payments_made
    ON payments_made.invoice_id = invoices.invoice_id
```

instead, the query starts running and using an extremely high memory usage (in relation to the input data), and starts spilling to disk.

As an example, an input table with around 100,000 records (less than 10 MB on disk) was using ~60GB RAM (all available RAM in the system) and started spilling to disk using more than 500GB before it was manually interrupted.


## Error conditions

This issue only occurs when the aggregation function includes `* COALESCE(exchange_rate, 1)` inside the sum function. Changing it to `* 1`, or removing it, causes the engine to return the following error (which is expected):
```
Binder Error: LATERAL join cannot contain aggregates!
```

Also, changing the order of the join causes the engine to identify the syntax error (expected behavior):
```
FROM payments_made
  RIGHT JOIN invoices
    ON payments_made.invoice_id = invoices.invoice_id
```


### To Reproduce


## Reproducible example
The following create a reproducible example with synthetic data.

**Be careful when running it as it can crash the system or fill up the entire disk. **

For my tests, I kept the Activity Monitor window up to see the amount of RAM being used and interrupted it with `Ctrl+C` when it started using all the available RAM.

### Setting up the input data
```sql
-- Create test tables
CREATE TABLE invoices (
  invoice_id VARCHAR,
  amount DECIMAL(10,2)
);

CREATE TABLE payments (
  invoice_id VARCHAR,
  exchange_rate DECIMAL(10,2),
  paid_amount DECIMAL(10,2)  -- This is our amt_impuesto_trasladado_realizado
);

-- Generate invoices with sequential IDs
INSERT INTO invoices 
SELECT 
  'INV' || LPAD(i::VARCHAR, 6, '0'),
  RANDOM() * 10000
FROM RANGE(1, 100001) t(i);

-- Generate 3 payment relations per invoice on average (approx. 3,000 rows)
-- This will make the cartesian product somewhat expensive
INSERT INTO payments
SELECT
  'INV' || LPAD((i/3 + 1)::VARCHAR, 6, '0'),  -- Maps roughly 3 payments to each invoice
  1 + (RANDOM() * 0.2),  -- Exchange rate between 1.0 and 1.2
  (RANDOM() * 500)  -- Random paid amount
FROM RANGE(1, 30001) t(i);
```

### Problematic query

Here is a sample query plan of how the query is interpreted:
https://db.cs.uni-tuebingen.de/explain/plan/3c787165b895a1d4

```sql
-- this runs but uses a lot of memory and spills to disk.
WITH 
payments_made AS (
  SELECT 
    invoice_id,
    SUM(amount  -- this field does not exist in `payments`, but exists in `invoices`
      * COALESCE(exchange_rate, 1)) AS total_paid_amount
  FROM payments
  GROUP BY invoice_id
),
final_invoices AS (
  SELECT
    invoices.invoice_id
  FROM invoices
  LEFT JOIN payments_made
    ON payments_made.invoice_id = invoices.invoice_id
)
SELECT * FROM final_invoices;
```


### This throws an expected error
The only change in this query vs the previous one is that `* COALESCE(exchange_rate, 1)` is not used
```sql
--this throws an error
-- Binder Error: LATERAL join cannot contain aggregates!
WITH 
payments_made AS (
  SELECT 
    invoice_id,
    SUM(amount) AS total_paid_amount  -- amount doesn't exist in payments
  FROM payments
  GROUP BY invoice_id
),
final_invoices AS (
  SELECT
    invoices.invoice_id
  FROM invoices
  LEFT JOIN payments_made
    ON payments_made.invoice_id = invoices.invoice_id
)
SELECT * FROM final_invoices;

```
### This also throws an error

```sql

-- Changing the order of the join triggers the expected error
-- Binder Error: Referenced column "amount" not found in FROM clause!
WITH 
payments_made AS (
  SELECT 
    invoice_id,
    SUM(amount  -- this field does not exist in `payments`, but exists in `invoices`
      * COALESCE(exchange_rate, 1)) AS total_paid_amount
  FROM payments
  GROUP BY invoice_id
),
final_invoices AS (
  SELECT
    invoices.invoice_id
  FROM payments_made
  RIGHT JOIN invoices
    ON payments_made.invoice_id = invoices.invoice_id
)
SELECT * FROM final_invoices;
```

### OS:

MacOS aarch64

### DuckDB Version:

v1.1.0

### DuckDB Client:

CLI

### Hardware:

_No response_

### Full Name:

Miguel Garcia

### Affiliation:

AfterSync

### What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have not tested with any build

### Did you include all relevant data sets for reproducing the issue?

Yes

### Did you include all code required to reproduce the issue?

- [x] Yes, I have

### Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

- [x] Yes, I have

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Query with syntax error in CTE causes cartesian product and runs with high memory consumption #17456

What happens?

Issue

Detail

Error conditions

To Reproduce

Reproducible example

Setting up the input data

Problematic query

This throws an expected error

This also throws an error

OS:

DuckDB Version:

DuckDB Client:

Hardware:

Full Name:

Affiliation:

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

Did you include all relevant data sets for reproducing the issue?

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Query with syntax error in CTE causes cartesian product and runs with high memory consumption #17456

Description

What happens?

Issue

Detail

Error conditions

To Reproduce

Reproducible example

Setting up the input data

Problematic query

This throws an expected error

This also throws an error

OS:

DuckDB Version:

DuckDB Client:

Hardware:

Full Name:

Affiliation:

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

Did you include all relevant data sets for reproducing the issue?

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions