UUID Comparison in Aggregation Filter broken on Linux (v1.3.0 Regression)

### What happens?

The aggregation filter yields different results between MacOS and Linux. On MacOS, the filter works as expected. On Linux, for roughly 10% of the input it doesn't work correctly.

I have this table:
```sql
-- create codeshares table
CREATE TABLE lh_operating_codeshares (
    operatingAirlineId UUID NOT NULL,
    operatingFlightNumber USMALLINT NOT NULL,
    operatingSuffix TEXT NOT NULL,
    departureAirportId UUID NOT NULL,
    departureDateLocal DATE NOT NULL,
    codeShares STRUCT(airline_id UUID, number USMALLINT, suffix TEXT)[] NOT NULL,
    CHECK ( TO_JSON(codeShares) = TO_JSON(LIST_SORT(LIST_DISTINCT(codeShares))) )
) ;
```

Where I insert all aggregated codeshare flight numbers (or any empty list, if a flight does not have any):
```sql
-- insert codeshares
INSERT INTO lh_operating_codeshares (
    operatingAirlineId,
    operatingFlightNumber,
    operatingSuffix,
    departureAirportId,
    departureDateLocal,
    codeShares
)
SELECT
    operatingAirlineId,
    operatingFlightNumber,
    operatingSuffix,
    departureAirportId,
    departureDateLocal,
    LIST_SORT(
        LIST_DISTINCT(
            COALESCE(
                ARRAY_AGG({
                    'airline_id': airlineId,
                    'number': flightNumber,
                    'suffix': suffix
                }) FILTER (
                    airlineId IS NOT NULL
                    AND ( airlineId != operatingAirlineId OR flightNumber != operatingFlightNumber OR suffix != operatingSuffix )
                ),
                []
            )
        )
    ) AS codeShares
FROM lh_all_flights_with_ids
GROUP BY
    operatingAirlineId,
    operatingFlightNumber,
    operatingSuffix,
    departureAirportId,
    departureDateLocal
;
```

The filter:
```sql
FILTER (
                    airlineId IS NOT NULL
                    AND ( airlineId != operatingAirlineId OR flightNumber != operatingFlightNumber OR suffix != operatingSuffix )
                )
```

should ensure that the aggregated list does not contain the operating flight number itself. On MacOS, this works as expected. On Linux, about 10% of the rows still contain the operating flight number itself in the aggregated list.

As a workaround, I've added the following update afterwards:
```sql
-- harden codeshares, for some reason the filter doesnt work on all platforms(?)
UPDATE lh_operating_codeshares
SET codeShares = LIST_SORT(LIST_DISTINCT(LIST_FILTER(
    codeShares,
    lambda cs: ( cs.airline_id != operatingAirlineId OR cs.number != operatingFlightNumber OR cs.suffix != operatingSuffix )
)))
WHERE LENGTH(LIST_FILTER(
    codeShares,
    lambda cs: ( cs.airline_id = operatingAirlineId AND cs.number = operatingFlightNumber AND cs.suffix = operatingSuffix )
)) > 0 ;
```

Which, as expected, never touches any rows when executed on MacOS, but touches about 10% of the rows when I run the process on Linux.

### To Reproduce
Create a docker image using this Dockerfile:
```Dockerfile
FROM amazonlinux:2023
RUN yum install -y gzip
RUN curl https://install.duckdb.org | sh
ENTRYPOINT ["/root/.duckdb/cli/latest/duckdb"]
```

Build and run an interactive container using:
```sh
docker build . -t "duckdb"
docker run -it --rm duckdb
```

Then, run the following commands to create some sample data:
```sql
CREATE TABLE id_mapping (
  id UUID NOT NULL,
  child_id UUID NOT NULL
) ;

-- some initial data, mapping one id to 5 others
INSERT INTO id_mapping (id, child_id)
SELECT id, UNNEST(child_ids)
FROM (
  SELECT
    UUID() AS id,
    [UUID(), UUID(), UUID(), UUID(), UUID()] AS child_ids
  FROM generate_series(10000)
) ;

-- insert mappings from id to itself
INSERT INTO id_mapping (id, child_id)
SELECT id, id FROM id_mapping ;

-- each id should now have exactly 6 child_ids (unless the same uuid was generated twice somewhere)
SELECT MIN(c), MAX(c)
FROM (
  SELECT COUNT(DISTINCT child_id) AS c
  FROM id_mapping
  GROUP BY id
) ;

-- create the aggregated table
CREATE TABLE example AS
SELECT
  id,
  LIST_DISTINCT(
    COALESCE(
      ARRAY_AGG(child_id) FILTER ( child_id != id ),
      []
    )
  ) AS child_ids
FROM id_mapping
GROUP BY id ;
```

With the above aggregation filter, the aggregated table should have no element in the `child_ids` list that is the id itself. This works as expected on MacOS.

Now run:
```sql
-- this query should return 0 rows, because:
-- (1) the aggregated list should never contain the id itself
-- (2) all values in the list should be unique
-- (3) there should be exactly 5 child_ids for every id
SELECT *
FROM example
WHERE LENGTH(LIST_FILTER(child_ids, lambda child_id: id = child_id)) > 0
OR LENGTH(child_ids) != LENGTH(LIST_DISTINCT(child_ids))
OR LENGTH(child_ids) != 5 ;
```

You'll see some results. You'd expect none, because:
- (1) the aggregated list should never contain the id itself
- (2) all values in the list should be unique
- (3) there should be exactly 5 child_ids for every id

### OS:

Linux aarch64 (affected), MacOS aarch64 (working)

### DuckDB Version:

v1.3.0

### DuckDB Client:

go-duckdb

### Hardware:

AWS Fargate ARM64, 16 GB Memory, 8 vCPUs (affected), M1 Pro (working)

### Full Name:

Felix Wollschläger

### Affiliation:

N/A

### What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

### Did you include all relevant data sets for reproducing the issue?

Yes

### Did you include all code required to reproduce the issue?

- [x] Yes, I have

### Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

- [x] Yes, I have

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

UUID Comparison in Aggregation Filter broken on Linux (v1.3.0 Regression) #17757

What happens?

To Reproduce

OS:

DuckDB Version:

DuckDB Client:

Hardware:

Full Name:

Affiliation:

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

Did you include all relevant data sets for reproducing the issue?

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

UUID Comparison in Aggregation Filter broken on Linux (v1.3.0 Regression) #17757

Description

What happens?

To Reproduce

OS:

DuckDB Version:

DuckDB Client:

Hardware:

Full Name:

Affiliation:

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

Did you include all relevant data sets for reproducing the issue?

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions