Skip to content

UUID Comparison in Aggregation Filter broken on Linux (v1.3.0 Regression) #17757

@its-felix

Description

@its-felix

What happens?

The aggregation filter yields different results between MacOS and Linux. On MacOS, the filter works as expected. On Linux, for roughly 10% of the input it doesn't work correctly.

I have this table:

-- create codeshares table
CREATE TABLE lh_operating_codeshares (
    operatingAirlineId UUID NOT NULL,
    operatingFlightNumber USMALLINT NOT NULL,
    operatingSuffix TEXT NOT NULL,
    departureAirportId UUID NOT NULL,
    departureDateLocal DATE NOT NULL,
    codeShares STRUCT(airline_id UUID, number USMALLINT, suffix TEXT)[] NOT NULL,
    CHECK ( TO_JSON(codeShares) = TO_JSON(LIST_SORT(LIST_DISTINCT(codeShares))) )
) ;

Where I insert all aggregated codeshare flight numbers (or any empty list, if a flight does not have any):

-- insert codeshares
INSERT INTO lh_operating_codeshares (
    operatingAirlineId,
    operatingFlightNumber,
    operatingSuffix,
    departureAirportId,
    departureDateLocal,
    codeShares
)
SELECT
    operatingAirlineId,
    operatingFlightNumber,
    operatingSuffix,
    departureAirportId,
    departureDateLocal,
    LIST_SORT(
        LIST_DISTINCT(
            COALESCE(
                ARRAY_AGG({
                    'airline_id': airlineId,
                    'number': flightNumber,
                    'suffix': suffix
                }) FILTER (
                    airlineId IS NOT NULL
                    AND ( airlineId != operatingAirlineId OR flightNumber != operatingFlightNumber OR suffix != operatingSuffix )
                ),
                []
            )
        )
    ) AS codeShares
FROM lh_all_flights_with_ids
GROUP BY
    operatingAirlineId,
    operatingFlightNumber,
    operatingSuffix,
    departureAirportId,
    departureDateLocal
;

The filter:

FILTER (
                    airlineId IS NOT NULL
                    AND ( airlineId != operatingAirlineId OR flightNumber != operatingFlightNumber OR suffix != operatingSuffix )
                )

should ensure that the aggregated list does not contain the operating flight number itself. On MacOS, this works as expected. On Linux, about 10% of the rows still contain the operating flight number itself in the aggregated list.

As a workaround, I've added the following update afterwards:

-- harden codeshares, for some reason the filter doesnt work on all platforms(?)
UPDATE lh_operating_codeshares
SET codeShares = LIST_SORT(LIST_DISTINCT(LIST_FILTER(
    codeShares,
    lambda cs: ( cs.airline_id != operatingAirlineId OR cs.number != operatingFlightNumber OR cs.suffix != operatingSuffix )
)))
WHERE LENGTH(LIST_FILTER(
    codeShares,
    lambda cs: ( cs.airline_id = operatingAirlineId AND cs.number = operatingFlightNumber AND cs.suffix = operatingSuffix )
)) > 0 ;

Which, as expected, never touches any rows when executed on MacOS, but touches about 10% of the rows when I run the process on Linux.

To Reproduce

Create a docker image using this Dockerfile:

FROM amazonlinux:2023
RUN yum install -y gzip
RUN curl https://install.duckdb.org | sh
ENTRYPOINT ["/root/.duckdb/cli/latest/duckdb"]

Build and run an interactive container using:

docker build . -t "duckdb"
docker run -it --rm duckdb

Then, run the following commands to create some sample data:

CREATE TABLE id_mapping (
  id UUID NOT NULL,
  child_id UUID NOT NULL
) ;

-- some initial data, mapping one id to 5 others
INSERT INTO id_mapping (id, child_id)
SELECT id, UNNEST(child_ids)
FROM (
  SELECT
    UUID() AS id,
    [UUID(), UUID(), UUID(), UUID(), UUID()] AS child_ids
  FROM generate_series(10000)
) ;

-- insert mappings from id to itself
INSERT INTO id_mapping (id, child_id)
SELECT id, id FROM id_mapping ;

-- each id should now have exactly 6 child_ids (unless the same uuid was generated twice somewhere)
SELECT MIN(c), MAX(c)
FROM (
  SELECT COUNT(DISTINCT child_id) AS c
  FROM id_mapping
  GROUP BY id
) ;

-- create the aggregated table
CREATE TABLE example AS
SELECT
  id,
  LIST_DISTINCT(
    COALESCE(
      ARRAY_AGG(child_id) FILTER ( child_id != id ),
      []
    )
  ) AS child_ids
FROM id_mapping
GROUP BY id ;

With the above aggregation filter, the aggregated table should have no element in the child_ids list that is the id itself. This works as expected on MacOS.

Now run:

-- this query should return 0 rows, because:
-- (1) the aggregated list should never contain the id itself
-- (2) all values in the list should be unique
-- (3) there should be exactly 5 child_ids for every id
SELECT *
FROM example
WHERE LENGTH(LIST_FILTER(child_ids, lambda child_id: id = child_id)) > 0
OR LENGTH(child_ids) != LENGTH(LIST_DISTINCT(child_ids))
OR LENGTH(child_ids) != 5 ;

You'll see some results. You'd expect none, because:

  • (1) the aggregated list should never contain the id itself
  • (2) all values in the list should be unique
  • (3) there should be exactly 5 child_ids for every id

OS:

Linux aarch64 (affected), MacOS aarch64 (working)

DuckDB Version:

v1.3.0

DuckDB Client:

go-duckdb

Hardware:

AWS Fargate ARM64, 16 GB Memory, 8 vCPUs (affected), M1 Pro (working)

Full Name:

Felix Wollschläger

Affiliation:

N/A

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions