-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Description
What happens?
The aggregation filter yields different results between MacOS and Linux. On MacOS, the filter works as expected. On Linux, for roughly 10% of the input it doesn't work correctly.
I have this table:
-- create codeshares table
CREATE TABLE lh_operating_codeshares (
operatingAirlineId UUID NOT NULL,
operatingFlightNumber USMALLINT NOT NULL,
operatingSuffix TEXT NOT NULL,
departureAirportId UUID NOT NULL,
departureDateLocal DATE NOT NULL,
codeShares STRUCT(airline_id UUID, number USMALLINT, suffix TEXT)[] NOT NULL,
CHECK ( TO_JSON(codeShares) = TO_JSON(LIST_SORT(LIST_DISTINCT(codeShares))) )
) ;
Where I insert all aggregated codeshare flight numbers (or any empty list, if a flight does not have any):
-- insert codeshares
INSERT INTO lh_operating_codeshares (
operatingAirlineId,
operatingFlightNumber,
operatingSuffix,
departureAirportId,
departureDateLocal,
codeShares
)
SELECT
operatingAirlineId,
operatingFlightNumber,
operatingSuffix,
departureAirportId,
departureDateLocal,
LIST_SORT(
LIST_DISTINCT(
COALESCE(
ARRAY_AGG({
'airline_id': airlineId,
'number': flightNumber,
'suffix': suffix
}) FILTER (
airlineId IS NOT NULL
AND ( airlineId != operatingAirlineId OR flightNumber != operatingFlightNumber OR suffix != operatingSuffix )
),
[]
)
)
) AS codeShares
FROM lh_all_flights_with_ids
GROUP BY
operatingAirlineId,
operatingFlightNumber,
operatingSuffix,
departureAirportId,
departureDateLocal
;
The filter:
FILTER (
airlineId IS NOT NULL
AND ( airlineId != operatingAirlineId OR flightNumber != operatingFlightNumber OR suffix != operatingSuffix )
)
should ensure that the aggregated list does not contain the operating flight number itself. On MacOS, this works as expected. On Linux, about 10% of the rows still contain the operating flight number itself in the aggregated list.
As a workaround, I've added the following update afterwards:
-- harden codeshares, for some reason the filter doesnt work on all platforms(?)
UPDATE lh_operating_codeshares
SET codeShares = LIST_SORT(LIST_DISTINCT(LIST_FILTER(
codeShares,
lambda cs: ( cs.airline_id != operatingAirlineId OR cs.number != operatingFlightNumber OR cs.suffix != operatingSuffix )
)))
WHERE LENGTH(LIST_FILTER(
codeShares,
lambda cs: ( cs.airline_id = operatingAirlineId AND cs.number = operatingFlightNumber AND cs.suffix = operatingSuffix )
)) > 0 ;
Which, as expected, never touches any rows when executed on MacOS, but touches about 10% of the rows when I run the process on Linux.
To Reproduce
Create a docker image using this Dockerfile:
FROM amazonlinux:2023
RUN yum install -y gzip
RUN curl https://install.duckdb.org | sh
ENTRYPOINT ["/root/.duckdb/cli/latest/duckdb"]
Build and run an interactive container using:
docker build . -t "duckdb"
docker run -it --rm duckdb
Then, run the following commands to create some sample data:
CREATE TABLE id_mapping (
id UUID NOT NULL,
child_id UUID NOT NULL
) ;
-- some initial data, mapping one id to 5 others
INSERT INTO id_mapping (id, child_id)
SELECT id, UNNEST(child_ids)
FROM (
SELECT
UUID() AS id,
[UUID(), UUID(), UUID(), UUID(), UUID()] AS child_ids
FROM generate_series(10000)
) ;
-- insert mappings from id to itself
INSERT INTO id_mapping (id, child_id)
SELECT id, id FROM id_mapping ;
-- each id should now have exactly 6 child_ids (unless the same uuid was generated twice somewhere)
SELECT MIN(c), MAX(c)
FROM (
SELECT COUNT(DISTINCT child_id) AS c
FROM id_mapping
GROUP BY id
) ;
-- create the aggregated table
CREATE TABLE example AS
SELECT
id,
LIST_DISTINCT(
COALESCE(
ARRAY_AGG(child_id) FILTER ( child_id != id ),
[]
)
) AS child_ids
FROM id_mapping
GROUP BY id ;
With the above aggregation filter, the aggregated table should have no element in the child_ids
list that is the id itself. This works as expected on MacOS.
Now run:
-- this query should return 0 rows, because:
-- (1) the aggregated list should never contain the id itself
-- (2) all values in the list should be unique
-- (3) there should be exactly 5 child_ids for every id
SELECT *
FROM example
WHERE LENGTH(LIST_FILTER(child_ids, lambda child_id: id = child_id)) > 0
OR LENGTH(child_ids) != LENGTH(LIST_DISTINCT(child_ids))
OR LENGTH(child_ids) != 5 ;
You'll see some results. You'd expect none, because:
- (1) the aggregated list should never contain the id itself
- (2) all values in the list should be unique
- (3) there should be exactly 5 child_ids for every id
OS:
Linux aarch64 (affected), MacOS aarch64 (working)
DuckDB Version:
v1.3.0
DuckDB Client:
go-duckdb
Hardware:
AWS Fargate ARM64, 16 GB Memory, 8 vCPUs (affected), M1 Pro (working)
Full Name:
Felix Wollschläger
Affiliation:
N/A
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have