-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Description
What happens?
When performing a COPY
using a PARTITION_BY(<timestamp-column>)
we end up losing the time information of <timestamp-column>
.
After some preliminary testing I found out that the issue is with 0.9.x and 0.8.1 has what I consider to be the correct behavior based on the test case below.
To Reproduce
Create a script duck-issue.sh
to generate a fake dataset of page views per minute that is rolled up into hourly buckets:
# Ensure a clean state
output="/tmp/duckdb-issue/$(${DUCK_CLI:-duckdb} --version | awk '{print $1}')"
rm -rf "${output}"
mkdir -p "${output}"
# Generate the data and do the partitioned export
cat <<-SQL | "${DUCK_CLI:-duckdb}"
CREATE TABLE raw_data (
ts TIMESTAMP_S NOT NULL,
hits INTEGER NOT NULL
);
INSERT INTO raw_data
SELECT *, (random() * 500)::INTEGER
FROM RANGE(TIMESTAMP '2023-11-01', TIMESTAMP '2023-11-06', INTERVAL 1 MINUTE);
CREATE TABLE timeseries AS (
SELECT DATE_TRUNC('hour', ts) AS bucket, SUM(hits)::BIGINT AS total
FROM raw_data
GROUP BY bucket
);
COPY (
SELECT * FROM timeseries
) TO '${output}' (
FORMAT 'PARQUET', COMPRESSION 'SNAPPY', PARTITION_BY (bucket), OVERWRITE_OR_IGNORE
);
SQL
Run the script with each version:
DUCK_CLI=duckdb-0.8.1 ./duck-issue.sh
DUCK_CLI=duckdb-0.9.1 ./duck-issue.sh
DUCK_CLI=duckdb-latest ./duck-issue.sh
Then query partitioned parquets, notice that the time is present for the bucket column on 0.8.1:
$ duckdb -table -c "SELECT * FROM read_parquet('/tmp/duckdb-issue/v0.8.1/*/*.parquet') LIMIT 5"
+---------------------+-------+
| bucket | total |
+---------------------+-------+
| 2023-11-02 02:00:00 | 16181 |
| 2023-11-05 10:00:00 | 13685 |
| 2023-11-01 15:00:00 | 13948 |
| 2023-11-04 05:00:00 | 16777 |
| 2023-11-04 08:00:00 | 15608 |
+---------------------+-------+
But not for 0.9.1:
$ duckdb -table -c "SELECT * FROM read_parquet('/tmp/duckdb-issue/v0.9.1/*/*.parquet') LIMIT 5"
+------------+-------+
| bucket | total |
+------------+-------+
| 2023-11-02 | 12711 |
| 2023-11-05 | 14800 |
| 2023-11-01 | 12593 |
| 2023-11-04 | 15917 |
| 2023-11-04 | 15025 |
+------------+-------+
There is a workaround to make things work, but definitely not optimal and would require downstream consumers to know about it:
$ duckdb -table -c "SELECT * FROM read_parquet('/tmp/duckdb-issue/v0.9.1/*/*.parquet', hive_partitioning=1, hive_types={'bucket': timestamp}) LIMIT 5"
+---------------------+-------+
| bucket | total |
+---------------------+-------+
| 2023-11-02 02:00:00 | 12711 |
| 2023-11-05 10:00:00 | 14800 |
| 2023-11-01 15:00:00 | 12593 |
| 2023-11-04 05:00:00 | 15917 |
| 2023-11-04 08:00:00 | 15025 |
+---------------------+-------+
Looks like the problem is still there on 0.9.2-dev242:
$ duckdb -table -c "SELECT * FROM read_parquet('/tmp/duckdb-issue/v0.9.2-dev242/*/*.parquet') LIMIT 5"
+------------+-------+
| bucket | total |
+------------+-------+
| 2023-11-02 | 16394 |
| 2023-11-05 | 14492 |
| 2023-11-01 | 14577 |
| 2023-11-04 | 14179 |
| 2023-11-04 | 13432 |
+------------+-------+
OS:
x64
DuckDB Version:
0.9.1
DuckDB Client:
CLI
Full Name:
Fabio Rehm
Affiliation:
Koala
Have you tried this on the latest main
branch?
I have tested with a main build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- Yes, I have