Skip to content

Potential regression from 0.8 to 0.9 around partitioned writes by timestamp columns #9589

@fgrehm

Description

@fgrehm

What happens?

When performing a COPY using a PARTITION_BY(<timestamp-column>) we end up losing the time information of <timestamp-column>.

After some preliminary testing I found out that the issue is with 0.9.x and 0.8.1 has what I consider to be the correct behavior based on the test case below.

To Reproduce

Create a script duck-issue.sh to generate a fake dataset of page views per minute that is rolled up into hourly buckets:

# Ensure a clean state
output="/tmp/duckdb-issue/$(${DUCK_CLI:-duckdb} --version | awk '{print $1}')"
rm -rf "${output}"
mkdir -p "${output}"

# Generate the data and do the partitioned export
cat <<-SQL | "${DUCK_CLI:-duckdb}"
CREATE TABLE raw_data (
  ts TIMESTAMP_S NOT NULL,
  hits INTEGER NOT NULL
);
INSERT INTO raw_data
SELECT *, (random() * 500)::INTEGER
FROM RANGE(TIMESTAMP '2023-11-01', TIMESTAMP '2023-11-06', INTERVAL 1 MINUTE);
CREATE TABLE timeseries AS (
  SELECT DATE_TRUNC('hour', ts) AS bucket, SUM(hits)::BIGINT AS total
  FROM raw_data
  GROUP BY bucket
);
COPY (
  SELECT * FROM timeseries
) TO '${output}' (
  FORMAT 'PARQUET', COMPRESSION 'SNAPPY', PARTITION_BY (bucket), OVERWRITE_OR_IGNORE
);
SQL

Run the script with each version:

DUCK_CLI=duckdb-0.8.1 ./duck-issue.sh
DUCK_CLI=duckdb-0.9.1 ./duck-issue.sh
DUCK_CLI=duckdb-latest ./duck-issue.sh

Then query partitioned parquets, notice that the time is present for the bucket column on 0.8.1:

$ duckdb -table -c "SELECT * FROM read_parquet('/tmp/duckdb-issue/v0.8.1/*/*.parquet') LIMIT 5"

+---------------------+-------+
|       bucket        | total |
+---------------------+-------+
| 2023-11-02 02:00:00 | 16181 |
| 2023-11-05 10:00:00 | 13685 |
| 2023-11-01 15:00:00 | 13948 |
| 2023-11-04 05:00:00 | 16777 |
| 2023-11-04 08:00:00 | 15608 |
+---------------------+-------+

But not for 0.9.1:

$ duckdb -table -c "SELECT * FROM read_parquet('/tmp/duckdb-issue/v0.9.1/*/*.parquet') LIMIT 5"

+------------+-------+
|   bucket   | total |
+------------+-------+
| 2023-11-02 | 12711 |
| 2023-11-05 | 14800 |
| 2023-11-01 | 12593 |
| 2023-11-04 | 15917 |
| 2023-11-04 | 15025 |
+------------+-------+

There is a workaround to make things work, but definitely not optimal and would require downstream consumers to know about it:

$ duckdb -table -c "SELECT * FROM read_parquet('/tmp/duckdb-issue/v0.9.1/*/*.parquet', hive_partitioning=1, hive_types={'bucket': timestamp}) LIMIT 5"
+---------------------+-------+
|       bucket        | total |
+---------------------+-------+
| 2023-11-02 02:00:00 | 12711 |
| 2023-11-05 10:00:00 | 14800 |
| 2023-11-01 15:00:00 | 12593 |
| 2023-11-04 05:00:00 | 15917 |
| 2023-11-04 08:00:00 | 15025 |
+---------------------+-------+

Looks like the problem is still there on 0.9.2-dev242:

$ duckdb -table -c "SELECT * FROM read_parquet('/tmp/duckdb-issue/v0.9.2-dev242/*/*.parquet') LIMIT 5"

+------------+-------+
|   bucket   | total |
+------------+-------+
| 2023-11-02 | 16394 |
| 2023-11-05 | 14492 |
| 2023-11-01 | 14577 |
| 2023-11-04 | 14179 |
| 2023-11-04 | 13432 |
+------------+-------+

OS:

x64

DuckDB Version:

0.9.1

DuckDB Client:

CLI

Full Name:

Fabio Rehm

Affiliation:

Koala

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions