A single value changes after inserting ~1.5 million rows from polars dataframes into duckdb and then querying them back out #14020

@MaxPowerWasTaken

What happens?

When I write these 9 polars dataframes (saved as parquet files here: https://drive.google.com/drive/folders/18-xFUBLvPyux-erwETaATOV_fUpkLp4B?usp=sharing) to a duckdb table and then read the data back out, I find that exactly one value in one column (hour) has changed.

To Reproduce

First, download these 9 parquet files to your local machine: https://drive.google.com/drive/folders/18-xFUBLvPyux-erwETaATOV_fUpkLp4B?usp=sharing

Then, run the following code:

import os

import duckdb
import polars as pl

### Constants
FILESDIR = "path_you_saved_downloaded_parquet_files_to"
DAYS = range(1,10)  # days 1 through 9, one parquet file per day
DUCKDB_FILE = os.path.join(FILESDIR, "so_test.ddb")

### Read data from disk and write to duckdb
for day in DAYS:
    pldf = pl.read_parquet(os.path.join(FILESDIR, f"f{day}_masked.parquet"))
    with duckdb.connect(database=DUCKDB_FILE, read_only=False) as conn:
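        if day == 1:
            # First file creates the table; duckdb resolves `pldf` from the local Python scope
            conn.execute("CREATE OR REPLACE TABLE so_test AS SELECT * FROM pldf")
        else:
            # Remaining files are appended to the same table
            conn.execute("INSERT INTO so_test SELECT * FROM pldf")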

### Verify that the data has changed by comparing a specific row from last df to its record in duckdb
PROBLEM_ROW = 1108956

print(pldf.slice(PROBLEM_ROW,1))  # row from the last polars df (day 9): hour=4.0
with duckdb.connect(database=DUCKDB_FILE, read_only=False) as conn:
    query = (
        "SELECT * FROM so_test "
        f"WHERE (DAILY_FILE_ROW_ORDER = {PROBLEM_ROW}) AND (DAILY_FILE_DATE = '2021-01-09')"
    )
    print(conn.execute(query).pl())  # same row read back from duckdb: hour=3.0 now

Below is a screenshot of what I see when I run the code above. Note that the two rows are identical except that hour has changed from 4.0 to 3.0. I also see hour = 3.0 for that row if I read the result as a pandas dataframe with .df().

[Screenshot: screenshot for duckdb issue]
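For anyone who wants to check whether other cells are affected, rather than inspecting one known row, here is a minimal sketch of a cell-by-cell comparison. It is not part of the original reproduction: `diff_rows` is a hypothetical helper, it works on plain lists of row dicts, and the sample rows are made up for illustration (only the `DAILY_FILE_ROW_ORDER` and `hour` column names and the 4.0 vs 3.0 values come from the report above).

```python
import math
from typing import Any, Dict, List, Tuple


def _same(a: Any, b: Any) -> bool:
    """Equality that treats two float NaNs as equal."""
    if isinstance(a, float) and isinstance(b, float) and math.isnan(a) and math.isnan(b):
        return True
    return a == b


def diff_rows(
    before: List[Dict[str, Any]],
    after: List[Dict[str, Any]],
    key: str,
) -> List[Tuple[Any, Any, Any, Any]]:
    """Return (key_value, column, before_value, after_value) for each differing cell.

    Rows are matched on `key`, which is assumed to be unique within each list.
    A row missing from `after` is reported once with column None.
    """
    after_by_key = {row[key]: row for row in after}
    diffs = []
    for row in before:
        other = after_by_key.get(row[key])
        if other is None:
            diffs.append((row[key], None, None, None))
            continue
        for col, val in row.items():
            if not _same(val, other.get(col)):
                diffs.append((row[key], col, val, other.get(col)))
    return diffs


# Illustrative rows only, not taken from the real parquet files.
before = [
    {"DAILY_FILE_ROW_ORDER": 1108955, "hour": 4.0},
    {"DAILY_FILE_ROW_ORDER": 1108956, "hour": 4.0},
]
after = [
    {"DAILY_FILE_ROW_ORDER": 1108955, "hour": 4.0},
    {"DAILY_FILE_ROW_ORDER": 1108956, "hour": 3.0},
]

print(diff_rows(before, after, "DAILY_FILE_ROW_ORDER"))
```

With the real data, the two lists could presumably come from `pldf.to_dicts()` and from `.pl().to_dicts()` on the duckdb query result, filtered to a single `DAILY_FILE_DATE` so that `DAILY_FILE_ROW_ORDER` is unique. Converting over a million rows to dicts is memory-hungry, so comparing one day at a time is likely the safer choice.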

OS:

Windows Server 2019 Datacenter, Version 10.0.17763 Build 17763 x64

DuckDB Version:

1.1.0

DuckDB Client:

python (python version 3.9.6)

Hardware:

No response

Full Name:

Max Epstein

Affiliation:

Max Power Consulting, LLC

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
