Description
What happens?
When I write these 9 polars dataframes (saved as parquet files here: https://drive.google.com/drive/folders/18-xFUBLvPyux-erwETaATOV_fUpkLp4B?usp=sharing) to a duckdb table, I find that one single value from one single column is changed when I read the data back out.
To Reproduce
First, download these 9 parquet files to your local machine: https://drive.google.com/drive/folders/18-xFUBLvPyux-erwETaATOV_fUpkLp4B?usp=sharing
Then, run the following code:
import duckdb
import polars as pl
import os
from datetime import date
from zipfile import ZipFile
### Constants
FILESDIR = "path_you_saved_downloaded_parquet_files_to"
DAYS = range(1,10)
DUCKDB_FILE = os.path.join("path_you_saved_downloaded_parquet_files_to", "so_test.ddb")
### Read data from disk and write to duckdb
for day in DAYS:
    pldf = pl.read_parquet(os.path.join(FILESDIR, f"f{day}_masked.parquet"))
    with duckdb.connect(database=DUCKDB_FILE, read_only=False) as conn:
        if day == 1:
            conn.execute(f"CREATE OR REPLACE TABLE so_test AS SELECT * FROM pldf")
        else:
            conn.execute(f"INSERT INTO so_test SELECT * FROM pldf")
### Verify that the data has changed by comparing a specific row from last df to its record in duckdb
PROBLEM_ROW = 1108956
print(pldf.slice(PROBLEM_ROW,1)) # hour=4.0
with duckdb.connect(database=DUCKDB_FILE, read_only=False) as conn:
    print(conn.execute(f"SELECT * from so_test WHERE (DAILY_FILE_ROW_ORDER = {PROBLEM_ROW}) and (DAILY_FILE_DATE = '2021-01-09')").pl()) # hour=3.0 now
Below is a screenshot of what I see when I run the code above. Note the rows are identical except hour has been changed from 4.0 to 3.0. (Also, I see hour for that row has been changed to 3.0 even if I read the result as a pandas .df().)
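To check whether any other values were altered besides this one row, the comparison can be widened from a single row to the whole day. The following is a minimal sketch in plain Python, not part of the original reproduction: `find_mismatches` is a hypothetical helper name, and it assumes both sides are given as lists of row tuples with the same column order and that the key column (here DAILY_FILE_ROW_ORDER) is unique within the rows being compared.

```python
from typing import Any, List, Optional, Sequence, Tuple

Row = Sequence[Any]
Mismatch = Tuple[Any, Optional[int], Any, Any]


def _same(expected: Any, actual: Any) -> bool:
    """Equality that treats two NaN values as equal."""
    if expected == actual:
        return True
    # NaN is the only value that is not equal to itself.
    return expected != expected and actual != actual


def find_mismatches(expected: List[Row], actual: List[Row], key_idx: int) -> List[Mismatch]:
    """Compare rows by key and report every differing cell.

    Returns (key, column_index, expected_value, actual_value) tuples.
    A row missing from `actual` is reported once with column_index None.
    Assumes the key at `key_idx` is unique within each list (hypothetical helper).
    """
    actual_by_key = {row[key_idx]: row for row in actual}
    mismatches: List[Mismatch] = []
    for exp_row in expected:
        key = exp_row[key_idx]
        act_row = actual_by_key.get(key)
        if act_row is None:
            mismatches.append((key, None, exp_row, None))
            continue
        for col, (exp_val, act_val) in enumerate(zip(exp_row, act_row)):
            if not _same(exp_val, act_val):
                mismatches.append((key, col, exp_val, act_val))
    return mismatches
```

With the data from this report, `expected` could come from `pldf.rows()` and `actual` from `conn.execute("SELECT * FROM so_test WHERE DAILY_FILE_DATE = '2021-01-09'").fetchall()`, with `key_idx = pldf.columns.index("DAILY_FILE_ROW_ORDER")`. The two libraries may represent some types differently (dates or decimals, for example), so differences reported in such columns would need a manual look before being counted as corruption.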
OS:
Windows Server 2019 Datacenter, Version 10.0.17763 Build 17763 x64
DuckDB Version:
1.1.0
DuckDB Client:
python (python version 3.9.6)
Hardware:
No response
Full Name:
Max Epstein
Affiliation:
Max Power Consulting, LLC
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have