-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Description
What happens?
When I use DuckDB, it seems to produce incorrect query results. I'm not exactly sure the root cause, but I tried to reproduce the error with much a smaller test case.
To Reproduce
I put the data obj_clevrer_short.csv
on Google Drive: https://drive.google.com/file/d/1QkwvNnEEPT_MyRSrW4Nm-WS5luUl2V0P/view?usp=sharing. Note that commenting out the line conn.execute("CREATE INDEX IF NOT EXISTS idx_obj_clevrer ON Obj_clevrer (vid);")
will resolve the error.
import duckdb
# Create database, and load data
conn = duckdb.connect(database='test.duckdb', read_only=False)
conn.execute("CREATE TABLE Obj_clevrer (oid INT, vid INT, fid INT, shape varchar, color varchar, material varchar, x1 float, y1 float, x2 float, y2 float)")
conn.execute("COPY Obj_clevrer FROM 'obj_clevrer_short.csv' (FORMAT 'csv', DELIMITER ',', HEADER 0);")
# Creating index causes incorrect results
conn.execute("CREATE INDEX IF NOT EXISTS idx_obj_clevrer ON Obj_clevrer (vid);")
# Run some queries
query_results = []
for vid in range(500):
df1 = conn.execute(f"""
SELECT oid as o0_oid, fid
FROM Obj_clevrer
WHERE color = 'red' AND vid = {vid}
""").df()
df2 = conn.execute(f"""
SELECT oid as o0_oid, fid
FROM Obj_clevrer
WHERE material = 'rubber' AND vid = {vid}
""").df()
graph_df1 = conn.execute(f"SELECT * FROM df1 natural join df2").df()
if len(graph_df1):
query_results.append(vid)
# query_results should be equal to ground_truth
ground_truth = [1, 2, 3, 6, 14, 21, 23, 25, 29, 30, 32, 33, 40, 41, 44, 50, 51, 54, 63, 71, 74, 75, 78, 79, 83, 84, 90, 91, 94, 95, 97, 98, 100, 105, 107, 109, 111, 114, 115, 117, 125, 126, 128, 132, 138, 139, 144, 146, 149, 151, 154, 156, 163, 165, 169, 171, 177, 180, 183, 185, 186, 189, 191, 193, 197, 201, 202, 205, 206, 207, 209, 214, 217, 221, 222, 223, 235, 241, 244, 248, 249, 250, 252, 254, 257, 261, 266, 268, 270, 271, 276, 287, 291, 292, 299, 308, 314, 315, 317, 326, 342, 344, 346, 347, 349, 350, 351, 355, 364, 366, 367, 368, 369, 372, 380, 384, 392, 394, 399, 400, 402, 406, 407, 410, 412, 414, 416, 418, 419, 420, 424, 428, 431, 433, 435, 438, 442, 448, 450, 452, 456, 462, 463, 464, 476, 477, 479, 480, 484, 485, 486, 489, 490, 493]
diff1 = sorted(list(set(query_results) - set(ground_truth)))
print("diff1", diff1)
if len(diff1):
# This gives incorrect results. Specifically, I found that 'oid' and 'fid' values of some tuples become 0 when running this query
test_df1 = conn.execute("select * from Obj_clevrer where vid = {} and fid = 0".format(diff1[0])).df()
print(test_df1.to_string())
# This gives correct results
test_df2 = conn.execute("select * from Obj_clevrer").df()
print(test_df2.loc[(test_df2['vid'] == diff1[0]) & (test_df2['fid'] == 0)].to_string())
Output: the output is non-deterministic, but I'm able to obtain a non-empty diff1
every time I run the above code. In this example output, query_results
includes three vids that are not in ground_truth
. Looking at vid 181, test_df1 gives an incorrect result, where the last tuple's oid and fid are incorrectly set to 0. However, test_df2 gives the correct result.
diff1:
[181, 232, 352]
test_df1:
oid vid fid shape color material x1 y1 x2 y2
0 0 181 0 sphere blue rubber 321.0 73.0 355.0 108.0
1 1 181 0 cube brown rubber 114.0 72.0 154.0 116.0
2 2 181 0 cylinder green rubber 138.0 1.0 161.0 18.0
3 3 181 0 sphere red metal 112.0 109.0 152.0 149.0
4 0 181 0 sphere red metal 335.0 77.0 371.0 112.0
test_df2:
oid vid fid shape color material x1 y1 x2 y2
110051 0 181 0 sphere blue rubber 321.0 73.0 355.0 108.0
110052 1 181 0 cube brown rubber 114.0 72.0 154.0 116.0
110053 2 181 0 cylinder green rubber 138.0 1.0 161.0 18.0
110054 3 181 0 sphere red metal 112.0 109.0 152.0 149.0
OS:
macos (and also linux)
DuckDB Version:
0.9.2 and 0.9.3-dev2705
DuckDB Client:
Python
Full Name:
Enhao Zhang
Affiliation:
University of Washington
Have you tried this on the latest main
branch?
I have tested with a main build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- Yes, I have