Skip to content

Creating index seems to lead to incorrect query results #10251

@ZHANG-EH

Description

@ZHANG-EH

What happens?

When I use DuckDB, it seems to produce incorrect query results. I'm not exactly sure the root cause, but I tried to reproduce the error with much a smaller test case.

To Reproduce

I put the data obj_clevrer_short.csv on Google Drive: https://drive.google.com/file/d/1QkwvNnEEPT_MyRSrW4Nm-WS5luUl2V0P/view?usp=sharing. Note that commenting out the line conn.execute("CREATE INDEX IF NOT EXISTS idx_obj_clevrer ON Obj_clevrer (vid);") will resolve the error.

import duckdb

# Create database, and load data
conn = duckdb.connect(database='test.duckdb', read_only=False)
conn.execute("CREATE TABLE Obj_clevrer (oid INT, vid INT, fid INT, shape varchar, color varchar, material varchar, x1 float, y1 float, x2 float, y2 float)")
conn.execute("COPY Obj_clevrer FROM 'obj_clevrer_short.csv' (FORMAT 'csv', DELIMITER ',', HEADER 0);")

# Creating index causes incorrect results
conn.execute("CREATE INDEX IF NOT EXISTS idx_obj_clevrer ON Obj_clevrer (vid);")

# Run some queries
query_results = []
for vid in range(500):
    df1 = conn.execute(f"""
        SELECT oid as o0_oid, fid
        FROM Obj_clevrer
        WHERE color = 'red' AND vid = {vid}
    """).df()
    df2 = conn.execute(f"""
        SELECT oid as o0_oid, fid
        FROM Obj_clevrer
        WHERE material = 'rubber' AND vid = {vid}
    """).df()
    graph_df1 = conn.execute(f"SELECT * FROM df1 natural join df2").df()
    if len(graph_df1):
        query_results.append(vid)

# query_results should be equal to ground_truth
ground_truth = [1, 2, 3, 6, 14, 21, 23, 25, 29, 30, 32, 33, 40, 41, 44, 50, 51, 54, 63, 71, 74, 75, 78, 79, 83, 84, 90, 91, 94, 95, 97, 98, 100, 105, 107, 109, 111, 114, 115, 117, 125, 126, 128, 132, 138, 139, 144, 146, 149, 151, 154, 156, 163, 165, 169, 171, 177, 180, 183, 185, 186, 189, 191, 193, 197, 201, 202, 205, 206, 207, 209, 214, 217, 221, 222, 223, 235, 241, 244, 248, 249, 250, 252, 254, 257, 261, 266, 268, 270, 271, 276, 287, 291, 292, 299, 308, 314, 315, 317, 326, 342, 344, 346, 347, 349, 350, 351, 355, 364, 366, 367, 368, 369, 372, 380, 384, 392, 394, 399, 400, 402, 406, 407, 410, 412, 414, 416, 418, 419, 420, 424, 428, 431, 433, 435, 438, 442, 448, 450, 452, 456, 462, 463, 464, 476, 477, 479, 480, 484, 485, 486, 489, 490, 493]
diff1 = sorted(list(set(query_results) - set(ground_truth)))
print("diff1", diff1)
if len(diff1):
    # This gives incorrect results. Specifically, I found that 'oid' and 'fid' values of some tuples become 0 when running this query
    test_df1 = conn.execute("select * from Obj_clevrer where vid = {} and fid = 0".format(diff1[0])).df()
    print(test_df1.to_string())

    # This gives correct results
    test_df2 = conn.execute("select * from Obj_clevrer").df()
    print(test_df2.loc[(test_df2['vid'] == diff1[0]) & (test_df2['fid'] == 0)].to_string())

Output: the output is non-deterministic, but I'm able to obtain a non-empty diff1 every time I run the above code. In this example output, query_results includes three vids that are not in ground_truth. Looking at vid 181, test_df1 gives an incorrect result, where the last tuple's oid and fid are incorrectly set to 0. However, test_df2 gives the correct result.

diff1:
[181, 232, 352]

test_df1:
   oid  vid  fid     shape  color material     x1     y1     x2     y2
0    0  181    0    sphere   blue   rubber  321.0   73.0  355.0  108.0
1    1  181    0      cube  brown   rubber  114.0   72.0  154.0  116.0
2    2  181    0  cylinder  green   rubber  138.0    1.0  161.0   18.0
3    3  181    0    sphere    red    metal  112.0  109.0  152.0  149.0
4    0  181    0    sphere    red    metal  335.0   77.0  371.0  112.0

test_df2:
        oid  vid  fid     shape  color material     x1     y1     x2     y2
110051    0  181    0    sphere   blue   rubber  321.0   73.0  355.0  108.0
110052    1  181    0      cube  brown   rubber  114.0   72.0  154.0  116.0
110053    2  181    0  cylinder  green   rubber  138.0    1.0  161.0   18.0
110054    3  181    0    sphere    red    metal  112.0  109.0  152.0  149.0

OS:

macos (and also linux)

DuckDB Version:

0.9.2 and 0.9.3-dev2705

DuckDB Client:

Python

Full Name:

Enhao Zhang

Affiliation:

University of Washington

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions