-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Description
What happens?
For a query involving a string column with NULLs, on a relatively large DataFrame (3.2 million rows), I receive the following error:
InvalidInputException: Invalid Input Error: Failed to cast value: Unimplemented type for cast (VARCHAR -> NULL)
The query completes successfully when the dataset is slightly smaller.
To Reproduce
Load an initial 3201 row dataset and remove all but one column (Major_Genre
). This is a VARCHAR column that contains NULL values.
import pandas as pd
import duckdb
# Load movies dataset using pandas and remove all but one string column
movies = pd.read_json('https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json')
movies = movies[["Major_Genre"]]
duckdb.query("SELECT * from movies")
┌───────────────────┐
│ Major_Genre │
│ varchar │
├───────────────────┤
│ NULL │
│ Drama │
│ Comedy │
│ Comedy │
│ Drama │
│ NULL │
│ NULL │
│ Comedy │
│ NULL │
│ NULL │
│ · │
│ · │
│ · │
│ Drama │
│ Adventure │
│ Comedy │
│ Comedy │
│ Comedy │
│ Comedy │
│ Thriller/Suspense │
│ Adventure │
│ Adventure │
│ Adventure │
├───────────────────┤
│ 3201 rows │
│ (20 shown) │
└───────────────────┘
Duplicate and concat the original dataset n
times to create a larger dataset, then count the number of each unique value of the Major_Genera column.
When n is 999, the total number of rows of the dataset is 3197799 and the query completes as expected.
n = 999
movies_medium = pd.concat([movies] * n, axis=0).reset_index()
print(len(movies_medium))
duckdb.query("SELECT Major_Genre, COUNT(*) from movies_medium GROUP BY Major_Genre")
3197799
┌─────────────────────┬──────────────┐
│ Major_Genre │ count_star() │
│ varchar │ int64 │
├─────────────────────┼──────────────┤
│ Action │ 419580 │
│ Drama │ 788211 │
│ NULL │ 274725 │
│ Comedy │ 674325 │
│ Black Comedy │ 35964 │
│ Adventure │ 273726 │
│ Thriller/Suspense │ 238761 │
│ Musical │ 52947 │
│ Romantic Comedy │ 136863 │
│ Horror │ 218781 │
│ Western │ 35964 │
│ Documentary │ 42957 │
│ Concert/Performance │ 4995 │
├─────────────────────┴──────────────┤
│ 13 rows 2 columns │
└────────────────────────────────────┘
When n is 1000, the total number of rows of the dataset is 3201000 and the query raises an error:
n = 1000
movies_medium = pd.concat([movies] * n, axis=0).reset_index()
print(len(movies_medium))
duckdb.query("SELECT Major_Genre, COUNT(*) from movies_medium GROUP BY Major_Genre")
3201000
---------------------------------------------------------------------------
InvalidInputException Traceback (most recent call last)
...
InvalidInputException: Invalid Input Error: Failed to cast value: Unimplemented type for cast (VARCHAR -> NULL)
If nulls are removed, the query works for even much larger datasets
n = 2000
movies_medium = pd.concat([movies] * n, axis=0).reset_index().dropna()
print(len(movies_medium))
duckdb.query("SELECT Major_Genre, COUNT(*) from movies_medium GROUP BY Major_Genre")
5852000
┌─────────────────────┬──────────────┐
│ Major_Genre │ count_star() │
│ varchar │ int64 │
├─────────────────────┼──────────────┤
│ Thriller/Suspense │ 478000 │
│ Drama │ 1578000 │
│ Horror │ 438000 │
│ Adventure │ 548000 │
│ Action │ 840000 │
│ Romantic Comedy │ 274000 │
│ Comedy │ 1350000 │
│ Documentary │ 86000 │
│ Black Comedy │ 72000 │
│ Musical │ 106000 │
│ Western │ 72000 │
│ Concert/Performance │ 10000 │
├─────────────────────┴──────────────┤
│ 12 rows 2 columns │
└────────────────────────────────────┘
OS:
macOS Ventura 13.2.1
DuckDB Version:
0.7.2-dev586. The error is present in 0.7.0 and beyond, but there is no error in 0.6.0.
DuckDB Client:
Python
Full Name:
Jon Mease
Affiliation:
VegaFusion / Hex
Have you tried this on the latest master
branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree