Skip to content

Unimplemented type for cast (VARCHAR -> NULL) error for aggregation on large string column #6669

@jonmmease

Description

@jonmmease

What happens?

For a query involving a string column with NULLs, on a relatively large DataFrame (3.2 million rows), I receive the following error:

InvalidInputException: Invalid Input Error: Failed to cast value: Unimplemented type for cast (VARCHAR -> NULL)

The query completes successfully when the dataset is slightly smaller.

To Reproduce

Load an initial 3201 row dataset and remove all but one column (Major_Genre). This is a VARCHAR column that contains NULL values.

import pandas as pd
import duckdb

# Load movies dataset using pandas and remove all but one string column
movies = pd.read_json('https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json')
movies = movies[["Major_Genre"]]
duckdb.query("SELECT * from movies")
┌───────────────────┐
│    Major_Genre    │
│      varchar      │
├───────────────────┤
│ NULL              │
│ Drama             │
│ Comedy            │
│ Comedy            │
│ Drama             │
│ NULL              │
│ NULL              │
│ Comedy            │
│ NULL              │
│ NULL              │
│  ·                │
│  ·                │
│  ·                │
│ Drama             │
│ Adventure         │
│ Comedy            │
│ Comedy            │
│ Comedy            │
│ Comedy            │
│ Thriller/Suspense │
│ Adventure         │
│ Adventure         │
│ Adventure         │
├───────────────────┤
│     3201 rows     │
│    (20 shown)     │
└───────────────────┘

Duplicate and concat the original dataset n times to create a larger dataset, then count the number of each unique value of the Major_Genera column.

When n is 999, the total number of rows of the dataset is 3197799 and the query completes as expected.

n = 999
movies_medium = pd.concat([movies] * n, axis=0).reset_index()
print(len(movies_medium))
duckdb.query("SELECT Major_Genre, COUNT(*) from movies_medium GROUP BY Major_Genre")
3197799
┌─────────────────────┬──────────────┐
│     Major_Genre     │ count_star() │
│       varchar       │    int64     │
├─────────────────────┼──────────────┤
│ Action              │       419580 │
│ Drama               │       788211 │
│ NULL                │       274725 │
│ Comedy              │       674325 │
│ Black Comedy        │        35964 │
│ Adventure           │       273726 │
│ Thriller/Suspense   │       238761 │
│ Musical             │        52947 │
│ Romantic Comedy     │       136863 │
│ Horror              │       218781 │
│ Western             │        35964 │
│ Documentary         │        42957 │
│ Concert/Performance │         4995 │
├─────────────────────┴──────────────┤
│ 13 rows                  2 columns │
└────────────────────────────────────┘

When n is 1000, the total number of rows of the dataset is 3201000 and the query raises an error:

n = 1000
movies_medium = pd.concat([movies] * n, axis=0).reset_index()
print(len(movies_medium))
duckdb.query("SELECT Major_Genre, COUNT(*) from movies_medium GROUP BY Major_Genre")
3201000
---------------------------------------------------------------------------
InvalidInputException                     Traceback (most recent call last)
...
InvalidInputException: Invalid Input Error: Failed to cast value: Unimplemented type for cast (VARCHAR -> NULL)

If nulls are removed, the query works for even much larger datasets

n = 2000
movies_medium = pd.concat([movies] * n, axis=0).reset_index().dropna()
print(len(movies_medium))
duckdb.query("SELECT Major_Genre, COUNT(*) from movies_medium GROUP BY Major_Genre")
5852000
┌─────────────────────┬──────────────┐
│     Major_Genre     │ count_star() │
│       varchar       │    int64     │
├─────────────────────┼──────────────┤
│ Thriller/Suspense   │       478000 │
│ Drama               │      1578000 │
│ Horror              │       438000 │
│ Adventure           │       548000 │
│ Action              │       840000 │
│ Romantic Comedy     │       274000 │
│ Comedy              │      1350000 │
│ Documentary         │        86000 │
│ Black Comedy        │        72000 │
│ Musical             │       106000 │
│ Western             │        72000 │
│ Concert/Performance │        10000 │
├─────────────────────┴──────────────┤
│ 12 rows                  2 columns │
└────────────────────────────────────┘

OS:

macOS Ventura 13.2.1

DuckDB Version:

0.7.2-dev586. The error is present in 0.7.0 and beyond, but there is no error in 0.6.0.

DuckDB Client:

Python

Full Name:

Jon Mease

Affiliation:

VegaFusion / Hex

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions