Skip to content

read_json JSON transform error (related to sampling?) #7448

@liquidcarbon

Description

@liquidcarbon

What happens?

I have a large, nested JSON where one of the fields is mostly null, except for a handful of lines.

cat metadata.json_part2 | grep ResultFlags | sort | uniq -c

      1             "ResultFlags": "MicrofluidicsRegulatedVacuumError",
      7             "ResultFlags": "PressureError",
     18             "ResultFlags": "TemperatureError",
  47973             "ResultFlags": null,

An error is thrown when the duckdb parser encounters this field. I can bypass it using ignore_errors=1 but then I don't know what else could be wrong.

Would you consider this appropriate behavior or a bug?
Why does it say in the error message that it's trying to cast to numerical?

SELECT
...
FROM read_json_auto('metadata.json_part2', ignore_errors=0)
InvalidInputException                     Traceback (most recent call last)
File <timed exec>:7

InvalidInputException: Invalid Input Error: JSON transform error in file "E:/output\metadata.json_part2", in object 2286: Failed to cast value to numerical: "MicrofluidicsRegulatedVacuumError"
Try increasing 'sample_size', reducing 'maximum_depth', specifying 'columns' manually, specifying 'lines' or 'json_format' manually, or setting 'ignore_errors' to true.

To Reproduce

It doesn't reproduce with a small file:

{"Results": {"ResultFlags": null}}
{"Results": {"ResultFlags": "notnull"}}
{"Results": {"ResultFlags": null}}
SELECT Results.ResultFlags
FROM read_json_auto('test_null_then_filled.json', ignore_errors=0)

returns expected result.

So I'm not sure how to proceed here.

OS:

Windows (pyenv-win) and Ubuntu (WSL, pyenv)

DuckDB Version:

0.7.1

DuckDB Client:

python

Full Name:

Alex Kislukhin

Affiliation:

self

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions