
@lnkuiper (Contributor) commented May 12, 2023

This PR improves the read_json functionality, fixes bugs, and changes the API (slightly).

API Changes

Whereas we first had the parameters lines and json_format, we now have records and format. This allows for slightly more kinds of JSON to be read.

The records parameter can be one of ['auto', 'true', 'false'].
The format parameter can be one of ['auto', 'unstructured', 'newline_delimited', 'array'].

Combinations of these allow most kinds of JSON to be read.
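For example, a minimal sketch combining the two parameters (the file name my.ndjson is hypothetical):

-- read a hypothetical NDJSON file of objects, unpacking keys to columns
select * from read_json('my.ndjson', format='newline_delimited', records='true');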

Records

If the elements in the JSON you are trying to read are "records", i.e., JSON objects where the key/values should be unpacked to columns, then records should be set to true. However, if you prefer to read the JSON objects as DuckDB STRUCTs, then records can be set to false instead.

If the elements are non-records, e.g., JSON arrays, strings, etc., then records should be set to false.

In general, however, this parameter can be auto-detected.
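As a sketch of the difference, assuming a hypothetical objects.ndjson whose lines are JSON objects like {"a": 1, "b": 2}:

-- records='true': keys are unpacked into columns a and b
select * from read_json('objects.ndjson', records='true');
-- records='false': each object becomes a single STRUCT value per row
select * from read_json('objects.ndjson', records='false');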

Format

If the elements are separated by a newline, i.e., NDJSON, then format should be set to 'newline_delimited' (or 'nd' for short), or use read_ndjson instead.

If they are contained in a JSON array, then format should be set to 'array'. Support for reading JSON arrays has improved greatly, as we can now do streaming reads of these, i.e., we do not require the whole array to fit in a buffer, just individual array elements.

If your elements have no real structure, for example, each JSON file contains one or more pretty-printed (containing newlines) JSON object/array/etc., then format should be set to 'unstructured'.
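For example, assuming a hypothetical pretty.json containing a few pretty-printed objects spanning multiple lines:

-- pretty.json is a hypothetical file with multi-line (pretty-printed) objects
select * from read_json('pretty.json', format='unstructured');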

This parameter can also be auto-detected; as a bonus, the detected format can differ across files. For example:

select * from read_json_auto(['newline_delimited_objects.ndjson', 'array_of_objects.json']);

This should work fine, as long as the objects have the same schema.

MultiFileReader Integration

This PR also integrates the JSON reader with the MultiFileReader introduced in #6912, which adds the filename, union_by_name, and hive_partitioning parameters.

So, if you have different schemas and different formats, you can still read them just fine with:

select * from read_json_auto(['newline_delimited_objects.ndjson', 'array_of_objects.json'], union_by_name=true);

This works because union_by_name forces DuckDB to sample data from each of the files.

We now also support hive-partitioned JSON reads, but not yet writes.
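For example, assuming a hypothetical layout like dataset/year=2023/month=5/file.json, a partitioned read could look like this:

-- glob over a hypothetical year=/month= directory layout;
-- partition keys become columns, filename=true adds each row's source file
select * from read_json_auto('dataset/*/*/*.json', hive_partitioning=true, filename=true);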

COPY

We now also support creating JSON array files with the COPY statement (we only supported NDJSON before).

For example:

copy (select * from range(5)) to 'my.json' (ARRAY TRUE);

This will create the following file:

[
	{"range":0},
	{"range":1},
	{"range":2},
	{"range":3},
	{"range":4}
]

It can then be read back like so:

create table test (range bigint);
copy test from 'my.json' (ARRAY TRUE);
copy test from 'my.json' (AUTO_DETECT TRUE);
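The new format parameter can also read the array file back directly, without creating a table first; records='true' mirrors what auto-detection would infer for this file:

-- my.json is the array file written by the COPY above
select * from read_json('my.json', format='array', records='true');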

Bugfixes

This PR also fixes bugs that have accumulated since the JSON reader was released. I've also improved a ton of error messages all over the place, and increased the default values of the sample_size and maximum_object_size parameters, so that JSON can be read easily without setting parameters.
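Both parameters can still be set explicitly when a file needs it; a sketch with illustrative values (the file name large.json and the numbers are placeholders, not the new defaults):

-- illustrative values for a hypothetical large file, not the new defaults
select * from read_json_auto('large.json', sample_size=100000, maximum_object_size=33554432);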

This PR fixes the following issues:

  1. Default to the JSON type in auto-detection when we see only NULL values (fixes "read_json JSON transform error (related to sampling?)" #7448)
  2. JSON null now always maps to our NULL, even though null is valid JSON; this follows Postgres behavior (fixes "Extracting a key that is null in json returns a non-null value" #6779)
  3. When forcing columns (e.g., columns={"ts": "TIMESTAMP[]"}), we can cast to timestamp without a specifier (fixes "Parse a datetime with optional fractional seconds from json file" #6774; see the sketch after this list)
  4. Reader improvements in this PR (plus cast improvements in "Implement JSON <-> Nested types casting" #7366) fix several further issues.
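A minimal sketch of the columns override from item 3, assuming a hypothetical events.ndjson whose ts field holds arrays of timestamp strings:

-- force the ts field of a hypothetical file to be cast to a list of timestamps
select * from read_json('events.ndjson', columns={'ts': 'TIMESTAMP[]'});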

@Mytherin (Collaborator) commented:

Had a go at fixing the CI failures and added some more tests for JSON writing. The CI failure was relatively minor (two newlines being written instead of one for empty CSV files), and the new tests I added all passed, with one exception: the following file containing an empty JSON array can no longer be parsed:

[
	
]

This results in the following error:

Error: Invalid Input Error: Malformed JSON in file "empty.json", at byte 1 in record/value 2: unexpected character. 

v0.7.1 parses it correctly:

duckdb -c "FROM 'empty.json'"
┌────────┐
│  json  │
│ int32  │
├────────┤
│ 0 rows │
└────────┘
