
@lnkuiper (Contributor) commented May 12, 2023

This PR improves the read_json functionality, fixes bugs, and changes the API (slightly).

API Changes

Whereas we first had the parameters lines and json_format, we now have records and format. This allows for slightly more kinds of JSON to be read.

The records parameter can be one of ['auto', 'true', 'false'].
The format parameter can be one of ['auto', 'unstructured', 'newline_delimited', 'array'].

Combinations of these allow most kinds of JSON to be read.
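For example, a minimal sketch combining the two parameters (the file name my.ndjson is hypothetical):

-- read a hypothetical NDJSON file of objects, unpacking keys to columns
select * from read_json('my.ndjson', format='newline_delimited', records='true');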

Records

If the elements in the JSON you are trying to read are "records", i.e., JSON objects where the key/values should be unpacked to columns, then records should be set to true. However, if you prefer to read the JSON objects as DuckDB STRUCTs, then records can be set to false instead.

If the elements are non-records, e.g., JSON arrays, strings, etc., then records should be set to false.

In general, however, this parameter can be auto-detected.
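As a sketch of the difference, assuming a hypothetical objects.ndjson whose lines are JSON objects like {"a": 1, "b": 2}:

-- records='true': keys are unpacked into columns a and b
select * from read_json('objects.ndjson', records='true');
-- records='false': each object becomes a single STRUCT value per row
select * from read_json('objects.ndjson', records='false');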

Format

If the elements are separated by a newline, i.e., NDJSON, then format should be set to 'newline_delimited' (or 'nd' for short), or use read_ndjson instead.

If they are contained in a JSON array, then format should be set to 'array'. Support for reading JSON arrays has improved greatly, as we can now do streaming reads of these, i.e., we do not require the whole array to fit in a buffer, just individual array elements.

If your elements have no real structure, for example, each JSON file contains one or more pretty-printed (containing newlines) JSON object/array/etc., then format should be set to 'unstructured'.
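For example, assuming a hypothetical pretty.json containing a few pretty-printed objects spanning multiple lines:

-- pretty.json is a hypothetical file with multi-line (pretty-printed) objects
select * from read_json('pretty.json', format='unstructured');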

This parameter can also be auto-detected; as a bonus, the detected format can differ across files. For example:

select * from read_json_auto(['newline_delimited_objects.ndjson', 'array_of_objects.json']);

This should work fine, as long as the objects have the same schema.

MultiFileReader Integration

This PR also integrates the JSON reader with the MultiFileReader introduced in #6912, which adds the filename, union_by_name, and hive_partitioning parameters.

So, if you have different schemas and different formats, you can still read them just fine with:

select * from read_json_auto(['newline_delimited_objects.ndjson', 'array_of_objects.json'], union_by_name=true);

This works because union_by_name forces DuckDB to sample data from each of the files.

We now also support hive-partitioned JSON reads, but not yet writes.
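For example, assuming a hypothetical layout like dataset/year=2023/month=5/file.json, a partitioned read could look like this:

-- glob over a hypothetical year=/month= directory layout;
-- partition keys become columns, filename=true adds each row's source file
select * from read_json_auto('dataset/*/*/*.json', hive_partitioning=true, filename=true);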

COPY

We now also support creating JSON array files with the COPY statement (we only supported NDJSON before).

For example:

copy (select * from range(5)) to 'my.json' (ARRAY TRUE);

This will create the following file:

[
	{"range":0},
	{"range":1},
	{"range":2},
	{"range":3},
	{"range":4}
]

It can then be read back like so:

create table test (range bigint);
copy test from 'my.json' (ARRAY TRUE);
copy test from 'my.json' (AUTO_DETECT TRUE);
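The new format parameter can also read the array file back directly, without creating a table first; records='true' mirrors what auto-detection would infer for this file:

-- my.json is the array file written by the COPY above
select * from read_json('my.json', format='array', records='true');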

Bugfixes

This PR also fixes bugs that have accumulated since the JSON reader was released. I've also improved a ton of error messages all over the place, and increased the default values of the sample_size and maximum_object_size parameters, so that JSON can be read easily without setting parameters.
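Both parameters can still be set explicitly when a file needs it; a sketch with illustrative values (the file name large.json and the numbers are placeholders, not the new defaults):

-- illustrative values for a hypothetical large file, not the new defaults
select * from read_json_auto('large.json', sample_size=100000, maximum_object_size=33554432);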

This PR fixes the following issues:

  1. Default to the JSON type in auto-detection when we see only NULL values (fixes "read_json JSON transform error (related to sampling?)" #7448)
  2. JSON null now always maps to our NULL, even though null is valid JSON; this follows Postgres behavior (fixes "Extracting a key that is null in json returns a non-null value" #6779)
  3. When forcing columns (e.g., columns={"ts": "TIMESTAMP[]"}), we can cast to timestamp without a specifier (fixes "Parse a datetime with optional fractional seconds from json file" #6774; see the sketch after this list)
  4. Reader improvements in this PR (plus cast improvements in "Implement JSON <-> Nested types casting" #7366) fix several further issues.
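A minimal sketch of the columns override from item 3, assuming a hypothetical events.ndjson whose ts field holds arrays of timestamp strings:

-- force the ts field of a hypothetical file to be cast to a list of timestamps
select * from read_json('events.ndjson', columns={'ts': 'TIMESTAMP[]'});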

@Mytherin (Collaborator) commented:

Had a go at fixing the CI failures and added some more tests for JSON writing. The CI failure was relatively minor (two newlines being written instead of one for empty CSV files), and the new tests I added all passed, with one exception: the following file containing an empty JSON array can no longer be parsed:

[
	
]

This results in the following error:

Error: Invalid Input Error: Malformed JSON in file "empty.json", at byte 1 in record/value 2: unexpected character. 

v0.7.1 parses it correctly:

duckdb -c "FROM 'empty.json'"
┌────────┐
│  json  │
│ int32  │
├────────┤
│ 0 rows │
└────────┘
