JSON reader improvements/fixes #7478
Had a go at fixing the CI failures and added some more tests for JSON writing. The CI failure was relatively minor (two newlines being written instead of one for empty CSV files), and the new tests I added all passed, with one exception: the following empty file can no longer be parsed:

[
]

This results in the following error:

v0.7.1 parses it correctly:
…ile tests + re-enable previously skipped tests
…e into an optional_ptr
…hen no bytes have been written (this can happen in case of a single NULL row)
This PR improves the `read_json` functionality, fixes bugs, and changes the API (slightly).

API Changes

Whereas we first had the parameters `lines` and `json_format`, we now have `records` and `format`. This allows for slightly more kinds of JSON to be read. Parameter `records` can be one of `['auto', 'true', 'false']`. Parameter `format` can be any of `['auto', 'unstructured', 'newline_delimited', 'array']`. Combinations of these allow most kinds of JSON to be read.
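A quick sketch of the new call signature (the file name here is hypothetical; both parameters default to `'auto'`):

```sql
-- Equivalent to plain read_json('my_file.json'), since 'auto' is the default for both.
SELECT * FROM read_json('my_file.json', records = 'auto', format = 'auto');
```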
Records

If the elements in the JSON you are trying to read are "records", i.e., JSON objects whose key/value pairs should be unpacked into columns, then `records` should be set to `true`. However, if you prefer to read the JSON objects as DuckDB `STRUCT`s, then `records` can be set to `false` instead. If the elements are non-records, e.g., JSON arrays, strings, etc., then `records` should be set to `false`. In general, however, this parameter can be auto-detected.
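As a sketch, assuming a hypothetical file `records.json` whose lines are objects like `{"duck": 42}`:

```sql
-- Unpack object keys into columns: the result has a column "duck".
SELECT * FROM read_json('records.json', records = 'true');

-- Keep each object whole: the result has a single STRUCT column instead.
SELECT * FROM read_json('records.json', records = 'false');
```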
Format

If these elements are separated by a newline, i.e., NDJSON, then `format` should be set to `'newline_delimited'` (or `'nd'` for short, or use `read_ndjson` instead). If they are contained in a JSON array, then `format` should be set to `'array'`. Support for reading JSON arrays has improved greatly, as we can now do streaming reads of these, i.e., we do not require the whole array to fit in a buffer, just the individual array elements. If your elements have no real structure, for example, each JSON file contains one or more pretty-printed (containing newlines) JSON objects/arrays/etc., then `format` should be set to `'unstructured'`. This parameter can also be auto-detected; as a bonus, the detected format can differ across files. For example:
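A hypothetical setup: `records1.json` is newline-delimited while `records2.json` contains a JSON array, and both are read in one call with the format detected per file:

```sql
-- records1.json:  {"duck": 42}
--                  {"duck": 43}
-- records2.json:  [{"duck": 44}, {"duck": 45}]
SELECT * FROM read_json(['records1.json', 'records2.json']);
```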
This should work fine, as long as the objects have the same schema.
MultiFileReader Integration

This PR also integrates the JSON reader with the `MultiFileReader` introduced in #6912, which adds the `filename`, `union_by_name`, and `hive_partitioning` parameters. So, if you have different schemas and different formats, you can still read them just fine with:
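A sketch under the assumption that `union_by_name` is the relevant parameter here (the file names are made up):

```sql
-- Files may differ in both schema and format (NDJSON, array, ...).
SELECT * FROM read_json(['a.json', 'b.json'], union_by_name = true);
```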
This will force DuckDB to sample data from each of the files.

We now also support hive-partitioned JSON reads, but not yet writes.
COPY

We now also support creating JSON array files with the `COPY` statement (we only supported NDJSON before). The sketch below shows an example `COPY` statement, the file it creates, and how that file can be read back.
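A hedged illustration; the table contents, file name, and the `ARRAY` copy option are assumptions for the example rather than text taken from the PR:

```sql
-- Write the query result as a JSON array file instead of NDJSON.
COPY (SELECT 42 AS duck) TO 'my.json' (FORMAT json, ARRAY true);
```

which would create a file along the lines of:

```json
[
	{"duck":42}
]
```

and which can be read back with:

```sql
SELECT * FROM read_json('my.json', format = 'array');
```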
Bugfixes

This PR also fixes bugs that have accumulated since the JSON reader was released. I've also improved a ton of error messages all over the place, as well as increased the `sample_size` and `maximum_object_size` parameter defaults, to allow JSON to be read easily without setting parameters. This PR fixes the following issues:
- `JSON` type in auto-detection when we see only `NULL` (read_json JSON transform error (related to sampling?) #7448)
- `null` now always goes to our `NULL`, even though `null` is valid JSON (follows Postgres behavior) (Extracting a key that is null in json returns a non-null value #6779)
- With `columns={"ts": "TIMESTAMP[]"}`, we can cast to timestamp without a specifier (Parse a datetime with optional fractional seconds from json file #6774)
- read_json with nested STRUCT data types yields unexpected end of data (#7015)
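For the timestamp fix, a minimal sketch of the kind of call involved (the file name and contents are hypothetical, and the `columns` key is written as a DuckDB struct literal):

```sql
-- Hypothetical ts.json:  {"ts": ["2023-05-15 10:56:00.123", "2023-05-15 10:57:00"]}
-- With an explicit column type, the strings are cast to TIMESTAMP,
-- with or without fractional seconds, and no format specifier is needed.
SELECT * FROM read_json('ts.json', columns = {'ts': 'TIMESTAMP[]'});
```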