read_json issue in yyjson when reading a very large array_of_values file #6646

@Mytherin

Description

What happens?

When reading a large file with the array_of_values flag and a large maximum_object_size, an address sanitizer error is triggered. Interestingly, this does not seem to happen when using array_of_records.

To Reproduce

Python script to generate the file:

text = '''{
  "event_id": "VARCHAR",
  "user_id": "VARCHAR",
  "action": "VARCHAR",
  "client_time": "VARCHAR",
  "metadata": {
    "argv": [
      "VARCHAR"
    ],
    "dag": {
      "dag_size": "VARCHAR",
      "tasks": {
        "load_oscar": {
          "status": "VARCHAR",
          "type": "VARCHAR",
          "upstream": "JSON",
          "products": {
            "nb": "VARCHAR"
          }
        },
        "load_weather": {
          "status": "VARCHAR",
          "type": "VARCHAR",
          "upstream": "JSON",
          "products": {
            "nb": "VARCHAR"
          }
        },
        "compress": {
          "status": "VARCHAR",
          "type": "VARCHAR",
          "upstream": {
            "load_oscar": "VARCHAR"
          },
          "products": {
            "nb": "VARCHAR"
          }
        }
      }
    }
  },
  "total_runtime": "VARCHAR",
  "python_version": "VARCHAR",
  "version": "VARCHAR",
  "package_name": "VARCHAR",
  "docker_container": "BOOLEAN",
  "cloud": "NULL",
  "email": "NULL",
  "os": "VARCHAR",
  "environment": "VARCHAR",
  "telemetry_version": "VARCHAR",
  "$lib": "VARCHAR",
  "$lib_version": "VARCHAR",
  "$geoip_city_name": "VARCHAR",
  "$geoip_country_name": "VARCHAR",
  "$geoip_country_code": "VARCHAR",
  "$geoip_continent_name": "VARCHAR",
  "$geoip_continent_code": "VARCHAR",
  "$geoip_postal_code": "VARCHAR",
  "$geoip_latitude": "DOUBLE",
  "$geoip_longitude": "DOUBLE",
  "$geoip_time_zone": "VARCHAR",
  "$geoip_subdivision_1_code": "VARCHAR",
  "$geoip_subdivision_1_name": "VARCHAR",
  "$plugins_succeeded": [
    "VARCHAR"
  ],
  "$plugins_failed": [
    "NULL"
  ],
  "$plugins_deferred": [
    "NULL"
  ],
  "$ip": "VARCHAR"
}
'''

# Write 10,000 copies of the object, comma-joined into a single JSON array.
with open('issue.json', 'w') as f:
    f.write('[' + ','.join(text for _ in range(10000)) + ']')

SQL query:

select json_structure(json ->> '$.properties') as structure
from read_json('issue.json', json_format='array_of_values', columns={'json': 'JSON'}, maximum_object_size=104857600)
limit 1;
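
For comparison, per the description above, the same read does not trigger the sanitizer error with array_of_records. A minimal sketch of that variant (not part of the original report; the column list is a hypothetical, abbreviated subset of the generated object's keys):

-- read each array element as a record (one row per object) instead of one JSON value
select event_id, metadata
from read_json('issue.json', json_format='array_of_records',
               columns={'event_id': 'VARCHAR', 'metadata': 'JSON'},
               maximum_object_size=104857600)
limit 1;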

OS:

macOS

DuckDB Version:

Dev

DuckDB Client:

CLI

Full Name:

Mark

Affiliation:

DuckDB Labs

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
