A bug of Dataset.to_json() function #7037

@LinglingGreat

Describe the bug

Calling Dataset.to_json() with lines=False produces a file that cannot be read back. The output should be a single JSON list, but the file actually contains several lists written one after another, so parsing it raises an error.
The cause is that to_json() writes the file in segments, one per batch of batch_size rows. This is harmless when lines=True, but with lines=False each segment is written as its own list, so any dataset with len(dataset) > batch_size yields multiple lists.
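
The mechanism can be illustrated without the datasets library. The sketch below is not the library's code: write_in_batches is a hypothetical helper that serializes each batch as its own JSON list and concatenates the results, which is the behavior described above.

```python
import json


def write_in_batches(records, batch_size):
    """Hypothetical helper: serialize each batch as its own JSON list, then concatenate."""
    chunks = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        chunks.append(json.dumps(batch, indent=2) + "\n")
    return "".join(chunks)


records = [{"id": i} for i in range(5)]

# One batch covers every record, so the output is a single valid list.
single = write_in_batches(records, batch_size=10)
assert json.loads(single) == records

# Three batches produce three lists back to back, which is not valid JSON.
multiple = write_in_batches(records, batch_size=2)
error = None
try:
    json.loads(multiple)
except json.JSONDecodeError as exc:
    error = exc
    print(exc.msg)
```

The parser accepts the first list and then reports the remaining lists as extra data, which matches the error in the reproduction below.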

Steps to reproduce the bug

Run this code:

from datasets import load_dataset
import json

train_dataset = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base")["train"]
output_path = "./harmless-base_hftojs.json"
print(len(train_dataset))
train_dataset.to_json(output_path, lines=False, force_ascii=False, indent=2)

with open(output_path, encoding="utf-8") as f:
    data = json.loads(f.read())

It raises an error: json.decoder.JSONDecodeError: Extra data: line 4003 column 1 (char 1373709)

Extra square brackets appear in the output file:
[image]
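
A file that is already in this state can probably still be recovered. The sketch below assumes the file consists only of JSON lists written back to back, separated by whitespace; load_concatenated_lists is a hypothetical helper name, and it relies on the standard library's json.JSONDecoder.raw_decode to parse one list at a time.

```python
import json


def load_concatenated_lists(text):
    """Hypothetical helper: merge several back-to-back JSON lists into one Python list."""
    decoder = json.JSONDecoder()
    rows = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        if not isinstance(value, list):
            raise ValueError("expected a JSON list at each top-level position")
        rows.extend(value)
    return rows


broken = '[\n  {"a": 1},\n  {"a": 2}\n]\n[\n  {"a": 3}\n]\n'
recovered = load_concatenated_lists(broken)
```

In the reproduction above, this would be called as load_concatenated_lists(f.read()) in place of json.loads(f.read()).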

Expected behavior

The code runs without error: with lines=False, the output file contains a single JSON list that json.loads() can parse.
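
For comparison, one way a batched writer could still emit a single list is to write the opening and closing brackets once and separate records with commas across batch boundaries. This is only a sketch of the expected output shape, not the library's implementation or a proposed patch; write_single_list and its arguments are hypothetical.

```python
import io
import json


def write_single_list(batches, fp):
    """Hypothetical helper: write every record from an iterable of batches as one JSON list."""
    fp.write("[")
    first = True
    for batch in batches:
        for record in batch:
            # A comma goes before every record except the first, even across batches.
            fp.write("\n" if first else ",\n")
            fp.write(json.dumps(record, ensure_ascii=False))
            first = False
    fp.write("\n]\n")


buffer = io.StringIO()
write_single_list([[{"id": 0}, {"id": 1}], [{"id": 2}]], buffer)
result = json.loads(buffer.getvalue())
```

Because the brackets are written outside the batch loop, the number of batches no longer affects whether the file is valid JSON.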

Environment info

datasets=2.20.0

Labels: bug
