Convert Array features to numpy arrays rather than lists by default #7210

@alex-hh

Feature request

It is currently quite easy to cause massive slowdowns when using datasets without being familiar with the underlying data conversions, e.g. by making a poor choice of formatting.

Would it be more user-friendly to set defaults that avoid this as much as possible, e.g. by formatting Array features as numpy arrays rather than Python lists?

Motivation

Default array formatting leads to slow performance: e.g.

import time

import numpy as np
from datasets import Dataset, Features, Array3D

features = Features(**{"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x, 10, 10), dtype=np.float32) for x in [2000, 1000] * 25] for i in range(2)}, features=features)

# Baseline: iterate over the dataset with the default formatting.
t0 = time.time()
for ex in dataset:
    pass
t1 = time.time()

~1.4 s

ds = dataset.to_iterable_dataset()
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

~10s

ds = dataset.with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

~0.04s

ds = dataset.to_iterable_dataset().with_format("numpy")
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()

~0.04s
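One plausible reading of the gap between the default and numpy timings is the cost of materialising every array element as a separate Python object. The sketch below does not reproduce what datasets does internally, which is an assumption on my part; it uses plain numpy only, with the same array shapes as the benchmark above, to compare building nested Python lists against keeping the data as numpy arrays. The helper name `time_conversion` is hypothetical.

```python
import time

import numpy as np


def time_conversion(arrays, convert):
    """Return the seconds spent applying `convert` to every array."""
    t0 = time.perf_counter()
    for arr in arrays:
        convert(arr)
    return time.perf_counter() - t0


# Same shapes as the benchmark above: 50 arrays alternating 2000 and 1000 rows.
arrays = [np.zeros((x, 10, 10), dtype=np.float32) for x in [2000, 1000] * 25]

# Nested Python lists: one Python float object is created per element.
as_lists = time_conversion(arrays, lambda a: a.tolist())

# Staying in numpy: np.asarray returns the input array without copying it.
as_numpy = time_conversion(arrays, lambda a: np.asarray(a))

print(f"to nested lists: {as_lists:.4f}s")
print(f"kept as numpy:   {as_numpy:.4f}s")
```

The absolute numbers depend on the machine and will not match the timings reported above, since no Arrow decoding is involved here; only the relative difference is meant to be illustrative.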

Your contribution

May be able to contribute

Labels: enhancement
