
Feature request: IterableDataset.push_to_hub #5665

@NielsRogge

Feature request

It'd be great to have a lazy push to hub, similar to the lazy loading we have with IterableDataset.

Suppose you'd like to filter LAION based on certain conditions, but since LAION doesn't fit on your disk, you'd like to leverage streaming:

from datasets import load_dataset

dataset = load_dataset("laion/laion400m", streaming=True, split="train")

Then you could filter the dataset based on certain conditions:

filtered_dataset = dataset.filter(lambda example: example['HEIGHT'] > 400)

To persist this dataset and push it back to the hub, one currently needs to first write the entire filtered dataset to disk and then push:

from datasets import Dataset

Dataset.from_generator(filtered_dataset.__iter__).push_to_hub(...)

It would be great if we could instead lazily push the data to the hub (basically stream the data to the hub), without being limited by our disk size:

filtered_dataset.push_to_hub("my-filtered-dataset")
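Until something like this exists, one possible workaround is to push the stream in fixed-size shards, so that at most one shard is on disk at any time. The following is only a rough sketch under stated assumptions, not how `push_to_hub` works internally: `iter_shards`, `push_in_shards` and `make_parquet_uploader` are hypothetical helper names, the `data/train-XXXXX.parquet` file layout is an assumption, and the uploader relies on `Dataset.from_list`, `Dataset.to_parquet`, `HfApi.create_repo` and `HfApi.upload_file`.

```python
import itertools
import os
import tempfile
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def iter_shards(examples: Iterable[Row], rows_per_shard: int) -> Iterator[List[Row]]:
    """Yield lists of at most `rows_per_shard` examples, reading lazily."""
    if rows_per_shard <= 0:
        raise ValueError("rows_per_shard must be positive")
    iterator = iter(examples)
    while True:
        # Only one shard worth of examples is pulled from the stream at a time.
        shard = list(itertools.islice(iterator, rows_per_shard))
        if not shard:
            return
        yield shard


def push_in_shards(
    examples: Iterable[Row],
    rows_per_shard: int,
    upload_shard: Callable[[List[Row], int], None],
) -> int:
    """Pass each shard to `upload_shard(rows, index)`; return the shard count."""
    count = 0
    for index, shard in enumerate(iter_shards(examples, rows_per_shard)):
        upload_shard(shard, index)
        count += 1
    return count


def make_parquet_uploader(repo_id: str) -> Callable[[List[Row], int], None]:
    """Hypothetical uploader: one Parquet file per shard, deleted after upload."""
    # Imported lazily so the sharding helpers above work without these packages.
    from datasets import Dataset
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

    def upload_shard(rows: List[Row], index: int) -> None:
        with tempfile.TemporaryDirectory() as tmp_dir:
            path = os.path.join(tmp_dir, f"shard-{index:05d}.parquet")
            Dataset.from_list(rows).to_parquet(path)
            api.upload_file(
                path_or_fileobj=path,
                path_in_repo=f"data/train-{index:05d}.parquet",  # assumed layout
                repo_id=repo_id,
                repo_type="dataset",
            )

    return upload_shard
```

With the filtered stream from above, this could be called as `push_in_shards(filtered_dataset, 100_000, make_parquet_uploader("my-filtered-dataset"))`. Each shard is held in memory while it is built, so the shard size would need tuning for image-heavy rows, and a native `IterableDataset.push_to_hub` would still be much nicer than this.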

Motivation

This feature would be very useful for people who want to filter huge datasets without having to store the entire dataset, or a filtered version of it, on their local disk.

Your contribution

Happy to test out a PR :)

Labels: enhancement
