Description
Feature request
It'd be great to have a lazy push to the Hub, similar to the lazy loading we have with IterableDataset.
Suppose you'd like to filter LAION based on certain conditions, but since LAION doesn't fit on your disk, you'd like to leverage streaming:
from datasets import load_dataset
dataset = load_dataset("laion/laion400m", streaming=True, split="train")
Then you could filter the dataset based on certain conditions:
filtered_dataset = dataset.filter(lambda example: example['HEIGHT'] > 400)
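The key property here is that filter on an IterableDataset is lazy: no row is read or filtered until the dataset is iterated. For illustration, the same pattern can be mimicked with a plain generator pipeline (the HEIGHT values below are hypothetical stand-ins for LAION rows, not real data):

```python
# Minimal pure-Python sketch of the lazy-filter pattern: nothing is
# read or filtered until the iterator is actually consumed.
def stream_examples():
    # hypothetical stand-in for the streamed LAION rows
    for height in [200, 450, 300, 800]:
        yield {"HEIGHT": height}

# building the filtered pipeline does no work yet
filtered = (ex for ex in stream_examples() if ex["HEIGHT"] > 400)

# rows are produced only at this point
results = list(filtered)
```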
In order to persist this dataset and push it back to the Hub, one currently needs to materialize the entire filtered dataset on disk first and then push it:
from datasets import Dataset
Dataset.from_generator(filtered_dataset.__iter__).push_to_hub(...)
It would be great if we could instead lazily push the data to the Hub (basically stream the data to the Hub), without being limited by our disk size:
filtered_dataset.push_to_hub("my-filtered-dataset")
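One way such a streamed push could work under the hood is to buffer the stream into fixed-size shards and upload each shard (e.g. as a Parquet file) before reading the next one, so memory and disk usage stay bounded by the shard size rather than the dataset size. A rough sketch of the sharding part (`iter_shards` is a hypothetical helper, not part of the datasets API):

```python
from itertools import islice

def iter_shards(examples, shard_size):
    """Yield the stream as fixed-size lists so only one shard is held
    in memory at a time (hypothetical helper, not a datasets API)."""
    it = iter(examples)
    while True:
        shard = list(islice(it, shard_size))
        if not shard:
            return
        yield shard

# each shard could be written and uploaded before the next one is read
shards = list(iter_shards(range(7), shard_size=3))
```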
Motivation
This feature would be very useful for people who want to filter huge datasets without having to materialize the entire dataset, or a filtered version of it, on their local disk.
Your contribution
Happy to test out a PR :)