Description
Feature request
It'd be great to have a lazy push to the Hub, similar to the lazy loading we have with IterableDataset.
Suppose you'd like to filter LAION based on certain conditions, but since LAION doesn't fit on your disk, you'd like to leverage streaming:
from datasets import load_dataset
dataset = load_dataset("laion/laion400m", streaming=True, split="train")
Then you could filter the dataset based on certain conditions:
filtered_dataset = dataset.filter(lambda example: example['HEIGHT'] > 400)
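The key property here is that filter on an IterableDataset is lazy: no row is read or filtered until the dataset is iterated. For illustration, the same pattern can be mimicked with a plain generator pipeline (the HEIGHT values below are hypothetical stand-ins for LAION rows, not real data):

```python
# Minimal pure-Python sketch of the lazy-filter pattern: nothing is
# read or filtered until the iterator is actually consumed.
def stream_examples():
    # hypothetical stand-in for the streamed LAION rows
    for height in [200, 450, 300, 800]:
        yield {"HEIGHT": height}

# building the filtered pipeline does no work yet
filtered = (ex for ex in stream_examples() if ex["HEIGHT"] > 400)

# rows are produced only at this point
results = list(filtered)
```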
In order to persist this dataset and push it back to the Hub, one currently needs to materialize the entire filtered dataset on disk first and then push it:
from datasets import Dataset
Dataset.from_generator(filtered_dataset.__iter__).push_to_hub(...)
It would be great if we could instead lazily push the data to the Hub (basically stream the data to the Hub), without being limited by our disk size:
filtered_dataset.push_to_hub("my-filtered-dataset")
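One way such a streamed push could work under the hood is to buffer the stream into fixed-size shards and upload each shard (e.g. as a Parquet file) before reading the next one, so memory and disk usage stay bounded by the shard size rather than the dataset size. A rough sketch of the sharding part (`iter_shards` is a hypothetical helper, not part of the datasets API):

```python
from itertools import islice

def iter_shards(examples, shard_size):
    """Yield the stream as fixed-size lists so only one shard is held
    in memory at a time (hypothetical helper, not a datasets API)."""
    it = iter(examples)
    while True:
        shard = list(islice(it, shard_size))
        if not shard:
            return
        yield shard

# each shard could be written and uploaded before the next one is read
shards = list(iter_shards(range(7), shard_size=3))
```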
Motivation
This feature would be very useful for people who want to filter huge datasets without having to materialize the entire dataset, or a filtered version of it, on their local disk.
Your contribution
Happy to test out a PR :)