Add `IterableDataset.push_to_hub()` #7595

lhoestq · 2025-06-05T15:29:32Z

Basic implementation: it writes one shard per input dataset shard.
To be improved later.

PS: for image/audio datasets structured as actual image/audio files (not parquet), you can sometimes speed it up with `ds.decode(num_threads=...).push_to_hub(...)`
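For illustration, here is a minimal sketch of how the call chain from the PS might be wrapped. The helper name `push_stream`, the repo ids, and the thread count are hypothetical and not part of this PR; the only library calls assumed are `load_dataset(..., streaming=True)`, `IterableDataset.decode(num_threads=...)`, and the `IterableDataset.push_to_hub()` method this PR adds.

```python
def push_stream(ds, repo_id, num_threads=0):
    """Push a streaming dataset to the Hub (hypothetical helper, not part of this PR).

    `ds` is expected to behave like a `datasets.IterableDataset`.
    When `num_threads` is positive, media decoding is moved to threads
    first, which can sometimes help when the source is made of actual
    image/audio files rather than parquet.
    """
    if num_threads > 0:
        # Decode images/audio with a pool of threads before uploading.
        ds = ds.decode(num_threads=num_threads)
    # Per the PR description, one shard is written per input dataset shard.
    return ds.push_to_hub(repo_id)


def example_usage():
    """Illustrative only: needs the `datasets` package, network access, and Hub credentials."""
    from datasets import load_dataset

    # Both repo ids below are placeholders.
    stream = load_dataset("username/source-images", split="train", streaming=True)
    push_stream(stream, "username/target-images", num_threads=8)
```

`example_usage()` is deliberately not called here, since it would upload data; whether threaded decoding actually speeds things up depends on the dataset.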

HuggingFaceDocBuilderDev · 2025-06-05T15:32:19Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq added 3 commits June 5, 2025 16:22

add to_parquet and push_to_hub

e4046d4

style

28866c5

fix

7895217

lhoestq added 3 commits June 5, 2025 17:58

for datasetdict as well

019175b

update docs

23d0f41

docs

453ebe4

lhoestq marked this pull request as ready for review June 5, 2025 16:40

lhoestq merged commit 11320c3 into main Jun 6, 2025
7 of 15 checks passed

lhoestq deleted the iterable-dataset-push-to-hub branch June 6, 2025 16:12
