Align the Dataset and IterableDataset processing API #3444

## Intro

Items marked like <s>this</s> are already done :)

Currently the two classes have two distinct APIs for processing:

### The `.map()` method

Both share these parameters: function, batched, batch_size

- IterableDataset is missing these parameters:
<s>with_indices</s>, with_rank, <s>input_columns</s>, <s>drop_last_batch</s>, <s>remove_columns</s>, features, disable_nullable, fn_kwargs, num_proc

- Dataset also has additional parameters that are exclusive, due to caching:
keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, suffix_template, new_fingerprint

- <s>There is also an important difference in terms of behavior:
**Dataset.map adds new columns** (with dict.update)
BUT
**IterableDataset discards previous columns** (it overwrites the dict)
IMO the two methods should have the same behavior. This would be an important breaking change though.</s>

- Dataset.map is eager while IterableDataset.map is lazy
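
The column-handling difference described above can be modelled with plain dicts. This is only a minimal sketch, not the library's code: `map_update`, `map_overwrite` and `add_length` are hypothetical helpers that stand in for the two behaviors.

```python
def map_update(example, function):
    """Model of the Dataset.map behavior: keep existing columns and
    merge the function's output into them with dict.update."""
    out = dict(example)
    out.update(function(example))
    return out


def map_overwrite(example, function):
    """Model of the earlier IterableDataset.map behavior: the function's
    output replaces the example, so previous columns are discarded."""
    return dict(function(example))


def add_length(example):
    # Hypothetical processing function that returns only the new column.
    return {"length": len(example["text"])}


example = {"text": "hello", "label": 1}
updated = map_update(example, add_length)
overwritten = map_overwrite(example, add_length)

print(updated)      # {'text': 'hello', 'label': 1, 'length': 5}
print(overwritten)  # {'length': 5}
```

With the same processing function, the first variant keeps `text` and `label`, while the second returns only `length`. This is why aligning the two is a breaking change for streaming users.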

### The `.shuffle()` method

- <s>Both have an optional seed parameter, but IterableDataset requires a mandatory parameter buffer_size to control the size of the local buffer used for approximate shuffling.</s>

- <s>IterableDataset is missing the parameter generator</s>

- Also Dataset has exclusive parameters due to caching: keep_in_memory, load_from_cache_file, indices_cache_file_name, writer_batch_size, new_fingerprint
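
To make the role of buffer_size concrete, here is a rough sketch of one common way to do approximate shuffling over a stream with a local buffer. It is an illustration under my own assumptions, not necessarily how IterableDataset implements it; `buffer_shuffle` is a hypothetical name.

```python
import random


def buffer_shuffle(iterable, buffer_size, seed=None):
    """Approximate shuffle: fill a buffer, then for each new item
    yield a random buffered item and put the new one in its slot."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=42))
print(shuffled)
```

Only buffer_size items are held in memory at a time, so an item can only move a limited distance from its original position. A regular Dataset can shuffle exactly through its indices, which is why it needs no such parameter.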

### The `.with_format()` method

- <s>IterableDataset only supports "torch" (it is missing tf, jax, pandas, arrow)</s> and is missing the parameters: columns, output_all_columns and format_kwargs
- Other methods like `set_format`, `reset_format` or `formatted_as` are also missing

### Other methods

- Both have the same `remove_columns` method
- IterableDataset is missing: <s>cast</s>, <s>cast_column</s>, <s>filter</s>, <s>rename_column</s>, <s>rename_columns</s>, class_encode_column, flatten, train_test_split, <s>shard</s>
- Some other methods are missing but we can discuss them: set_transform, formatted_as, with_transform
- And others don't really make sense for an iterable dataset: select, sort, <s>add_column</s>, add_item
- Dataset is missing skip and take, which IterableDataset implements.
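
For reference, the semantics of skip and take on a stream can be sketched with `itertools.islice`. The `skip`, `take` and `stream` functions below are hypothetical stand-ins written for this illustration, not the library's methods.

```python
from itertools import islice


def skip(iterable, n):
    """Drop the first n examples and yield the rest."""
    return islice(iterable, n, None)


def take(iterable, n):
    """Yield only the first n examples."""
    return islice(iterable, n)


def stream():
    # Hypothetical example stream of 10 rows.
    for i in range(10):
        yield {"id": i}


first_three = list(take(stream(), 3))
after_seven = list(skip(stream(), 7))

print(first_three)  # [{'id': 0}, {'id': 1}, {'id': 2}]
print(after_seven)  # [{'id': 7}, {'id': 8}, {'id': 9}]
```

On a regular Dataset the same result is reachable by selecting a range of indices, so adding skip and take there would mostly be a convenience for code that has to work in both modes.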

## Questions

I think it would be nice to be able to switch between streaming and regular dataset easily, without changing the processing code significantly.

1. What should be aligned between these two APIs, and what shouldn't?

IMO the minimum is to align the main processing methods.

It would mean breaking the current `IterableDataset.map` so that it has the same behavior as `Dataset.map` (add columns with dict.update), and adding multiprocessing as well as the missing parameters. DONE ✅

It would also mean implementing the missing methods: cast, cast_column, filter, rename_column, rename_columns, class_encode_column, flatten, prepare_for_task, train_test_split, shard. WIP 🟠

2. What are the breaking changes for IterableDataset?

The main breaking change would be the change of behavior of `IterableDataset.map`, because currently it discards all the previous columns instead of keeping them. DONE ✅

3. Shall we also make some changes to regular datasets?

I agree the simplest would be to have the exact same methods for both Dataset and IterableDataset. However this is probably not a good idea, because it would prevent users from getting the most out of each of them. That's why we can keep some aspects of regular datasets as they are:
- keep the eager Dataset.map with caching
- keep the with_transform method for lazy processing
- keep Dataset.select (it could also be added to IterableDataset even though it's not recommended)

We could have a completely aligned `map` method if both methods were lazy by default, but this is a very big breaking change so I'm not sure we can consider doing that.

For information, TFDS does lazy map by default, and has an additional `.cache()` method.
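
To illustrate what "lazy map plus an explicit cache" could look like, here is a toy sketch in plain Python. `LazyPipeline` is a hypothetical class written for this issue; it does not reproduce the TFDS or `datasets` implementation.

```python
class LazyPipeline:
    """Hypothetical toy pipeline: map() is lazy, cache() stores results
    the first time they are computed and replays them afterwards."""

    def __init__(self, source_factory):
        # source_factory is a callable returning a fresh iterator.
        self._source_factory = source_factory

    def map(self, function):
        parent = self._source_factory
        # Nothing is computed here; the function runs during iteration.
        return LazyPipeline(lambda: (function(x) for x in parent()))

    def cache(self):
        parent = self._source_factory
        stored = []
        filled = False

        def factory():
            nonlocal filled
            if not filled:
                stored.extend(parent())
                filled = True
            return iter(stored)

        return LazyPipeline(factory)

    def __iter__(self):
        return self._source_factory()


calls = []


def double(x):
    calls.append(x)  # record every call to show when work happens
    return 2 * x


pipeline = LazyPipeline(lambda: iter(range(3))).map(double)
calls_after_map = len(calls)  # still 0: map did not run anything

cached = pipeline.cache()
first = list(cached)
second = list(cached)
calls_after_two_passes = len(calls)  # 3: the second pass reused the cache

print(first, second, calls_after_map, calls_after_two_passes)
```

In this model the cost of `map` is paid on first iteration, and caching is opt-in. The current `Dataset.map` is the opposite: it is eager and caches by default, which is what makes a full alignment such a big breaking change.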

## Opinions?

I'd love to gather some opinions about this here. If the two APIs are more aligned it would be awesome for the examples in `transformers`, and it would create a satisfactory experience for users who want to switch from one mode to the other.

cc @mariosasko @albertvillanova @thomwolf @patrickvonplaten @sgugger 
