Dataset streaming example not working #17132

@HLasse

System Info

- `transformers` version: 4.18.0
- Platform: Linux-5.4.173.el7-x86_64-with-glibc2.10
- Python version: 3.8.12
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.11.0a0+17540c5 (True)
- Tensorflow version (GPU?): 2.8.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.4.2 (gpu)
- Jax version: 0.3.10
- JaxLib version: 0.3.10
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No

Who can help?

@patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Following the guide in the dataset-streaming example directory to train a model in streaming mode fails with the following error during evaluation-sample collection:

[11:11:16] - INFO - datasets_modules.datasets.oscar.84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2.oscar - generating examples from = https://s3.amazonaws.com/datasets.huggingface.co/oscar/1.0/unshuffled/deduplicated/en/en_part_480.txt.gz
Token indices sequence length is longer than the specified maximum sequence length for this model (1195 > 512). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "./run_mlm_flax_stream.py", line 549, in <module>
    eval_samples = advance_iter_and_group_samples(training_iter, data_args.num_eval_samples, max_seq_length)
  File "./run_mlm_flax_stream.py", line 284, in advance_iter_and_group_samples
    samples = {k: samples[k] + tokenized_samples[k] for k in tokenized_samples.keys()}
  File "./run_mlm_flax_stream.py", line 284, in <dictcomp>
    samples = {k: samples[k] + tokenized_samples[k] for k in tokenized_samples.keys()}
TypeError: can only concatenate list (not "int") to list
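
The traceback shows that at least one value in `tokenized_samples` is an `int` rather than a list. A possible explanation, which is an assumption and not confirmed by the log above, is that a scalar column from the raw dataset (for example an integer `id` field) is still present in each streamed sample next to the tokenizer outputs. The minimal sketch below reproduces the same `TypeError` with a hand-built sample and shows one way to merge only list-valued fields. The helper `merge_sample` and the sample contents are hypothetical and are not taken from `run_mlm_flax_stream.py`.

```python
from collections import defaultdict


def merge_sample(samples, tokenized_sample):
    """Return a new dict with list-valued fields of tokenized_sample appended.

    Scalar fields (e.g. an integer id) are skipped instead of concatenated.
    Hypothetical helper, not part of the example script.
    """
    merged = dict(samples)
    for key, value in tokenized_sample.items():
        if isinstance(value, list):
            merged[key] = merged.get(key, []) + value
    return merged


# Hypothetical streamed sample: "id" stands in for any non-list column
# that was not removed before tokenization (an assumption).
sample = {"id": 7, "input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]}

# Same pattern as line 284 in the traceback: fails on the int-valued key.
samples = defaultdict(list)
try:
    {k: samples[k] + sample[k] for k in sample.keys()}
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Merging only list-valued fields avoids the error.
merged = merge_sample({}, sample)
merged = merge_sample(merged, sample)
```

If the cause is indeed a leftover scalar column, removing that column before grouping, or restricting the comprehension to the tokenizer's output keys, would be alternative ways to address it.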

Expected behavior

Model training to start.
