Broken Huggingface Datasets integration #10700

@maxstrobel

Bug Report

Description

The DVC integration with Hugging Face Datasets seems to be broken.
I followed this guide: https://dvc.org/doc/user-guide/integrations/huggingface

Reproduce

from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files="dvc://workshop/satellite-data/jan_train.csv",
    storage_options={"url": "https://github.com/iterative/dataset-registry.git"},
)

print(dataset)

Running this script fails with:

Traceback (most recent call last):
  File "C:\tmp\test\load.py", line 3, in <module>
    dataset = load_dataset(
              ^^^^^^^^^^^^^
  File "C:\tmp\test\.venv\Lib\site-packages\datasets\load.py", line 2151, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\tmp\test\.venv\Lib\site-packages\datasets\builder.py", line 808, in download_and_prepare
    fs, output_dir = url_to_fs(output_dir, **(storage_options or {}))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: url_to_fs() got multiple values for argument 'url'

Expected

The integration should work: the indicated file is downloaded and opened.

Environment information

Python version

python --version
Python 3.11.10

Venv (pip install datasets dvc):

Package                Version
---------------------- -----------
aiohappyeyeballs       2.4.6
aiohttp                3.11.13
aiohttp-retry          2.9.1
aiosignal              1.3.2
amqp                   5.3.1
annotated-types        0.7.0
antlr4-python3-runtime 4.9.3
appdirs                1.4.4
asyncssh               2.20.0
atpublic               5.1
attrs                  25.1.0
billiard               4.2.1
celery                 5.4.0
certifi                2025.1.31
cffi                   1.17.1
charset-normalizer     3.4.1
click                  8.1.8
click-didyoumean       0.3.1
click-plugins          1.1.1
click-repl             0.3.0
colorama               0.4.6
configobj              5.0.9
cryptography           44.0.1
datasets               3.3.2
dictdiffer             0.9.0
dill                   0.3.8
diskcache              5.6.3
distro                 1.9.0
dpath                  2.2.0
dulwich                0.22.7
dvc                    3.59.1
dvc-data               3.16.9
dvc-http               2.32.0
dvc-objects            5.1.0
dvc-render             1.0.2
dvc-studio-client      0.21.0
dvc-task               0.40.2
entrypoints            0.4
filelock               3.17.0
flatten-dict           0.4.2
flufl-lock             8.1.0
frozenlist             1.5.0
fsspec                 2024.12.0
funcy                  2.0
gitdb                  4.0.12
gitpython              3.1.44
grandalf               0.8
gto                    1.7.2
huggingface-hub        0.29.1
hydra-core             1.3.2
idna                   3.10
iterative-telemetry    0.0.10
kombu                  5.4.2
markdown-it-py         3.0.0
mdurl                  0.1.2
multidict              6.1.0
multiprocess           0.70.16
networkx               3.4.2
numpy                  2.2.3
omegaconf              2.3.0
orjson                 3.10.15
packaging              24.2
pandas                 2.2.3
pathspec               0.12.1
platformdirs           4.3.6
prompt-toolkit         3.0.50
propcache              0.3.0
psutil                 7.0.0
pyarrow                19.0.1
pycparser              2.22
pydantic               2.10.6
pydantic-core          2.27.2
pydot                  3.0.4
pygit2                 1.17.0
pygments               2.19.1
pygtrie                2.5.0
pyparsing              3.2.1
python-dateutil        2.9.0.post0
pytz                   2025.1
pywin32                308
pyyaml                 6.0.2
requests               2.32.3
rich                   13.9.4
ruamel-yaml            0.18.10
ruamel-yaml-clib       0.2.12
scmrepo                3.3.10
semver                 3.0.4
setuptools             75.8.0
shellingham            1.5.4
shortuuid              1.0.13
shtab                  1.7.1
six                    1.17.0
smmap                  5.0.2
sqltrie                0.11.2
tabulate               0.9.0
tomlkit                0.13.2
tqdm                   4.67.1
typer                  0.15.1
typing-extensions      4.12.2
tzdata                 2025.1
urllib3                2.3.0
vine                   5.1.0
voluptuous             0.15.2
wcwidth                0.2.13
xxhash                 3.5.0
yarl                   1.18.3
zc-lockfile            3.0.post1

Additional Information (if any):

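Unfortunately, url is a reserved argument in fsspec.url_to_fs: it is the function's first parameter, so a "url" key in storage_options is passed a second time as a keyword argument and triggers the TypeError above. Ideally, file system implementations like DVC should use another argument name to avoid this kind of error.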
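
To illustrate the name collision without installing anything, here is a minimal sketch. It assumes a stand-in function with the same leading parameter name as fsspec's url_to_fs (it is not the real function), and it mimics the call shape from the traceback. The key name repo_url is purely hypothetical and is not an existing DVC option; it only shows that a differently named key would not collide.

```python
# Stand-in for fsspec's url_to_fs: `url` is the first parameter and
# everything else is collected as keyword arguments. This is NOT the real
# function, only a stub with the same leading parameter name.
def url_to_fs(url, **kwargs):
    return url, kwargs


def try_call(output_dir, storage_options):
    """Mimic the call shape from the traceback and report the outcome."""
    try:
        url_to_fs(output_dir, **(storage_options or {}))
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "ok"


repo = "https://github.com/iterative/dataset-registry.git"

# A "url" key is unpacked as a keyword and clashes with the positional `url`.
colliding = try_call("/tmp/cache", {"url": repo})

# `repo_url` is a hypothetical key name, used here only for illustration.
renamed = try_call("/tmp/cache", {"repo_url": repo})

print(colliding)
print(renamed)
```

The first call fails the same way as in the report, while the second one goes through, which is why renaming the argument on the file system side would sidestep the problem.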

Metadata

Labels: bug (Did we break something?)
Status: Done