Status: Closed
Labels: bug (Did we break something?)
Bug Report
Description
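The DVC integration with Hugging Face datasets appears to be broken: load_dataset raises a TypeError when storage_options contains a url key.
I followed this guide: https://dvc.org/doc/user-guide/integrations/huggingface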
Reproduce
from datasets import load_dataset
dataset = load_dataset(
    "csv",
    data_files="dvc://workshop/satellite-data/jan_train.csv",
    storage_options={"url": "https://github.com/iterative/dataset-registry.git"},
)
print(dataset)
Traceback (most recent call last):
  File "C:\tmp\test\load.py", line 3, in <module>
    dataset = load_dataset(
              ^^^^^^^^^^^^^
  File "C:\tmp\test\.venv\Lib\site-packages\datasets\load.py", line 2151, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\tmp\test\.venv\Lib\site-packages\datasets\builder.py", line 808, in download_and_prepare
    fs, output_dir = url_to_fs(output_dir, **(storage_options or {}))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: url_to_fs() got multiple values for argument 'url'
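The error is Python's standard complaint when a function receives the same parameter both positionally and through unpacked keyword arguments. The following is a minimal sketch of that mechanism only: `url_to_fs_stub` and `call_with_options` are hypothetical stand-ins, not the real fsspec or datasets code, and the output directory string is made up.

```python
# Hypothetical stand-in: a function whose first parameter is named `url`,
# mirroring the call shape shown in the traceback. Not the fsspec implementation.
def url_to_fs_stub(url, **kwargs):
    return url, kwargs


def call_with_options(output_dir, storage_options):
    """Call the stub the way the traceback does and return the error text, if any."""
    try:
        # `output_dir` binds to `url` positionally; a "url" key in
        # storage_options then tries to bind to the same parameter again.
        url_to_fs_stub(output_dir, **(storage_options or {}))
    except TypeError as exc:
        return str(exc)
    return None


storage_options = {"url": "https://github.com/iterative/dataset-registry.git"}
message = call_with_options("some/output/dir", storage_options)
print(message)
```

With an empty or `None` options mapping the same call goes through, which suggests the failure is tied to the `url` key rather than to the data file itself.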
Expected
The integration works: the indicated file is downloaded and opened.
Environment information
Python version
python --version
Python 3.11.10
Venv (pip install datasets dvc):
Package Version
---------------------- -----------
aiohappyeyeballs 2.4.6
aiohttp 3.11.13
aiohttp-retry 2.9.1
aiosignal 1.3.2
amqp 5.3.1
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
appdirs 1.4.4
asyncssh 2.20.0
atpublic 5.1
attrs 25.1.0
billiard 4.2.1
celery 5.4.0
certifi 2025.1.31
cffi 1.17.1
charset-normalizer 3.4.1
click 8.1.8
click-didyoumean 0.3.1
click-plugins 1.1.1
click-repl 0.3.0
colorama 0.4.6
configobj 5.0.9
cryptography 44.0.1
datasets 3.3.2
dictdiffer 0.9.0
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
dpath 2.2.0
dulwich 0.22.7
dvc 3.59.1
dvc-data 3.16.9
dvc-http 2.32.0
dvc-objects 5.1.0
dvc-render 1.0.2
dvc-studio-client 0.21.0
dvc-task 0.40.2
entrypoints 0.4
filelock 3.17.0
flatten-dict 0.4.2
flufl-lock 8.1.0
frozenlist 1.5.0
fsspec 2024.12.0
funcy 2.0
gitdb 4.0.12
gitpython 3.1.44
grandalf 0.8
gto 1.7.2
huggingface-hub 0.29.1
hydra-core 1.3.2
idna 3.10
iterative-telemetry 0.0.10
kombu 5.4.2
markdown-it-py 3.0.0
mdurl 0.1.2
multidict 6.1.0
multiprocess 0.70.16
networkx 3.4.2
numpy 2.2.3
omegaconf 2.3.0
orjson 3.10.15
packaging 24.2
pandas 2.2.3
pathspec 0.12.1
platformdirs 4.3.6
prompt-toolkit 3.0.50
propcache 0.3.0
psutil 7.0.0
pyarrow 19.0.1
pycparser 2.22
pydantic 2.10.6
pydantic-core 2.27.2
pydot 3.0.4
pygit2 1.17.0
pygments 2.19.1
pygtrie 2.5.0
pyparsing 3.2.1
python-dateutil 2.9.0.post0
pytz 2025.1
pywin32 308
pyyaml 6.0.2
requests 2.32.3
rich 13.9.4
ruamel-yaml 0.18.10
ruamel-yaml-clib 0.2.12
scmrepo 3.3.10
semver 3.0.4
setuptools 75.8.0
shellingham 1.5.4
shortuuid 1.0.13
shtab 1.7.1
six 1.17.0
smmap 5.0.2
sqltrie 0.11.2
tabulate 0.9.0
tomlkit 0.13.2
tqdm 4.67.1
typer 0.15.1
typing-extensions 4.12.2
tzdata 2025.1
urllib3 2.3.0
vine 5.1.0
voluptuous 0.15.2
wcwidth 0.2.13
xxhash 3.5.0
yarl 1.18.3
zc-lockfile 3.0.post1
Additional Information (if any):
- Already reported in huggingface/datasets as "DVC integration broken" (huggingface/datasets#7421).
Unfortunately, url is a reserved argument in fsspec.url_to_fs, so ideally file system implementations like DVC should use another argument name to avoid this kind of error.
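One way to catch this kind of clash before making the call is to compare the option keys against the callee's named parameters. This is only a sketch under assumptions: `reserved_names`, `conflicting_keys`, and `url_to_fs_stub` are hypothetical helpers written for illustration, the `rev` key is an arbitrary example option, and none of this is part of fsspec, datasets, or DVC.

```python
import inspect


def reserved_names(func):
    """Return the parameter names of `func` that can be bound by keyword.

    The catch-all **kwargs parameter is excluded, since arbitrary keys
    passed through it cannot collide with anything.
    """
    params = inspect.signature(func).parameters.values()
    return {
        p.name
        for p in params
        if p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
    }


def conflicting_keys(func, options):
    """List option keys that would collide with named parameters of `func`."""
    return sorted(set(options) & reserved_names(func))


# Hypothetical stand-in with the same first parameter name as in the traceback.
def url_to_fs_stub(url, **kwargs):
    return url, kwargs


options = {"url": "https://github.com/iterative/dataset-registry.git", "rev": "main"}
clashes = conflicting_keys(url_to_fs_stub, options)
print(clashes)
```

A check like this only reports the collision; resolving it still requires the file system implementation to accept the repository location under a different option name, as suggested above.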