Experiments work properly when run directly, but fail with no explanation when run from a queue #10779

@jeremymatt

I'm having an issue with queued jobs failing with no explanation. The jobs complete as expected when run without the --queue flag. This is somewhat similar to #10673, but there are some differences (some of the tasks in #10673 completed), so I'm starting a new issue. I also tried using absolute paths in dvc.yaml as mentioned in #8747, but that didn't work.

Running directly with
dvc exp run --set-param decimation_percent=50 --name "dp-502" works as expected:

root@docker-desktop:/eod3d_pipeline# dvc exp run --set-param decimation_percent=50 --name "dp-502"
Reproducing experiment 'dp-502'                                                                                                                                                               
Building workspace index                                                                                                                                             |445 [00:00,  738entry/s]
Comparing indexes                                                                                                                                                   |444 [00:00, 24.6kentry/s]
WARNING: No file hash info found for '/eod3d_pipeline/results_50/plots'. It won't be created.                                                                                                 
WARNING: No file hash info found for '/eod3d_pipeline/results_50/metrics.json'. It won't be created.                                                                                          
Applying changes                                                                                                                                                    |0.00 [00:01,     ?file/s]
Stage 'query' didn't change, skipping                                                                                                                                                         
Running stage 'eval':                                                                                                                                                                         
> python eval.py
Importing samples...
 100% |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 432/432 [63.1ms elapsed, 0s remaining, 6.8K samples/s]   
 100% |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 432/432 [5.3s elapsed, 0s remaining, 94.7 samples/s]       

Updating lock file 'dvc.lock'                                                                                                                                                                 
                                                                                                                                                                                              
Ran experiment(s): dp-502
Experiment results have been applied to your workspace.
root@docker-desktop:/eod3d_pipeline# 

However, when I do dvc exp run --queue --set-param decimation_percent=50 --name "dp-502" and then dvc queue start -v I get the following output:

root@docker-desktop:/eod3d_pipeline# dvc queue start -v
2025-06-12 13:24:30,686 DEBUG: v3.60.0 (pip), CPython 3.9.23 on Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.36
2025-06-12 13:24:30,686 DEBUG: command: /usr/local/bin/dvc queue start -v
2025-06-12 13:24:31,171 DEBUG: Spawning 1 exp queue workers
2025-06-12 13:24:32,257 DEBUG: Worker status: {}
2025-06-12 13:24:32,259 DEBUG: Spawning exp queue worker
2025-06-12 13:24:32,259 DEBUG: start a new worker: dvc-exp-worker-1, node: dvc-exp-9e32c0-1@localhost
Started '1' new experiments task queue worker.
2025-06-12 13:24:32,273 DEBUG: Analytics is enabled.
2025-06-12 13:24:32,308 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpp1vkuyyv', '-v']
2025-06-12 13:24:32,314 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpp1vkuyyv', '-v'] with pid 24563
root@docker-desktop:/eod3d_pipeline# dvc doctor

dvc queue status shows:

root@docker-desktop:/eod3d_pipeline# dvc queue status
Task     Name    Created    Status
56cca35  dp-502  01:22 PM   Failed
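
For anyone scripting around this, here is a minimal sketch of picking the failed tasks out of that status table. It assumes the column layout shown above (Task, Name, Created, Status) and parses the plain-text output only. `failed_tasks` is a hypothetical helper for illustration, not part of DVC.

```python
def failed_tasks(status_text):
    """Return (task_id, name) pairs for rows whose Status column is 'Failed'.

    Assumes the plain-text layout of `dvc queue status` shown above:
    Task, Name, Created (may contain a space, e.g. '01:22 PM'), Status.
    """
    failed = []
    # Drop blank lines, then skip the header row.
    rows = [line for line in status_text.splitlines() if line.strip()]
    for row in rows[1:]:
        cols = row.split()
        # Status is taken from the last column because Created can span two tokens.
        if len(cols) >= 4 and cols[-1] == "Failed":
            failed.append((cols[0], cols[1]))
    return failed


sample = """Task     Name    Created    Status
56cca35  dp-502  01:22 PM   Failed
"""
print(failed_tasks(sample))
```

The resulting task IDs or names could then be passed to dvc queue logs one at a time to check whether any of them has output.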

dvc queue logs dp-502 (task 56cca35) does not return any logs:

root@docker-desktop:/eod3d_pipeline# dvc queue logs dp-502
ERROR: No output logs found for experiment 'dp-502'
root@docker-desktop:/eod3d_pipeline# 

The output of dvc doctor:

root@docker-desktop:/eod3d_pipeline# dvc doctor
DVC version: 3.60.0 (pip)
-------------------------
Platform: Python 3.9.23 on Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.36
Subprojects:
        dvc_data = 3.16.10
        dvc_objects = 5.1.1
        dvc_render = 1.0.2
        dvc_task = 0.40.2
        scmrepo = 3.3.11
Supports:
        http (aiohttp = 3.12.9, aiohttp-retry = 2.9.1),
        https (aiohttp = 3.12.9, aiohttp-retry = 2.9.1)
Config:
        Global: /root/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: 9p on C:\
Caches: local
Remotes: local
Workspace directory: 9p on C:\
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/965097ccbc74643f161ae89ff14d8ce1
