[Bug]: "Failed to connect to job <name> to collect Prometheus metrics" on container-based backends #2565

@un-def

Description

Steps to reproduce

Start a run on any container-based backend, e.g., RunPod

Actual behaviour

No response

Expected behaviour

No response

dstack version

0.19.5

Server logs

DEBUG    dstack._internal.server.services.runner.ssh:105 Cannot connect to <x.x.x.x>'s API: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
WARNING  dstack._internal.server.background.tasks.process_prometheus_metrics:120 Failed to connect to job <job_name> to collect Prometheus metrics

Additional information

Prometheus metrics are pulled from the shim, not the runner. As there is no shim on container-based backends, we should skip such jobs.

@runner_ssh_tunnel(ports=[DSTACK_SHIM_HTTP_PORT], retries=1)
def _pull_job_metrics(ports: dict[int, int], task_id: uuid.UUID) -> Optional[str]:
    shim_client = client.ShimClient(port=ports[DSTACK_SHIM_HTTP_PORT])
    return shim_client.get_task_metrics(task_id)
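
One possible shape for the skip is sketched below. It is not the actual dstack code: the `Job` class, its `has_shim` field, and `collect_metrics` are hypothetical stand-ins, and how the server really determines whether a job has a shim is an assumption left open here. The idea is only to filter out shim-less jobs before any tunnel is opened, so no connection attempt and no warning occur for them.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    # Hypothetical stand-in for the server's job model.
    name: str
    # Assumption: False for jobs on container-based backends such as RunPod.
    has_shim: bool


def collect_metrics(
    jobs: list[Job], pull: Callable[[Job], Optional[str]]
) -> dict[str, str]:
    """Pull metrics only from jobs that run a shim; skip the others."""
    collected: dict[str, str] = {}
    for job in jobs:
        if not job.has_shim:
            # No shim to query: skip without connecting or logging a warning.
            continue
        text = pull(job)
        if text is not None:
            collected[job.name] = text
    return collected
```

In this sketch `pull` plays the role of `_pull_job_metrics`; passing it in as a callable keeps the filter testable without an SSH tunnel.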

Metadata

Labels

bug, metrics
