Closed
Description
Steps to reproduce
Start a run on any container-based backend, e.g., RunPod
Actual behaviour
No response
Expected behaviour
No response
dstack version
0.19.5
Server logs
DEBUG dstack._internal.server.services.runner.ssh:105 Cannot connect to <x.x.x.x>'s API: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
WARNING dstack._internal.server.background.tasks.process_prometheus_metrics:120 Failed to connect to job <job_name> to collect Prometheus metrics
Additional information
Prometheus metrics are pulled from the shim, not the runner. As there is no shim on container-based backends, we should skip such jobs.
dstack/src/dstack/_internal/server/background/tasks/process_prometheus_metrics.py
Lines 132 to 135 in 52d113f
@runner_ssh_tunnel(ports=[DSTACK_SHIM_HTTP_PORT], retries=1)
def _pull_job_metrics(ports: dict[int, int], task_id: uuid.UUID) -> Optional[str]:
    shim_client = client.ShimClient(port=ports[DSTACK_SHIM_HTTP_PORT])
    return shim_client.get_task_metrics(task_id)
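
A minimal sketch of the proposed skip, assuming jobs can be filtered by backend before the collector opens the tunnel to the shim. It is not dstack's actual code: `Job`, `CONTAINER_BASED_BACKENDS`, `has_shim`, and `jobs_to_scrape` are hypothetical names, and the set lists only RunPod because that is the only container-based backend named in this issue.

```python
from dataclasses import dataclass

# Hypothetical: backends that run jobs in containers without a shim.
# Only RunPod is named in the issue; the real list may be longer.
CONTAINER_BASED_BACKENDS = frozenset({"runpod"})


@dataclass
class Job:
    """Hypothetical stand-in for a job record; not dstack's model."""

    name: str
    backend: str


def has_shim(job: Job) -> bool:
    """Return True if the job's backend is expected to run a shim."""
    return job.backend not in CONTAINER_BASED_BACKENDS


def jobs_to_scrape(jobs: list[Job]) -> list[Job]:
    """Keep only jobs whose Prometheus metrics can be pulled from a shim."""
    return [job for job in jobs if has_shim(job)]


if __name__ == "__main__":
    jobs = [Job(name="job-a", backend="runpod"), Job(name="job-b", backend="aws")]
    for job in jobs_to_scrape(jobs):
        print(f"would collect metrics for {job.name}")
```

Filtering before the tunnel is opened would avoid both the connection attempt and the resulting WARNING for jobs that cannot expose shim metrics.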