Closed
Description
https://docs.google.com/document/d/1zNiE7lVZYjhrxlTbh1UXOVpR6hh1GIeSfCfE9Lt5v6Y/edit?tab=t.0
Context
In production, SRE teams typically define Service Level Indicators (SLIs) to ensure that services meet expected performance and reliability standards. However, there are currently no dedicated SLIs for Ray Cluster, Ray Service, and Ray Job, which makes it challenging to monitor their health and performance.
Solution
We propose new metrics to enhance KubeRay's observability and provide better insight into the status and performance of Ray Cluster, Ray Service, and Ray Job.
Sub-issues
- [Feature][metrics] ray_cluster_provisioned_duration_seconds #3172
- [Feature][metrics] kuberay_cluster_info #3557
- [SLI Metrics] Add metric kuberay_cluster_condition_provisioned #3635
- [SLI-Metrics] Ray service info #3604
- [Feature][metrics] kuberay_service_ready #3177
- [SLI-Metric] kuberay_service_condition_upgrade_in_progress #3663
- [SLI Metrics] Add metric kuberay_job_info #3621
- [Metric] kuberay_job_deployment_status #3656
- [SLI Metrics] kuberay_job_execution_duration_seconds #3488
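
To illustrate how a duration metric such as ray_cluster_provisioned_duration_seconds could feed an SLI, here is a minimal sketch. This issue does not specify the metric's exact semantics or labels, so the sketch assumes one sample per cluster holding the seconds that cluster took to become provisioned. The cluster names, the sample values, the 120-second threshold, and the `provisioning_sli` helper are all hypothetical.

```python
from typing import Iterable


def provisioning_sli(durations_seconds: Iterable[float], threshold_seconds: float) -> float:
    """Return the fraction of clusters provisioned within threshold_seconds."""
    durations = list(durations_seconds)
    if not durations:
        raise ValueError("no provisioning samples")
    # A sample counts as "good" when provisioning finished within the threshold.
    good = sum(1 for d in durations if d <= threshold_seconds)
    return good / len(durations)


# Hypothetical samples: one assumed ray_cluster_provisioned_duration_seconds
# value per cluster. Real label names and values may differ.
samples = {
    "cluster-a": 42.0,
    "cluster-b": 95.5,
    "cluster-c": 310.0,
    "cluster-d": 61.2,
}

sli = provisioning_sli(samples.values(), threshold_seconds=120.0)
print(f"provisioning SLI: {sli:.2%}")  # prints "provisioning SLI: 75.00%"
```

An SRE team would normally express this ratio as a query or recording rule in its monitoring system; the Python form only makes the arithmetic explicit.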