Skip to content

Conversation

cchen777
Copy link
Contributor

@cchen777 cchen777 commented Aug 8, 2024

Why are these changes needed?

Enhance the visibility of Ray cluster k8s events during reconciliation, particularly when failures occur during autoscaler reconcile due to service account

Related issue number

#2189

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

using the yaml file with autoscaling for reproducing error

$ kind create cluster --image=kindest/node:v1.24.0
$ IMG=kuberay/operator:nightly make docker-build 
$ kind load docker-image kuberay/operator:nightly 
$ helm install kuberay-operator --set image.repository=kuberay/operator --set image.tag=nightly --set "featureGates[0].name=RayClusterStatusConditions,featureGates[0].enabled=true" ../helm-chart/kuberay-operator  
$ kubectl apply -f config/samples/ray-cluster.sample.yaml
$ kubectl apply -f config/samples/ray-cluster.sample-wo-as.yaml
$ kubectl describe raycluster raycluster-sample          
...
        Service Account Name:  random-sa
Events:
  Type     Reason                            Age                From                   Message
  ----     ------                            ----               ----                   -------
  Warning  AutoscalerServiceAccountNotFound  7s (x12 over 18s)  raycluster-controller  Failed to reconcile RayCluster default/raycluster-sample. If users specify ServiceAccountName for the head Pod, they need to create a ServiceAccount themselves. However, ServiceAccount random-sa is not found. Please create one. See the PR description of https://github.com/ray-project/kuberay/pull/1128 for more details.

@cchen777 cchen777 changed the title [Feature] Display reconcile failures as events, during reconcileAutoscalerServiceAccount [Feature] Display reconcile failures as events (ServiceAccount) Aug 8, 2024
@@ -1377,12 +1379,14 @@ func (r *RayClusterReconciler) reconcileAutoscalerServiceAccount(ctx context.Con
"If users specify ServiceAccountName for the head Pod, they need to create a ServiceAccount themselves. "+
"However, ServiceAccount %s is not found. Please create one. "+
"See the PR description of https://github.com/ray-project/kuberay/pull/1128 for more details.", namespacedName.Name), "ServiceAccount", namespacedName)
r.Recorder.Eventf(instance, corev1.EventTypeWarning, "Failed", "Failed to reconcile RayCluster %s/%s due to %s", instance.Namespace, instance.Name, err)
Copy link
Contributor

@rueian rueian Aug 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a helpful hint in the event message telling users they should create the service account themselves?

Copy link
Contributor Author

@cchen777 cchen777 Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently we only see something like this

Events:
  Type     Reason  Age                   From                   Message
  ----     ------  ----                  ----                   -------
  Warning  Failed  97s (x16 over 4m21s)  raycluster-controller  Failed to reconcile RayCluster default/raycluster-sample due to ServiceAccount "random-sa" not found

do you think we should propagate entire msg

logger.Error(err, fmt.Sprintf(
"If users specify ServiceAccountName for the head Pod, they need to create a ServiceAccount themselves. "+
"However, ServiceAccount %s is not found. Please create one. "+
"See the PR description of https://github.com/ray-project/kuberay/pull/1128 for more details.", namespacedName.Name), "ServiceAccount", namespacedName)

to the event message or we should just add the err msg with "Please create your service account %s in namespace %s beforehand"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think having a clear message that users should create the account by themselves can save both users and us a lot of time in the future.

@@ -1377,12 +1379,14 @@ func (r *RayClusterReconciler) reconcileAutoscalerServiceAccount(ctx context.Con
"If users specify ServiceAccountName for the head Pod, they need to create a ServiceAccount themselves. "+
"However, ServiceAccount %s is not found. Please create one. "+
"See the PR description of https://github.com/ray-project/kuberay/pull/1128 for more details.", namespacedName.Name), "ServiceAccount", namespacedName)
r.Recorder.Eventf(instance, corev1.EventTypeWarning, "Failed", "Failed to reconcile RayCluster %s/%s due to %s", instance.Namespace, instance.Name, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can you rebase with the master branch? There are some overlaps between this PR and the master branch. L1382 is the only change required, and you can remove all other changes.

  2. Add an actionable message to the event. That is, "If users specify ServiceAccountName for the head Pod, they need to create a ServiceAccount themselves. However, ServiceAccount %s is not found. Please create one."

@cchen777 cchen777 marked this pull request as ready for review September 5, 2024 08:32
@kevin85421 kevin85421 merged commit 13eb7b2 into ray-project:master Sep 5, 2024
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants