-
Notifications
You must be signed in to change notification settings - Fork 610
[Feature] Display reconcile failures as events (ServiceAccount) #2290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…account not available in namespace
@@ -1377,12 +1379,14 @@ func (r *RayClusterReconciler) reconcileAutoscalerServiceAccount(ctx context.Con | |||
"If users specify ServiceAccountName for the head Pod, they need to create a ServiceAccount themselves. "+ | |||
"However, ServiceAccount %s is not found. Please create one. "+ | |||
"See the PR description of https://github.com/ray-project/kuberay/pull/1128 for more details.", namespacedName.Name), "ServiceAccount", namespacedName) | |||
r.Recorder.Eventf(instance, corev1.EventTypeWarning, "Failed", "Failed to reconcile RayCluster %s/%s due to %s", instance.Namespace, instance.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a helpful hint in the event message telling users they should create the service account themselves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently we only see something like this
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Failed 97s (x16 over 4m21s) raycluster-controller Failed to reconcile RayCluster default/raycluster-sample due to ServiceAccount "random-sa" not found
do you think we should propagate entire msg
kuberay/ray-operator/controllers/ray/raycluster_controller.go
Lines 1378 to 1381 in 146d5ea
logger.Error(err, fmt.Sprintf( | |
"If users specify ServiceAccountName for the head Pod, they need to create a ServiceAccount themselves. "+ | |
"However, ServiceAccount %s is not found. Please create one. "+ | |
"See the PR description of https://github.com/ray-project/kuberay/pull/1128 for more details.", namespacedName.Name), "ServiceAccount", namespacedName) |
to the event message or we should just add the err msg with
"Please create your service account %s in namespace %s beforehand"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think having a clear message that users should create the account by themselves can save both users and us a lot of time in the future.
@@ -1377,12 +1379,14 @@ func (r *RayClusterReconciler) reconcileAutoscalerServiceAccount(ctx context.Con | |||
"If users specify ServiceAccountName for the head Pod, they need to create a ServiceAccount themselves. "+ | |||
"However, ServiceAccount %s is not found. Please create one. "+ | |||
"See the PR description of https://github.com/ray-project/kuberay/pull/1128 for more details.", namespacedName.Name), "ServiceAccount", namespacedName) | |||
r.Recorder.Eventf(instance, corev1.EventTypeWarning, "Failed", "Failed to reconcile RayCluster %s/%s due to %s", instance.Namespace, instance.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Can you rebase with the master branch? There are some overlaps between this PR and the master branch. L1382 is the only change required, and you can remove all other changes.
-
Add an actionable message to the event. That is, "If users specify ServiceAccountName for the head Pod, they need to create a ServiceAccount themselves. However, ServiceAccount %s is not found. Please create one."
Why are these changes needed?
Enhance the visibility of Ray cluster k8s events during reconciliation, particularly when failures occur during autoscaler reconcile due to service account
Related issue number
#2189
Checks
using the yaml file with autoscaling for reproducing error