-
Notifications
You must be signed in to change notification settings - Fork 603
[Feat][RayCluster] Use a new RayClusterReplicaFailure condition to reflect the result of reconcilePods #2259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feat][RayCluster] Use a new RayClusterReplicaFailure condition to reflect the result of reconcilePods #2259
Conversation
1478100
to
42b99fd
Compare
…flect the result of reconcilePods Signed-off-by: Rueian <rueiancsie@gmail.com>
…flect the result of reconcilePods Signed-off-by: Rueian <rueiancsie@gmail.com>
4d8a1e6
to
41462a9
Compare
meta.SetStatusCondition(&newInstance.Status.Conditions, metav1.Condition{ | ||
Type: string(rayv1.RayClusterReplicaFailure), | ||
Status: metav1.ConditionTrue, | ||
Reason: "FailedReconcilePods", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failing to reconcile pods is different from replica failure, do we need to make this distinction clearer?
For example, replica failure could mean one of the pods are unhealthy and need to be recreated, but a reconcile error could happen for other reasons before the pod is running
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @andrewsykim,
Possible distinctions are those in the previous draft:
- FailedDeleteAllPods
- FailedDeleteHeadPod
- FailedCreateHeadPod
- FailedDeleteWorkerPod
- FailedCreateWorkerPod
Do they make sense to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but my point is that maybe reconcile error and replica failures are better off as separate conditions instead of different reasons within the same condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We borrow the idea of the replica failures condition from the Kubernetes ReplicaSet controller. It has only two condition reasons:
- FailedCreate
- FailedDelete
And it doesn't reflect the health of pods either. I guess that is because the health of pods is handled by the pod controller already.
The FailedReconcilePods
reason in this PR may be oversimplified and exceeds the semantic of ReplicaFailure
indeed. I think I can bring the previous draft back by introducing them as individual error instances and wrapping them at the appropriate points. However, those points are still in the reconcilePods
function.
var (
failedDeleteAllPods = errstd.New("...")
failedDeleteHeadPod = errstd.New("...")
failedCreateHeadPod = errstd.New("...")
failedDeleteWorkerPod = errstd.New("...")
failedCreateWorkerPod = errstd.New("...")
)
func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv1.RayCluster) error {
...
if err := r.createHeadPod(ctx, *instance); err != nil {
common.FailedClustersCounterInc(instance.Namespace)
return errstd.Join(failedCreateHeadPod, err)
}
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var (
failedDeleteAllPods = errstd.New("...")
failedDeleteHeadPod = errstd.New("...")
failedCreateHeadPod = errstd.New("...")
failedDeleteWorkerPod = errstd.New("...")
failedCreateWorkerPod = errstd.New("...")
)
...
This makes sense to me.
… errors Signed-off-by: Rueian <rueiancsie@gmail.com>
func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv1.RayCluster) error { | ||
if err := r.doReconcilePods(ctx, instance); err != nil { | ||
return fmt.Errorf("%w: %w", reconcilePodsErr, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use https://pkg.go.dev/errors#Join and Unwrap instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To unwrap a joined error, we need a special trick:
errs := err.(interface{ Unwrap() []error }).Unwrap()
@@ -591,7 +600,17 @@ func (r *RayClusterReconciler) reconcileHeadlessService(ctx context.Context, ins | |||
return nil | |||
} | |||
|
|||
// reconcilePodsErr is a marker used by the calculateStatus() for setting the RayClusterReplicaFailure condition. | |||
var reconcilePodsErr = errstd.New("reconcile pods error") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to constant.go
?
@@ -391,6 +394,12 @@ func (r *RayClusterReconciler) inconsistentRayClusterStatus(ctx context.Context, | |||
oldStatus.Endpoints, newStatus.Endpoints, oldStatus.Head, newStatus.Head)) | |||
return true | |||
} | |||
if !reflect.DeepEqual(oldStatus.Conditions, newStatus.Conditions) { | |||
logger.Info("inconsistentRayClusterStatus", "detect inconsistency", fmt.Sprintf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid using fmt.Sprintf
. Please use
logger.Info(..., ..., "old conditions", oldStatus.Conditions, "new conditions", newStatus.Conditions)
instead. See this doc for more details.
meta.SetStatusCondition(&newInstance.Status.Conditions, metav1.Condition{ | ||
Type: string(rayv1.RayClusterReplicaFailure), | ||
Status: metav1.ConditionTrue, | ||
Reason: "FailedReconcilePods", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var (
failedDeleteAllPods = errstd.New("...")
failedDeleteHeadPod = errstd.New("...")
failedCreateHeadPod = errstd.New("...")
failedDeleteWorkerPod = errstd.New("...")
failedCreateWorkerPod = errstd.New("...")
)
...
This makes sense to me.
Signed-off-by: Rueian <rueiancsie@gmail.com>
…flect the result of reconcilePods Signed-off-by: Rueian <rueiancsie@gmail.com>
…flect the result of reconcilePods Signed-off-by: Rueian <rueiancsie@gmail.com>
) | ||
|
||
func RayClusterReplicaFailureReason(err error) string { | ||
if e, ok := err.(interface{ Unwrap() []error }); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err.(interface{ Unwrap() []error }).Unwrap()
is the only special trick to unwrap a joined error.
https://cs.opensource.google/go/go/+/refs/tags/go1.22.5:src/errors/join.go;l=60-62
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just left a nit comment.
@@ -523,3 +523,11 @@ env_vars: | |||
}) | |||
} | |||
} | |||
|
|||
func TestErrRayClusterReplicaFailureReason(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we test the case that the error is not a RayClusterReplicaFailure
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I will add a test that the RayClusterReplicaFailureReason() should return an empty string for a random error.
Signed-off-by: Rueian <rueiancsie@gmail.com>
This is a revised draft of #2245 based on the review.
In this version, a
RayClusterReplicaFailure
condition will be set whenever thereconcilePods()
returns areconcileErr
. This is done by thecalculateStatus()
based on thereconcileErr
solely. It assumes thereconcilePods()
is the last reconciliation function.We also consolidate the fine-grained condition reasons in the previous draft into one
FailedReconcilePods
for simplicity.Tests are conducted in three parts:
reconcilePods()
will return areconcileErr
.calculateStatus()
will set the condition only when the feature gate is enabled.inconsistentRayClusterStatus()
will reporttrue
when conditions are changed.Checks