Skip to content

[Feat][RayCluster] Use a new RayClusterReplicaFailure condition to reflect the result of reconcilePods #2259

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

rueian
Copy link
Contributor

@rueian rueian commented Jul 19, 2024

This is a revised draft of #2245 based on the review.

In this version, a RayClusterReplicaFailure condition will be set whenever the reconcilePods() returns a reconcileErr. This is done by the calculateStatus() based on the reconcileErr solely. It assumes the reconcilePods() is the last reconciliation function.

We also consolidate the fine-grained condition reasons in the previous draft into one FailedReconcilePods for simplicity.

Tests are conducted in three parts:

  1. Assert the reconcilePods() will return a reconcileErr.
  2. Assert the calculateStatus() will set the condition only when the feature gate is enabled.
  3. Assert the inconsistentRayClusterStatus() will report true when conditions are changed.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests

@rueian rueian force-pushed the replica-failure-condition-by-err branch from 1478100 to 42b99fd Compare July 19, 2024 15:13
rueian added 2 commits July 19, 2024 23:55
…flect the result of reconcilePods

Signed-off-by: Rueian <rueiancsie@gmail.com>
…flect the result of reconcilePods

Signed-off-by: Rueian <rueiancsie@gmail.com>
@rueian rueian force-pushed the replica-failure-condition-by-err branch from 4d8a1e6 to 41462a9 Compare July 19, 2024 15:55
@rueian rueian marked this pull request as ready for review July 19, 2024 16:23
meta.SetStatusCondition(&newInstance.Status.Conditions, metav1.Condition{
Type: string(rayv1.RayClusterReplicaFailure),
Status: metav1.ConditionTrue,
Reason: "FailedReconcilePods",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing to reconcile pods is different from replica failure, do we need to make this distinction clearer?

For example, replica failure could mean one of the pods are unhealthy and need to be recreated, but a reconcile error could happen for other reasons before the pod is running

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @andrewsykim,

Possible distinctions are those in the previous draft:

  • FailedDeleteAllPods
  • FailedDeleteHeadPod
  • FailedCreateHeadPod
  • FailedDeleteWorkerPod
  • FailedCreateWorkerPod

Do they make sense to you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but my point is that maybe reconcile error and replica failures are better off as separate conditions instead of different reasons within the same condition.

Copy link
Contributor Author

@rueian rueian Jul 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We borrow the idea of the replica failures condition from the Kubernetes ReplicaSet controller. It has only two condition reasons:

  • FailedCreate
  • FailedDelete

And it doesn't reflect the health of pods either. I guess that is because the health of pods is handled by the pod controller already.

The FailedReconcilePods reason in this PR may be oversimplified and exceeds the semantic of ReplicaFailure indeed. I think I can bring the previous draft back by introducing them as individual error instances and wrapping them at the appropriate points. However, those points are still in the reconcilePods function.

var (
	failedDeleteAllPods   = errstd.New("...")
	failedDeleteHeadPod   = errstd.New("...")
	failedCreateHeadPod   = errstd.New("...")
	failedDeleteWorkerPod = errstd.New("...")
	failedCreateWorkerPod = errstd.New("...")
)

func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv1.RayCluster) error {
		...
		if err := r.createHeadPod(ctx, *instance); err != nil {
			common.FailedClustersCounterInc(instance.Namespace)
			return errstd.Join(failedCreateHeadPod, err)
		}
		...
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var (
	failedDeleteAllPods   = errstd.New("...")
	failedDeleteHeadPod   = errstd.New("...")
	failedCreateHeadPod   = errstd.New("...")
	failedDeleteWorkerPod = errstd.New("...")
	failedCreateWorkerPod = errstd.New("...")
)
...

This makes sense to me.

@kevin85421 kevin85421 self-assigned this Jul 20, 2024
func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv1.RayCluster) error {
if err := r.doReconcilePods(ctx, instance); err != nil {
return fmt.Errorf("%w: %w", reconcilePodsErr, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use https://pkg.go.dev/errors#Join and Unwrap instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use Join. But Unwrap doesn't work well with the Join.
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To unwrap a joined error, we need a special trick:

errs := err.(interface{ Unwrap() []error }).Unwrap()

@@ -591,7 +600,17 @@ func (r *RayClusterReconciler) reconcileHeadlessService(ctx context.Context, ins
return nil
}

// reconcilePodsErr is a marker used by the calculateStatus() for setting the RayClusterReplicaFailure condition.
var reconcilePodsErr = errstd.New("reconcile pods error")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to constant.go?

@@ -391,6 +394,12 @@ func (r *RayClusterReconciler) inconsistentRayClusterStatus(ctx context.Context,
oldStatus.Endpoints, newStatus.Endpoints, oldStatus.Head, newStatus.Head))
return true
}
if !reflect.DeepEqual(oldStatus.Conditions, newStatus.Conditions) {
logger.Info("inconsistentRayClusterStatus", "detect inconsistency", fmt.Sprintf(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using fmt.Sprintf. Please use

logger.Info(..., ..., "old conditions", oldStatus.Conditions, "new conditions", newStatus.Conditions)

instead. See this doc for more details.

meta.SetStatusCondition(&newInstance.Status.Conditions, metav1.Condition{
Type: string(rayv1.RayClusterReplicaFailure),
Status: metav1.ConditionTrue,
Reason: "FailedReconcilePods",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var (
	failedDeleteAllPods   = errstd.New("...")
	failedDeleteHeadPod   = errstd.New("...")
	failedCreateHeadPod   = errstd.New("...")
	failedDeleteWorkerPod = errstd.New("...")
	failedCreateWorkerPod = errstd.New("...")
)
...

This makes sense to me.

rueian added 3 commits July 22, 2024 23:18
Signed-off-by: Rueian <rueiancsie@gmail.com>
…flect the result of reconcilePods

Signed-off-by: Rueian <rueiancsie@gmail.com>
…flect the result of reconcilePods

Signed-off-by: Rueian <rueiancsie@gmail.com>
)

func RayClusterReplicaFailureReason(err error) string {
if e, ok := err.(interface{ Unwrap() []error }); ok {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err.(interface{ Unwrap() []error }).Unwrap() is the only special trick to unwrap a joined error.

https://cs.opensource.google/go/go/+/refs/tags/go1.22.5:src/errors/join.go;l=60-62

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just left a nit comment.

@@ -523,3 +523,11 @@ env_vars:
})
}
}

func TestErrRayClusterReplicaFailureReason(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we test the case that the error is not a RayClusterReplicaFailure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I will add a test that the RayClusterReplicaFailureReason() should return an empty string for a random error.

@kevin85421 kevin85421 merged commit ca39dc9 into ray-project:master Jul 22, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants