Skip to content

[Feat][RayJob] Delete RayJob CR after job termination #2225

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

MortalHappiness
Copy link
Member

Why are these changes needed?

See the description in the corresponding issue for details.

Related issue number

Resolves: #1944

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@MortalHappiness MortalHappiness force-pushed the feature/#1944-delete-rayjob-crd-after-termination branch 2 times, most recently from 22fab0c to 5bfcfa5 Compare July 8, 2024 13:47
RayJobDefaultRequeueDuration = 3 * time.Second
RayJobDefaultClusterSelectorKey = "ray.io/cluster"
PythonUnbufferedEnvVarName = "PYTHONUNBUFFERED"
DELETE_RAYJOB_CR_AFTER_JOB_FINISHES = "DELETE_RAYJOB_CR_AFTER_JOB_FINISHES"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest avoiding environment variables for this behavior. I think there are valid reasons why you want either behavior in a single cluster and we shouldn't force kuberay-operator to choose one or the other.

How about a new field in RayJob like spec.deleteAfterJobFinishes or similar? Delete and shutdown after job finishes are two separate use-cases that I think warrant separate fields. They can be mutually exclusive so only one can be set at any time. .spec.ttlSecondsAfterFinished applies for both scenarios.

What do you think @MortalHappiness @kevin85421?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed with @kevin85421 few days ago. I also suggest that it may be better to change the RayJob CRD. However, he told me that he did't want the CRD be changed and this behavior can be implemented in cluster level. @kevin85421 What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current use case for this feature is to avoid having platform engineers periodically clean up submitter K8s Jobs. This feature is not related to Ray applications. That's why I prefer to set it at the KubeRay operator level. If there are use cases related to Ray applications, we can add a new field in CRD in the future to overwrite the behavior of the KubeRay operator level configuration.

I always try to avoid updating CRDs until there are clear use cases for doing so. It is hard to change, remove, or rename fields in CRDs.

@kevin85421 kevin85421 self-assigned this Jul 8, 2024
// We only need to delete the RayCluster. We don't need to delete the submitter Kubernetes Job so that users can still access
// the driver logs. In addition, a completed Kubernetes Job does not actually use any compute resources.
if _, err = r.deleteClusterResources(ctx, rayJobInstance); err != nil {
if os.Getenv(DELETE_RAYJOB_CR_AFTER_JOB_FINISHES) != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the usage consistent with other feature flags.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@MortalHappiness MortalHappiness force-pushed the feature/#1944-delete-rayjob-crd-after-termination branch from 5bfcfa5 to 10b5989 Compare July 9, 2024 11:15
RayJobDefaultRequeueDuration = 3 * time.Second
RayJobDefaultClusterSelectorKey = "ray.io/cluster"
PythonUnbufferedEnvVarName = "PYTHONUNBUFFERED"
DELETE_RAYJOB_CR_AFTER_JOB_FINISHES = "DELETE_RAYJOB_CR_AFTER_JOB_FINISHES"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeleteRayJobAfterJobFinishesEnvVar

Copy link
Member Author

@MortalHappiness MortalHappiness Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering when to use Pascal case and when to use Constant case. This file also uses Constant case.

ENABLE_ZERO_DOWNTIME = "ENABLE_ZERO_DOWNTIME"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to @kevin85421's suggestion below, I've moved this variable to constant.go.
#2225 (comment)

RayJobDefaultRequeueDuration = 3 * time.Second
RayJobDefaultClusterSelectorKey = "ray.io/cluster"
PythonUnbufferedEnvVarName = "PYTHONUNBUFFERED"
DELETE_RAYJOB_CR_AFTER_JOB_FINISHES = "DELETE_RAYJOB_CR_AFTER_JOB_FINISHES"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding this to Config API too: https://github.com/ray-project/kuberay/tree/master/ray-operator/apis/config/v1alpha1

Not a blocker since it's still alpha

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if _, err = r.deleteClusterResources(ctx, rayJobInstance); err != nil {
if os.Getenv(DELETE_RAYJOB_CR_AFTER_JOB_FINISHES) == "true" {
err = r.Client.Delete(ctx, rayJobInstance)
logger.Info("RayJob CR is deleted", "RayJob", rayJobInstance.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: RayJob resource is deleted

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if _, err = r.deleteClusterResources(ctx, rayJobInstance); err != nil {
if os.Getenv(DELETE_RAYJOB_CR_AFTER_JOB_FINISHES) == "true" {
err = r.Client.Delete(ctx, rayJobInstance)
logger.Info("RayJob CR is deleted", "RayJob", rayJobInstance.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think every log message should already include the RayJob CR's name. Would you mind checking it? We can remove "RayJob", rayJobInstance.Name if it is redundant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally I removed this logging because I found there is already another place to log the RayJob deletion event.

logger.Info("RayJob is being deleted", "DeletionTimestamp", rayJobInstance.ObjectMeta.DeletionTimestamp)

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might also prefer to add a log message here because we can't ensure that:

  • The RayJob will always have a finalizer added in the future. I seldom hear about use cases for the finalizer, so I might make it optional in the future.
  • The RayJob won't be requeued again. We can't promise that it will always have another reconciliation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I've added back the logging.

@MortalHappiness MortalHappiness force-pushed the feature/#1944-delete-rayjob-crd-after-termination branch 2 times, most recently from b1f721a to c483413 Compare July 10, 2024 09:38
// We only need to delete the RayCluster. We don't need to delete the submitter Kubernetes Job so that users can still access
// the driver logs. In addition, a completed Kubernetes Job does not actually use any compute resources.
if _, err = r.deleteClusterResources(ctx, rayJobInstance); err != nil {
if os.Getenv(utils.DELETE_RAYJOB_CR_AFTER_JOB_FINISHES) == "true" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has a different behavior comparing to other env var feature flags. For example, most env var feature flags are not case-sensitive. If you have time, you can have a follow up PR to make all feature flags' behaviors consistent (e.g. case-insensitive).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I've changed it to case-insensitive.

@MortalHappiness MortalHappiness force-pushed the feature/#1944-delete-rayjob-crd-after-termination branch 2 times, most recently from 27e417d to 08e3b7a Compare July 11, 2024 10:27
@@ -134,3 +134,6 @@ env:
# Enabling this feature contributes to the robustness of Ray clusters.
# - name: ENABLE_PROBES_INJECTION
# value: "true"
# If set to true, the RayJob CR itself will be deleted if shutdownAfterJobFinishes is set to true. Note that all resources created by the RayJob CR will be deleted, including the K8s Job.
# - name: DELETE_RAYJOB_CR_AFTER_JOB_FINISHES
# value: "true"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default value is false. Can you (1) add a comment to explain the default value and default behavior (only the RayCluster will be deleted) and (2) change value: "true" to value: "false" (default value)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Resolves: ray-project#1944
Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
@MortalHappiness MortalHappiness force-pushed the feature/#1944-delete-rayjob-crd-after-termination branch from 08e3b7a to 82c68f8 Compare July 12, 2024 14:50
@kevin85421 kevin85421 merged commit cc94c6a into ray-project:master Jul 12, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] TTL Delete RayJob CRD After Job Termination
3 participants