-
Notifications
You must be signed in to change notification settings - Fork 603
[Feat][RayJob] Delete RayJob CR after job termination #2225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feat][RayJob] Delete RayJob CR after job termination #2225
Conversation
22fab0c
to
5bfcfa5
Compare
RayJobDefaultRequeueDuration = 3 * time.Second | ||
RayJobDefaultClusterSelectorKey = "ray.io/cluster" | ||
PythonUnbufferedEnvVarName = "PYTHONUNBUFFERED" | ||
DELETE_RAYJOB_CR_AFTER_JOB_FINISHES = "DELETE_RAYJOB_CR_AFTER_JOB_FINISHES" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest avoiding environment variables for this behavior. I think there are valid reasons why you want either behavior in a single cluster and we shouldn't force kuberay-operator to choose one or the other.
How about a new field in RayJob like spec.deleteAfterJobFinishes
or similar? Delete and shutdown after job finishes are two separate use-cases that I think warrant separate fields. They can be mutually exclusive so only one can be set at any time. .spec.ttlSecondsAfterFinished
applies for both scenarios.
What do you think @MortalHappiness @kevin85421?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed with @kevin85421 few days ago. I also suggest that it may be better to change the RayJob CRD. However, he told me that he did't want the CRD be changed and this behavior can be implemented in cluster level. @kevin85421 What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current use case for this feature is to avoid having platform engineers periodically clean up submitter K8s Jobs. This feature is not related to Ray applications. That's why I prefer to set it at the KubeRay operator level. If there are use cases related to Ray applications, we can add a new field in CRD in the future to overwrite the behavior of the KubeRay operator level configuration.
I always try to avoid updating CRDs until there are clear use cases for doing so. It is hard to change, remove, or rename fields in CRDs.
// We only need to delete the RayCluster. We don't need to delete the submitter Kubernetes Job so that users can still access | ||
// the driver logs. In addition, a completed Kubernetes Job does not actually use any compute resources. | ||
if _, err = r.deleteClusterResources(ctx, rayJobInstance); err != nil { | ||
if os.Getenv(DELETE_RAYJOB_CR_AFTER_JOB_FINISHES) != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make the usage consistent with other feature flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
5bfcfa5
to
10b5989
Compare
RayJobDefaultRequeueDuration = 3 * time.Second | ||
RayJobDefaultClusterSelectorKey = "ray.io/cluster" | ||
PythonUnbufferedEnvVarName = "PYTHONUNBUFFERED" | ||
DELETE_RAYJOB_CR_AFTER_JOB_FINISHES = "DELETE_RAYJOB_CR_AFTER_JOB_FINISHES" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DeleteRayJobAfterJobFinishesEnvVar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering when to use Pascal case and when to use Constant case. This file also uses Constant case.
ENABLE_ZERO_DOWNTIME = "ENABLE_ZERO_DOWNTIME" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to @kevin85421's suggestion below, I've moved this variable to constant.go
.
#2225 (comment)
RayJobDefaultRequeueDuration = 3 * time.Second | ||
RayJobDefaultClusterSelectorKey = "ray.io/cluster" | ||
PythonUnbufferedEnvVarName = "PYTHONUNBUFFERED" | ||
DELETE_RAYJOB_CR_AFTER_JOB_FINISHES = "DELETE_RAYJOB_CR_AFTER_JOB_FINISHES" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding this to Config API too: https://github.com/ray-project/kuberay/tree/master/ray-operator/apis/config/v1alpha1
Not a blocker since it's still alpha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if _, err = r.deleteClusterResources(ctx, rayJobInstance); err != nil { | ||
if os.Getenv(DELETE_RAYJOB_CR_AFTER_JOB_FINISHES) == "true" { | ||
err = r.Client.Delete(ctx, rayJobInstance) | ||
logger.Info("RayJob CR is deleted", "RayJob", rayJobInstance.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: RayJob resource is deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if _, err = r.deleteClusterResources(ctx, rayJobInstance); err != nil { | ||
if os.Getenv(DELETE_RAYJOB_CR_AFTER_JOB_FINISHES) == "true" { | ||
err = r.Client.Delete(ctx, rayJobInstance) | ||
logger.Info("RayJob CR is deleted", "RayJob", rayJobInstance.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think every log message should already include the RayJob CR's name. Would you mind checking it? We can remove "RayJob", rayJobInstance.Name
if it is redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might also prefer to add a log message here because we can't ensure that:
- The RayJob will always have a finalizer added in the future. I seldom hear about use cases for the finalizer, so I might make it optional in the future.
- The RayJob won't be requeued again. We can't promise that it will always have another reconciliation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I've added back the logging.
b1f721a
to
c483413
Compare
// We only need to delete the RayCluster. We don't need to delete the submitter Kubernetes Job so that users can still access | ||
// the driver logs. In addition, a completed Kubernetes Job does not actually use any compute resources. | ||
if _, err = r.deleteClusterResources(ctx, rayJobInstance); err != nil { | ||
if os.Getenv(utils.DELETE_RAYJOB_CR_AFTER_JOB_FINISHES) == "true" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has a different behavior comparing to other env var feature flags. For example, most env var feature flags are not case-sensitive. If you have time, you can have a follow up PR to make all feature flags' behaviors consistent (e.g. case-insensitive).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I've changed it to case-insensitive.
27e417d
to
08e3b7a
Compare
@@ -134,3 +134,6 @@ env: | |||
# Enabling this feature contributes to the robustness of Ray clusters. | |||
# - name: ENABLE_PROBES_INJECTION | |||
# value: "true" | |||
# If set to true, the RayJob CR itself will be deleted if shutdownAfterJobFinishes is set to true. Note that all resources created by the RayJob CR will be deleted, including the K8s Job. | |||
# - name: DELETE_RAYJOB_CR_AFTER_JOB_FINISHES | |||
# value: "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the default value is false. Can you (1) add a comment to explain the default value and default behavior (only the RayCluster will be deleted) and (2) change value: "true"
to value: "false"
(default value)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Resolves: ray-project#1944 Signed-off-by: Chi-Sheng Liu <chishengliu@chishengliu.com>
08e3b7a
to
82c68f8
Compare
Why are these changes needed?
See the description in the corresponding issue for details.
Related issue number
Resolves: #1944
Checks