Conversation

@zanieb (Contributor) commented Dec 2, 2022

Co-authored with @peytonrunyan
Closes #7739

Converts Kubernetes job PIDs from {cluster_name}:{job_name} to {cluster_uid}:{namespace}:{job_name}. Cluster names can be set arbitrarily in the Kubernetes config and are not a reliable unique identifier, and the config does not exist at all when using an in-cluster agent or in cases like #7739.

We generate a stable cluster UID from the kube-system namespace UID, as described in kubernetes/kubernetes#44954, which tracks the lack of a Kubernetes API for retrieving a cluster identifier.

I attempted to use the job UID instead of the job name because it is guaranteed to be unique across namespaces, but there is no clear way to look up a job by its UID later without listing all jobs. Instead, we include the namespace in the PID to ensure we do not kill jobs in the wrong namespace.
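
For illustration only, here is a minimal sketch of reading the kube-system namespace UID with the official kubernetes Python client and assembling a PID in the new format; the client setup and the placeholder namespace/job names are assumptions, not the exact code in this PR.

from kubernetes import client, config

# Assumes kubectl-style credentials are available locally; an in-cluster
# agent would call config.load_incluster_config() instead.
config.load_kube_config()

# The kube-system namespace is created at cluster bootstrap, so its UID is a
# stable identifier for the cluster itself.
kube_system = client.CoreV1Api().read_namespace("kube-system")
cluster_uid = kube_system.metadata.uid

# Placeholder namespace and job name, purely for illustration.
print(f"{cluster_uid}:my-namespace:my-job")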

Additionally, #7701 did not update the KubernetesJobResult return values to include the new identifier; this PR includes some minor refactoring to fix that.

Example

from prefect.infrastructure import KubernetesJob
import anyio


async def main():
    async with anyio.create_task_group() as tg:
        # A job that sleeps long enough to be killed; the short pod watch
        # timeout keeps the example quick.
        job = KubernetesJob(
            command=["sleep", "1000"], pod_watch_timeout_seconds=5
        )

        # `tg.start` returns once the job reports its PID, now formatted as
        # {cluster_uid}:{namespace}:{job_name}
        pid = await tg.start(job.run)
        print("PID:", pid)

        # The PID alone is enough to find and kill the job later
        await job.kill(pid)
        print("Killed :)")


anyio.run(main)

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
    • If no issue exists and your change is not a small fix, please create an issue first.
  • This pull request includes tests or only affects documentation.
  • This pull request includes a label categorizing the change e.g. fix, feature, enhancement

netlify bot commented Dec 2, 2022

Deploy Preview for prefect-orion ready!

🔨 Latest commit: 479d3d0
🔍 Latest deploy log: https://app.netlify.com/sites/prefect-orion/deploys/638e5e9d1e5fee00085747c9
😎 Deploy Preview: https://deploy-preview-7747--prefect-orion.netlify.app

@zanieb mentioned this pull request Dec 2, 2022
@peytonrunyan self-assigned this Dec 5, 2022
@peytonrunyan marked this pull request as ready for review December 5, 2022 21:12
@zanieb added the enhancement (An improvement of an existing feature) label Dec 5, 2022
@peytonrunyan added the v2 label Dec 5, 2022
@bunchesofdonald (Contributor) left a comment

This looks great! Using the namespace UID sure seems like the right way forward.

@zanieb merged commit 0c9ee08 into main Dec 6, 2022
@zanieb deleted the kubernetes-pid branch December 6, 2022 17:04
github-actions bot pushed a commit to ddelange/prefect that referenced this pull request Dec 15, 2022
…s job identifiers (PrefectHQ#7747)

Co-authored-by: peytonrunyan <peytonrunyan@gmail.com>
Co-authored-by: Peyton <44583861+peytonrunyan@users.noreply.github.com>
@scottwalshqs commented

FYI, this change appears to introduce a new requirement: the agent must now have read access to the kube-system namespace in order to get the cluster UID.

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"namespaces \"kube-system\" is forbidden: User \"system:serviceaccount:default:default\" cannot get resource \"namespaces\" in API group \"\" in the namespace \"kube-system\"","reason":"Forbidden","details":{"name":"kube-system","kind":"namespaces"},"code":403}
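
If the agent's service account may lack this permission, here is a hedged sketch of checking it with a SelfSubjectAccessReview via the kubernetes Python client; the in-cluster config call is an assumption about where the check runs, and none of this is part of the Prefect agent itself.

from kubernetes import client, config

# Assumes this runs inside the cluster under the agent's service account.
config.load_incluster_config()

# Ask the API server whether the current identity can read the kube-system
# namespace, which this change now requires for building job PIDs.
review = client.AuthorizationV1Api().create_self_subject_access_review(
    client.V1SelfSubjectAccessReview(
        spec=client.V1SelfSubjectAccessReviewSpec(
            resource_attributes=client.V1ResourceAttributes(
                verb="get", resource="namespaces", name="kube-system"
            )
        )
    )
)
print("can read kube-system:", review.status.allowed)

If the check fails, granting the agent's service account get access to the kube-system namespace via RBAC should resolve the 403 above.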

Labels: enhancement (An improvement of an existing feature)
Development

Successfully merging this pull request may close these issues.

Agent fails to cancel kubernetes jobs
4 participants