Skip to content

[argo] Fix eventing for @parallel based workflow #2516

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 29, 2025

Conversation

valayDave
Copy link
Collaborator

@valayDave valayDave commented Jul 23, 2025

Back story behind how this PR came to be ?

When we introduced JobSets into Argo Workflows, we had to implement a few "work-arounds" (hacks) in the JobSet manifest construction. One major hack we HAD to implement was replacing the inputs.parameters.workerCount in the JobSet manifest given to the Argo Workflow Template with an un-quoted templating string. The templating string looks like:{{=asInt(inputs.parameters.workerCount)}}. This templating string is quoted when we dump the manifest to a dictionary BUT having it quoted would mean the parameter gets casted as a string in the JobSet Manifest and jobset controller will reject that resource creation.

Argo on the other hand accepts storing the "malformed" manifest when we submit the workflow template and trigger it via the k8s resource creation API. When we trigger via the k8s API, Argo controller will apply the templating the to the manifest before executing it. Argo does not accept such malformed manifests when we submit workflows via the argo CLI ie. it will crash with errors like Failed to submit workflow: templates.TestEventFlow.tasks.pre-eval-foreach-parallel templates.pre-eval-foreach-parallel.tasks.train templates.train.resource.manifest must be a valid yaml"

When sensors where originally introduced into Metaflow, we used the ArgoWorkflowTrigger as a mean to trigger workflows. The ArgoWorkflowTrigger uses the Argo CLI under the hood to submit workflows. This glitches out when the workflow template involves manifests compiled for Jobsets. The sensor crashes with the logs like :

{"level":"info","ts":1753152856.4700022,"logger":"argo-events.sensor","caller":"argo-workflow/argo-workflow.go:207","msg":"submitting command &exec.Cmd{Path:\"/usr/local/bin/argo\", Args:[]string{\"argo\", \"-n\", \"jobs-default\", \"submit\", \"/tmp/testeventflow-2654745760\", \"--kubeconfig\", \"/etc/obp/data-plane-kubeconfig/config\"}, Env:[]string(nil), Dir:\"\", Stdin:io.Reader(nil), Stdout:(*os.File)(0xc00012c008), Stderr:(*os.File)(0xc00012c010), ExtraFiles:[]*os.File(nil), SysProcAttr:(*syscall.SysProcAttr)(nil), Process:(*os.Process)(nil), ProcessState:(*os.ProcessState)(nil), ctx:context.Context(nil), Err:error(nil), Cancel:(func() error)(nil), WaitDelay:0, childIOFiles:[]io.Closer(nil), parentIOPipes:[]io.Closer(nil), goroutine:[]func() error(nil), goroutineErr:(<-chan error)(nil), ctxResult:(<-chan exec.ctxResult)(nil), createdByStack:[]uint8(nil), lookPathErr:error(nil)}","sensorName":"testeventflow","triggerName":"testeventflow","triggerType":"ArgoWorkflow"}
time="2025-07-22T02:54:16.560Z" level=fatal msg="Failed to submit workflow: templates.TestEventFlow.tasks.pre-eval-foreach-parallel templates.pre-eval-foreach-parallel.tasks.train templates.train.resource.manifest must be a valid yaml"

Since the sensor deployments for workflows using @parallel was broken, we have to ship this commit/PR to resolve the issue.

How did we fix it ?

We fixed this issue with 2 patches:

  1. Patching Argo workflows in Metaflow : Argo events also offers StandardK8sTrigger which allows creating K8s resource directly. Since we are still creating workflows, this change should ideally (hopefully) be a No-op. Since we are just switching the abstraction triggering the workflow. From a metaflow POV, all the labels we propagate down are mostly captures.
  2. Patching Jobsets: Jobsets used to dump to JSON when we compiled the manifest for Argo Workflows. Because we were using JSON, the manifest became malform when it was triggered by event. Instead we now use YAML.

@valayDave valayDave requested review from savingoyal and saikonen July 23, 2025 22:03
@valayDave valayDave changed the title [argo] Fix Eventing for Jobsets [argo] Fix eventing for @parallel based workflow Jul 23, 2025
Copy link
Collaborator

@saikonen saikonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick initial testing on this and it would seem that it does not solve the issue for OSS at least.
Before the changes the malformed manifest YAML error is hidden within the argo sensor logs. After the change, the event trigger itself starts working, and the workflow that is supposed to trigger does launch, but now the malformed YAML error is simply moved to the launched workflows error message instead.

@valayDave valayDave force-pushed the valay/events-js-fix branch 3 times, most recently from 9b624c6 to 4debe54 Compare July 28, 2025 18:45
> Back story behind how this PR came to be ?

When we introduced JobSets into Argo Workflows, we had to implement a few "work-arounds" (hacks) in the JobSet manifest construction. One major hack we **HAD to implement** was replacing the `inputs.parameters.workerCount` in the JobSet manifest given to the Argo Workflow Template with an un-quoted templating string. The templating string looks like:`{{=asInt(inputs.parameters.workerCount)}}`. This templating string is quoted when we dump the manifest to a dictionary BUT having it quoted would mean the parameter gets casted as a string in the JobSet Manifest and jobset controller will reject that resource creation.

Argo on the other hand accepts storing the "malformed" manifest when we submit the workflow template and trigger it via the k8s resource creation API. When we trigger via the k8s API, Argo controller will apply the templating the to the manifest before executing it. Argo does not accept such malformed manifests when we submit workflows via the argo CLI ie. it will crash with errors like `Failed to submit workflow: templates.TestEventFlow.tasks.pre-eval-foreach-parallel templates.pre-eval-foreach-parallel.tasks.train templates.train.resource.manifest must be a valid yaml"`

When sensors where originally introduced into Metaflow, we used the [ArgoWorkflowTrigger](https://pkg.go.dev/github.com/argoproj/argo-events@v1.7.6/pkg/apis/sensor/v1alpha1#ArgoWorkflowTrigger) as a mean to trigger workflows. The ArgoWorkflowTrigger uses the Argo CLI under the hood to submit workflows. This glitches out when the workflow template involves manifests compiled for Jobsets. The sensor crashes with the logs like :
```log
{"level":"info","ts":1753152856.4700022,"logger":"argo-events.sensor","caller":"argo-workflow/argo-workflow.go:207","msg":"submitting command &exec.Cmd{Path:\"/usr/local/bin/argo\", Args:[]string{\"argo\", \"-n\", \"jobs-default\", \"submit\", \"/tmp/testeventflow-2654745760\", \"--kubeconfig\", \"/etc/obp/data-plane-kubeconfig/config\"}, Env:[]string(nil), Dir:\"\", Stdin:io.Reader(nil), Stdout:(*os.File)(0xc00012c008), Stderr:(*os.File)(0xc00012c010), ExtraFiles:[]*os.File(nil), SysProcAttr:(*syscall.SysProcAttr)(nil), Process:(*os.Process)(nil), ProcessState:(*os.ProcessState)(nil), ctx:context.Context(nil), Err:error(nil), Cancel:(func() error)(nil), WaitDelay:0, childIOFiles:[]io.Closer(nil), parentIOPipes:[]io.Closer(nil), goroutine:[]func() error(nil), goroutineErr:(<-chan error)(nil), ctxResult:(<-chan exec.ctxResult)(nil), createdByStack:[]uint8(nil), lookPathErr:error(nil)}","sensorName":"testeventflow","triggerName":"testeventflow","triggerType":"ArgoWorkflow"}
time="2025-07-22T02:54:16.560Z" level=fatal msg="Failed to submit workflow: templates.TestEventFlow.tasks.pre-eval-foreach-parallel templates.pre-eval-foreach-parallel.tasks.train templates.train.resource.manifest must be a valid yaml"
```

Since the sensor deployments for workflows using `@parallel` was broken, we have to ship this commit/PR to resolve the issue.

> How did we fix it ?

Argo events also offers [StandardK8sTrigger](https://pkg.go.dev/github.com/argoproj/argo-events/pkg/apis/sensor/v1alpha1#StandardK8STrigger) which allows creating K8s resource directly. Since we are still creating workflows, this change should ideally (hopefully) be a No-op. Since we are just switching the abstraction triggering the workflow. From a metaflow POV, all the labels we propagate down are mostly captures.
@valayDave valayDave force-pushed the valay/events-js-fix branch from 4debe54 to effe28f Compare July 29, 2025 16:57
@valayDave valayDave requested a review from saikonen July 29, 2025 16:57
@valayDave valayDave marked this pull request as ready for review July 29, 2025 17:37
- This PR contains two key components.
- One is the argo workflows change to use StandardK8sTrigger instead of using a ArgoWorkflowTrigger (Explaination Provided Earlier)
- The second change is the change to Jobset Manifest dumping to YAML instead of JSON. Because we dump to JSON somehow the parameter passdown breaks when the workflow is triggered via events.
- Instead if we YAML dump the jobset, then the substitution works out perfectly.
- using the vendor'd YAML package.
@valayDave valayDave force-pushed the valay/events-js-fix branch 2 times, most recently from a9c1d62 to 82dad56 Compare July 29, 2025 17:52
@savingoyal savingoyal merged commit 01fb592 into Netflix:master Jul 29, 2025
28 of 29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants