-
Notifications
You must be signed in to change notification settings - Fork 872
[argo] Fix eventing for @parallel
based workflow
#2516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@parallel
based workflow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quick initial testing on this and it would seem that it does not solve the issue for OSS at least.
Before the changes the malformed manifest YAML error is hidden within the argo sensor logs. After the change, the event trigger itself starts working, and the workflow that is supposed to trigger does launch, but now the malformed YAML error is simply moved to the launched workflows error message instead.
9b624c6
to
4debe54
Compare
> Back story behind how this PR came to be ? When we introduced JobSets into Argo Workflows, we had to implement a few "work-arounds" (hacks) in the JobSet manifest construction. One major hack we **HAD to implement** was replacing the `inputs.parameters.workerCount` in the JobSet manifest given to the Argo Workflow Template with an un-quoted templating string. The templating string looks like:`{{=asInt(inputs.parameters.workerCount)}}`. This templating string is quoted when we dump the manifest to a dictionary BUT having it quoted would mean the parameter gets casted as a string in the JobSet Manifest and jobset controller will reject that resource creation. Argo on the other hand accepts storing the "malformed" manifest when we submit the workflow template and trigger it via the k8s resource creation API. When we trigger via the k8s API, Argo controller will apply the templating the to the manifest before executing it. Argo does not accept such malformed manifests when we submit workflows via the argo CLI ie. it will crash with errors like `Failed to submit workflow: templates.TestEventFlow.tasks.pre-eval-foreach-parallel templates.pre-eval-foreach-parallel.tasks.train templates.train.resource.manifest must be a valid yaml"` When sensors where originally introduced into Metaflow, we used the [ArgoWorkflowTrigger](https://pkg.go.dev/github.com/argoproj/argo-events@v1.7.6/pkg/apis/sensor/v1alpha1#ArgoWorkflowTrigger) as a mean to trigger workflows. The ArgoWorkflowTrigger uses the Argo CLI under the hood to submit workflows. This glitches out when the workflow template involves manifests compiled for Jobsets. The sensor crashes with the logs like : ```log {"level":"info","ts":1753152856.4700022,"logger":"argo-events.sensor","caller":"argo-workflow/argo-workflow.go:207","msg":"submitting command &exec.Cmd{Path:\"/usr/local/bin/argo\", Args:[]string{\"argo\", \"-n\", \"jobs-default\", \"submit\", \"/tmp/testeventflow-2654745760\", \"--kubeconfig\", \"/etc/obp/data-plane-kubeconfig/config\"}, Env:[]string(nil), Dir:\"\", Stdin:io.Reader(nil), Stdout:(*os.File)(0xc00012c008), Stderr:(*os.File)(0xc00012c010), ExtraFiles:[]*os.File(nil), SysProcAttr:(*syscall.SysProcAttr)(nil), Process:(*os.Process)(nil), ProcessState:(*os.ProcessState)(nil), ctx:context.Context(nil), Err:error(nil), Cancel:(func() error)(nil), WaitDelay:0, childIOFiles:[]io.Closer(nil), parentIOPipes:[]io.Closer(nil), goroutine:[]func() error(nil), goroutineErr:(<-chan error)(nil), ctxResult:(<-chan exec.ctxResult)(nil), createdByStack:[]uint8(nil), lookPathErr:error(nil)}","sensorName":"testeventflow","triggerName":"testeventflow","triggerType":"ArgoWorkflow"} time="2025-07-22T02:54:16.560Z" level=fatal msg="Failed to submit workflow: templates.TestEventFlow.tasks.pre-eval-foreach-parallel templates.pre-eval-foreach-parallel.tasks.train templates.train.resource.manifest must be a valid yaml" ``` Since the sensor deployments for workflows using `@parallel` was broken, we have to ship this commit/PR to resolve the issue. > How did we fix it ? Argo events also offers [StandardK8sTrigger](https://pkg.go.dev/github.com/argoproj/argo-events/pkg/apis/sensor/v1alpha1#StandardK8STrigger) which allows creating K8s resource directly. Since we are still creating workflows, this change should ideally (hopefully) be a No-op. Since we are just switching the abstraction triggering the workflow. From a metaflow POV, all the labels we propagate down are mostly captures.
4debe54
to
effe28f
Compare
- This PR contains two key components. - One is the argo workflows change to use StandardK8sTrigger instead of using a ArgoWorkflowTrigger (Explaination Provided Earlier) - The second change is the change to Jobset Manifest dumping to YAML instead of JSON. Because we dump to JSON somehow the parameter passdown breaks when the workflow is triggered via events. - Instead if we YAML dump the jobset, then the substitution works out perfectly. - using the vendor'd YAML package.
a9c1d62
to
82dad56
Compare
When we introduced JobSets into Argo Workflows, we had to implement a few "work-arounds" (hacks) in the JobSet manifest construction. One major hack we HAD to implement was replacing the
inputs.parameters.workerCount
in the JobSet manifest given to the Argo Workflow Template with an un-quoted templating string. The templating string looks like:{{=asInt(inputs.parameters.workerCount)}}
. This templating string is quoted when we dump the manifest to a dictionary BUT having it quoted would mean the parameter gets casted as a string in the JobSet Manifest and jobset controller will reject that resource creation.Argo on the other hand accepts storing the "malformed" manifest when we submit the workflow template and trigger it via the k8s resource creation API. When we trigger via the k8s API, Argo controller will apply the templating the to the manifest before executing it. Argo does not accept such malformed manifests when we submit workflows via the argo CLI ie. it will crash with errors like
Failed to submit workflow: templates.TestEventFlow.tasks.pre-eval-foreach-parallel templates.pre-eval-foreach-parallel.tasks.train templates.train.resource.manifest must be a valid yaml"
When sensors where originally introduced into Metaflow, we used the ArgoWorkflowTrigger as a mean to trigger workflows. The ArgoWorkflowTrigger uses the Argo CLI under the hood to submit workflows. This glitches out when the workflow template involves manifests compiled for Jobsets. The sensor crashes with the logs like :
Since the sensor deployments for workflows using
@parallel
was broken, we have to ship this commit/PR to resolve the issue.We fixed this issue with 2 patches: