Skip to content
Permalink

Comparing changes

Choose two branches to see what’s changed or to start a new pull request. If you need to, you can also or learn more about diff comparisons.

Open a pull request

Create a new pull request by comparing changes across two branches. If you need to, you can also . Learn more about diff comparisons here.
base repository: Netflix/metaflow
Failed to load repositories. Confirm that selected base ref is valid, then try again.
Loading
base: 2.16.7
Choose a base ref
...
head repository: Netflix/metaflow
Failed to load repositories. Confirm that selected head ref is valid, then try again.
Loading
compare: 2.16.8
Choose a head ref
  • 2 commits
  • 3 files changed
  • 2 contributors

Commits on Jul 29, 2025

  1. [argo] Fix eventing for @parallel based workflow (#2516)

    > Back story behind how this PR came to be ?
    
    When we introduced JobSets into Argo Workflows, we had to implement a
    few "work-arounds" (hacks) in the JobSet manifest construction. One
    major hack we **HAD to implement** was replacing the
    `inputs.parameters.workerCount` in the JobSet manifest given to the Argo
    Workflow Template with an un-quoted templating string. The templating
    string looks like:`{{=asInt(inputs.parameters.workerCount)}}`. This
    templating string is quoted when we dump the manifest to a dictionary
    BUT having it quoted would mean the parameter gets casted as a string in
    the JobSet Manifest and jobset controller will reject that resource
    creation.
    
    Argo on the other hand accepts storing the "malformed" manifest when we
    submit the workflow template and trigger it via the k8s resource
    creation API. When we trigger via the k8s API, Argo controller will
    apply the templating the to the manifest before executing it. Argo does
    not accept such malformed manifests when we submit workflows via the
    argo CLI ie. it will crash with errors like `Failed to submit workflow:
    templates.TestEventFlow.tasks.pre-eval-foreach-parallel
    templates.pre-eval-foreach-parallel.tasks.train
    templates.train.resource.manifest must be a valid yaml"`
    
    When sensors where originally introduced into Metaflow, we used the
    [ArgoWorkflowTrigger](https://pkg.go.dev/github.com/argoproj/argo-events@v1.7.6/pkg/apis/sensor/v1alpha1#ArgoWorkflowTrigger)
    as a mean to trigger workflows. The ArgoWorkflowTrigger uses the Argo
    CLI under the hood to submit workflows. This glitches out when the
    workflow template involves manifests compiled for Jobsets. The sensor
    crashes with the logs like :
    ```log
    {"level":"info","ts":1753152856.4700022,"logger":"argo-events.sensor","caller":"argo-workflow/argo-workflow.go:207","msg":"submitting command &exec.Cmd{Path:\"/usr/local/bin/argo\", Args:[]string{\"argo\", \"-n\", \"jobs-default\", \"submit\", \"/tmp/testeventflow-2654745760\", \"--kubeconfig\", \"/etc/obp/data-plane-kubeconfig/config\"}, Env:[]string(nil), Dir:\"\", Stdin:io.Reader(nil), Stdout:(*os.File)(0xc00012c008), Stderr:(*os.File)(0xc00012c010), ExtraFiles:[]*os.File(nil), SysProcAttr:(*syscall.SysProcAttr)(nil), Process:(*os.Process)(nil), ProcessState:(*os.ProcessState)(nil), ctx:context.Context(nil), Err:error(nil), Cancel:(func() error)(nil), WaitDelay:0, childIOFiles:[]io.Closer(nil), parentIOPipes:[]io.Closer(nil), goroutine:[]func() error(nil), goroutineErr:(<-chan error)(nil), ctxResult:(<-chan exec.ctxResult)(nil), createdByStack:[]uint8(nil), lookPathErr:error(nil)}","sensorName":"testeventflow","triggerName":"testeventflow","triggerType":"ArgoWorkflow"}
    time="2025-07-22T02:54:16.560Z" level=fatal msg="Failed to submit workflow: templates.TestEventFlow.tasks.pre-eval-foreach-parallel templates.pre-eval-foreach-parallel.tasks.train templates.train.resource.manifest must be a valid yaml"
    ```
    
    Since the sensor deployments for workflows using `@parallel` was broken,
    we have to ship this commit/PR to resolve the issue.
    
    > How did we fix it ?
    
    We fixed this issue with 2 patches:
    
    1. Patching Argo workflows in Metaflow : Argo events also offers
    [StandardK8sTrigger](https://pkg.go.dev/github.com/argoproj/argo-events/pkg/apis/sensor/v1alpha1#StandardK8STrigger)
    which allows creating K8s resource directly. Since we are still creating
    workflows, this change should ideally (hopefully) be a No-op. Since we
    are just switching the abstraction triggering the workflow. From a
    metaflow POV, all the labels we propagate down are mostly captures.
    2. Patching Jobsets: Jobsets used to dump to JSON when we compiled the
    manifest for Argo Workflows. Because we were using JSON, the manifest
    became malform when it was triggered by event. Instead we now use YAML.
    valayDave authored Jul 29, 2025
    Configuration menu
    Copy the full SHA
    01fb592 View commit details
    Browse the repository at this point in the history
  2. patch release (#2529)

    savingoyal authored Jul 29, 2025
    Configuration menu
    Copy the full SHA
    6138942 View commit details
    Browse the repository at this point in the history
Loading