Trial job is succeeded but metrics are not reported, reconcile requeued #1795

@ccastro-pedro

/kind bug

What steps did you take and what happened:
I ran the random experiment example through the Katib UI. (Creating the experiment from Python produces the same error.)

When creating the experiment in the UI, I changed only the trial template (YAML), to the following:

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    sidecar.istio.io/inject: "false"
    katib-metricscollector-injection: enabled
    katib-metrics-collector-injection: enabled
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
        katib-metricscollector-injection: enabled
        katib-metrics-collector-injection: enabled
    spec:
      containers:
        - name: training-container
          image: docker.io/kubeflowkatib/mxnet-mnist:latest
          command:
            - "python3"
            - "/opt/mxnet-mnist/mnist.py"
            - "--batch-size=64"
            - "--lr=${trialParameters.learningRate}"
            - "--num-layers=${trialParameters.numberLayers}"
            - "--optimizer=${trialParameters.optimizer}"
      restartPolicy: Never

After a couple of minutes, the pods created by the job terminated with status Completed and printed my objective metric as follows:

2022-01-25T20:26:59Z INFO     Epoch[9] Train-accuracy=0.993770
2022-01-25T20:26:59Z INFO     Epoch[9] Time cost=5.344
2022-01-25T20:26:59Z INFO     Epoch[9] Validation-accuracy=0.978802
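
One thing that can be ruled out locally is whether these lines have a `name=value` shape that a StdOut-style collector could parse. The following is a minimal sketch in Python, assuming metrics are extracted as `name=value` pairs; the regular expression is an illustrative approximation and is not copied from Katib's source, and `extract_metrics` is a hypothetical helper.

```python
import re

# Assumed name=value pattern (illustrative approximation, not Katib's own filter).
METRIC_PATTERN = re.compile(
    r"([\w|-]+)\s*=\s*([+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[Ee][+-]?\d+)?)"
)

def extract_metrics(lines, metric_names):
    """Collect the values of the requested metrics from raw log lines."""
    found = {}
    for line in lines:
        for name, value in METRIC_PATTERN.findall(line):
            if name in metric_names:
                found.setdefault(name, []).append(float(value))
    return found

# The three log lines printed by the training container.
log_lines = [
    "2022-01-25T20:26:59Z INFO     Epoch[9] Train-accuracy=0.993770",
    "2022-01-25T20:26:59Z INFO     Epoch[9] Time cost=5.344",
    "2022-01-25T20:26:59Z INFO     Epoch[9] Validation-accuracy=0.978802",
]
metrics = extract_metrics(log_lines, {"Validation-accuracy", "Train-accuracy"})
print(metrics)
```

Under that assumption both `Validation-accuracy` and `Train-accuracy` are picked up, so the log format itself does not look like the obstacle.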

However, the experiment, suggestions, and trials remain in the Running status, and no new trials are created.

When I check the katib-controller logs, I see the following messages:

{"level":"info","ts":1643142603.5533006,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-vzkjcznm"}
{"level":"info","ts":1643142603.633143,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-c9qr67ww"}
{"level":"info","ts":1643142603.655875,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-smw6p6rg"}
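
Since the controller logs are JSON lines, the affected trials can be listed with a short script. This is a minimal sketch that assumes only the `msg` and `Trial` fields seen in the lines above; `stuck_trials` is a hypothetical helper, and the sample input is the three log lines from this report.

```python
import json

REQUEUE_MSG = "Trial job is succeeded but metrics are not reported, reconcile requeued"

def stuck_trials(log_lines):
    """Return the sorted, de-duplicated trials that logged the requeue message."""
    trials = set()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines mixed into the log
        if isinstance(entry, dict) and entry.get("msg") == REQUEUE_MSG and "Trial" in entry:
            trials.add(entry["Trial"])
    return sorted(trials)

sample = [
    '{"level":"info","ts":1643142603.5533006,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-vzkjcznm"}',
    '{"level":"info","ts":1643142603.633143,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-c9qr67ww"}',
    '{"level":"info","ts":1643142603.655875,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-smw6p6rg"}',
]
stuck = stuck_trials(sample)
print(stuck)
```

All three running trials appear in the result, which matches the `runningTrialList` of the experiment.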

Additional Information:

kubectl get experiment random-experiment -o yaml -n kubeflow-user-example-com

Results in:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  creationTimestamp: "2022-01-25T20:25:22Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  name: random-experiment
  namespace: kubeflow-user-example-com
  resourceVersion: "126860285"
  uid: 91283c82-46e4-4b8b-9a3a-5cb730ad41d6
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.05
    objectiveMetricName: Validation-accuracy
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
      step: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      max: "64"
      min: "1"
      step: "1"
    name: num-layers
    parameterType: int
  - feasibleSpace:
      list:
      - sgd
      - adam
      - ftrl
    name: optimizer
    parameterType: categorical
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - name: learningRate
      reference: lr
    - name: numberLayers
      reference: num-layers
    - name: optimizer
      reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      metadata:
        annotations:
          katib-metrics-collector-injection: enabled
          katib-metricscollector-injection: enabled
          sidecar.istio.io/inject: "false"
        labels:
          katib-metrics-collector-injection: enabled
          katib-metricscollector-injection: enabled
          sidecar.istio.io/inject: "false"
      spec:
        template:
          metadata:
            annotations:
              katib-metrics-collector-injection: enabled
              katib-metricscollector-injection: enabled
              sidecar.istio.io/inject: "false"
            labels:
              katib-metrics-collector-injection: enabled
              katib-metricscollector-injection: enabled
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - command:
              - python3
              - /opt/mxnet-mnist/mnist.py
              - --batch-size=64
              - --lr=${trialParameters.learningRate}
              - --num-layers=${trialParameters.numberLayers}
              - --optimizer=${trialParameters.optimizer}
              image: docker.io/kubeflowkatib/mxnet-mnist:latest
              name: training-container
            restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2022-01-25T20:25:22Z"
    lastUpdateTime: "2022-01-25T20:25:22Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    observation: {}
  runningTrialList:
  - random-experiment-smw6p6rg
  - random-experiment-c9qr67ww
  - random-experiment-vzkjcznm
  startTime: "2022-01-25T20:25:22Z"
  trials: 3
  trialsRunning: 3

and

kubectl get trial random-experiment-c9qr67ww -n kubeflow-user-example-com -o yaml

Results in:

apiVersion: kubeflow.org/v1beta1
kind: Trial
metadata:
  creationTimestamp: "2022-01-25T20:25:44Z"
  finalizers:
  - clean-metrics-in-db
  generation: 1
  labels:
    katib.kubeflow.org/experiment: random-experiment
  name: random-experiment-c9qr67ww
  namespace: kubeflow-user-example-com
  ownerReferences:
  - apiVersion: kubeflow.org/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Experiment
    name: random-experiment
    uid: 91283c82-46e4-4b8b-9a3a-5cb730ad41d6
  resourceVersion: "126860266"
  uid: 24a7d825-2737-4d6f-8ba8-5e22d776443f
spec:
  failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  metricsCollector:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.05
    objectiveMetricName: Validation-accuracy
    type: maximize
  parameterAssignments:
  - name: lr
    value: "0.018768621111940782"
  - name: num-layers
    value: "7"
  - name: optimizer
    value: sgd
  primaryContainerName: training-container
  runSpec:
    apiVersion: batch/v1
    kind: Job
    metadata:
      annotations:
        katib-metrics-collector-injection: enabled
        katib-metricscollector-injection: enabled
        sidecar.istio.io/inject: "false"
      labels:
        katib-metrics-collector-injection: enabled
        katib-metricscollector-injection: enabled
        sidecar.istio.io/inject: "false"
      name: random-experiment-c9qr67ww
      namespace: kubeflow-user-example-com
    spec:
      template:
        metadata:
          annotations:
            katib-metrics-collector-injection: enabled
            katib-metricscollector-injection: enabled
            sidecar.istio.io/inject: "false"
          labels:
            katib-metrics-collector-injection: enabled
            katib-metricscollector-injection: enabled
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - python3
            - /opt/mxnet-mnist/mnist.py
            - --batch-size=64
            - --lr=0.018768621111940782
            - --num-layers=7
            - --optimizer=sgd
            image: docker.io/kubeflowkatib/mxnet-mnist:latest
            name: training-container
          restartPolicy: Never
  successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
status:
  conditions:
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Trial is created
    reason: TrialCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Trial is running
    reason: TrialRunning
    status: "True"
    type: Running
  startTime: "2022-01-25T20:25:44Z"
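
The status above contains only the Created and Running conditions. A small sketch for checking this programmatically, assuming the Trial has been loaded into a Python dict (for example from the same `kubectl get trial` command with `-o json`) and that the terminal condition type is named `Succeeded`; `has_condition` is a hypothetical helper.

```python
def has_condition(trial, cond_type):
    """True if the trial reports a condition of this type with status "True"."""
    conditions = trial.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in conditions
    )

# Conditions copied from the Trial status shown above.
trial = {
    "status": {
        "conditions": [
            {"type": "Created", "status": "True", "reason": "TrialCreated"},
            {"type": "Running", "status": "True", "reason": "TrialRunning"},
        ]
    }
}
is_running = has_condition(trial, "Running")
is_succeeded = has_condition(trial, "Succeeded")  # "Succeeded" is an assumed type name
print(is_running, is_succeeded)
```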

What did you expect to happen:
Once the metrics are captured, each trial's status should change to Succeeded, and the experiment should finish when the goal or maxTrialCount is reached.

What am I missing?

Thanks
