Trial job is succeeded but metrics are not reported, reconcile requeued

/kind bug

**What steps did you take and what happened:**
I tried to run the random experiment example through the Katib UI. (I also tried creating the experiment with Python, but the same error occurs.)

While creating the experiment in the UI, the only thing I changed was the trial template (YAML), which I replaced with this:

```
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    sidecar.istio.io/inject: "false"
    katib-metricscollector-injection: enabled
    katib-metrics-collector-injection: enabled
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
        katib-metricscollector-injection: enabled
        katib-metrics-collector-injection: enabled     
    spec: 
      containers:
        - name: training-container
          image: docker.io/kubeflowkatib/mxnet-mnist:latest
          command:
            - "python3"
            - "/opt/mxnet-mnist/mnist.py"
            - "--batch-size=64"
            - "--lr=${trialParameters.learningRate}"
            - "--num-layers=${trialParameters.numberLayers}"
            - "--optimizer=${trialParameters.optimizer}"
      restartPolicy: Never
```

After a couple of minutes, the pods created by the job terminated with the status Completed and printed my objective metric like this:
```
2022-01-25T20:26:59Z INFO     Epoch[9] Train-accuracy=0.993770
2022-01-25T20:26:59Z INFO     Epoch[9] Time cost=5.344
2022-01-25T20:26:59Z INFO     Epoch[9] Validation-accuracy=0.978802
```
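As a sanity check on the log format, here is a minimal sketch that scans the lines above for `name=value` pairs and keeps the metrics named in the experiment's objective. The regular expression and the `extract_metrics` helper are assumptions for illustration, not necessarily the exact filter the StdOut metrics collector applies.

```python
import re

# Assumption: metrics are printed as `name=value` pairs. This pattern is
# illustrative and may differ from the filter Katib actually uses.
METRIC_PATTERN = re.compile(
    r"([\w-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?)"
)


def extract_metrics(log_lines, wanted):
    """Return the last value seen for each wanted metric name."""
    found = {}
    for line in log_lines:
        for name, value in METRIC_PATTERN.findall(line):
            if name in wanted:
                found[name] = float(value)
    return found


# The three log lines printed by the training container.
LOG_LINES = [
    "2022-01-25T20:26:59Z INFO     Epoch[9] Train-accuracy=0.993770",
    "2022-01-25T20:26:59Z INFO     Epoch[9] Time cost=5.344",
    "2022-01-25T20:26:59Z INFO     Epoch[9] Validation-accuracy=0.978802",
]

metrics = extract_metrics(LOG_LINES, {"Validation-accuracy", "Train-accuracy"})
print(metrics)
```

Under this assumed pattern both `Validation-accuracy` and `Train-accuracy` are found, which suggests the log format itself is parseable and the problem lies elsewhere.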

But the experiment, suggestions, and trials remain in the Running status, *and new trials are not created.*

When I check the katib-controller logs, I see the following messages:

```
{"level":"info","ts":1643142603.5533006,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-vzkjcznm"}
{"level":"info","ts":1643142603.633143,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-c9qr67ww"}
{"level":"info","ts":1643142603.655875,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-smw6p6rg"}
```
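To list which trials are affected, the JSON log lines can be filtered by message. The sketch below is only an illustration: the `stuck_trials` helper is hypothetical, and it assumes each log line is a JSON object with the `logger`, `msg`, and `Trial` fields seen in the excerpt above.

```python
import json

# The message reported by the trial controller in the excerpt above.
REQUEUE_MSG = (
    "Trial job is succeeded but metrics are not reported, reconcile requeued"
)


def stuck_trials(log_lines):
    """Return the sorted, de-duplicated trial names that hit the requeue message."""
    names = set()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        trial = entry.get("Trial")
        if (
            trial
            and entry.get("logger") == "trial-controller"
            and entry.get("msg") == REQUEUE_MSG
        ):
            names.add(trial)
    return sorted(names)


# Sample lines copied from the controller log excerpt.
SAMPLE = [
    '{"level":"info","ts":1643142603.5533006,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-vzkjcznm"}',
    '{"level":"info","ts":1643142603.633143,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-c9qr67ww"}',
    '{"level":"info","ts":1643142603.655875,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-smw6p6rg"}',
]

affected = stuck_trials(SAMPLE)
print(affected)
```

The same function could be pointed at a file of saved katib-controller logs by passing the open file object as `log_lines`.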

**Additional Information:**

```
kubectl get experiment random-experiment -o yaml -n kubeflow-user-example-com
```

Results in:

<details>
  <summary>
    Output
  </summary>
<pre><code>
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  creationTimestamp: "2022-01-25T20:25:22Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  name: random-experiment
  namespace: kubeflow-user-example-com
  resourceVersion: "126860285"
  uid: 91283c82-46e4-4b8b-9a3a-5cb730ad41d6
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.05
    objectiveMetricName: Validation-accuracy
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
      step: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      max: "64"
      min: "1"
      step: "1"
    name: num-layers
    parameterType: int
  - feasibleSpace:
      list:
      - sgd
      - adam
      - ftrl
    name: optimizer
    parameterType: categorical
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - name: learningRate
      reference: lr
    - name: numberLayers
      reference: num-layers
    - name: optimizer
      reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      metadata:
        annotations:
          katib-metrics-collector-injection: enabled
          katib-metricscollector-injection: enabled
          sidecar.istio.io/inject: "false"
        labels:
          katib-metrics-collector-injection: enabled
          katib-metricscollector-injection: enabled
          sidecar.istio.io/inject: "false"
      spec:
        template:
          metadata:
            annotations:
              katib-metrics-collector-injection: enabled
              katib-metricscollector-injection: enabled
              sidecar.istio.io/inject: "false"
            labels:
              katib-metrics-collector-injection: enabled
              katib-metricscollector-injection: enabled
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - command:
              - python3
              - /opt/mxnet-mnist/mnist.py
              - --batch-size=64
              - --lr=${trialParameters.learningRate}
              - --num-layers=${trialParameters.numberLayers}
              - --optimizer=${trialParameters.optimizer}
              image: docker.io/kubeflowkatib/mxnet-mnist:latest
              name: training-container
            restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2022-01-25T20:25:22Z"
    lastUpdateTime: "2022-01-25T20:25:22Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    observation: {}
  runningTrialList:
  - random-experiment-smw6p6rg
  - random-experiment-c9qr67ww
  - random-experiment-vzkjcznm
  startTime: "2022-01-25T20:25:22Z"
  trials: 3
  trialsRunning: 3
</code></pre>
</details>

and

```
kubectl get trial random-experiment-c9qr67ww -n kubeflow-user-example-com -o yaml
```

Results in:

<details>
  <summary>
    Output
  </summary>
<pre><code>
apiVersion: kubeflow.org/v1beta1
kind: Trial
metadata:
  creationTimestamp: "2022-01-25T20:25:44Z"
  finalizers:
  - clean-metrics-in-db
  generation: 1
  labels:
    katib.kubeflow.org/experiment: random-experiment
  name: random-experiment-c9qr67ww
  namespace: kubeflow-user-example-com
  ownerReferences:
  - apiVersion: kubeflow.org/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Experiment
    name: random-experiment
    uid: 91283c82-46e4-4b8b-9a3a-5cb730ad41d6
  resourceVersion: "126860266"
  uid: 24a7d825-2737-4d6f-8ba8-5e22d776443f
spec:
  failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  metricsCollector:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.05
    objectiveMetricName: Validation-accuracy
    type: maximize
  parameterAssignments:
  - name: lr
    value: "0.018768621111940782"
  - name: num-layers
    value: "7"
  - name: optimizer
    value: sgd
  primaryContainerName: training-container
  runSpec:
    apiVersion: batch/v1
    kind: Job
    metadata:
      annotations:
        katib-metrics-collector-injection: enabled
        katib-metricscollector-injection: enabled
        sidecar.istio.io/inject: "false"
      labels:
        katib-metrics-collector-injection: enabled
        katib-metricscollector-injection: enabled
        sidecar.istio.io/inject: "false"
      name: random-experiment-c9qr67ww
      namespace: kubeflow-user-example-com
    spec:
      template:
        metadata:
          annotations:
            katib-metrics-collector-injection: enabled
            katib-metricscollector-injection: enabled
            sidecar.istio.io/inject: "false"
          labels:
            katib-metrics-collector-injection: enabled
            katib-metricscollector-injection: enabled
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - python3
            - /opt/mxnet-mnist/mnist.py
            - --batch-size=64
            - --lr=0.018768621111940782
            - --num-layers=7
            - --optimizer=sgd
            image: docker.io/kubeflowkatib/mxnet-mnist:latest
            name: training-container
          restartPolicy: Never
  successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
status:
  conditions:
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Trial is created
    reason: TrialCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Trial is running
    reason: TrialRunning
    status: "True"
    type: Running
  startTime: "2022-01-25T20:25:44Z"
</code></pre>
</details>

**What did you expect to happen:**
Ideally, once the metrics are captured and the goal or `maxTrialCount` is reached, the trial status should change to Succeeded.

What am I missing? 

Thanks



