Skip to content

cml runner early shutdown from idle-timeout with active job #808

@dacbd

Description

@dacbd

Cloud provider: gcp
SCM / CICD: github / actions

I have been playing with short --idle-timeout values and I was a bit baffled when I encountered this. One workflow shutdown earlier before the GitHub action completes the final job, and the other completes as expected, but they are using nearly identical workflows.

Successful workflow:

name: Build and run Inference - DEV
on:
  push:
    branches: [dev]
jobs:
  build:
    environment: dev
    runs-on: ubuntu-latest
    steps:
      - uses: dacbd/gcr-build-push@main
        with:
          tags: latest
          project-id: ${{ secrets.GCP_PROJECT_ID }}
          GCR-key: ${{ secrets.GCP_GCR_KEY }}
          container-name: ***
  deploy-runner:
    runs-on: ubuntu-latest
    environment: dev
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: Deploy runner on GCP
        env:
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
        run: |
          cml-runner \
            --single \
            --idle-timeout=360 \
            --token=${{ secrets.PAT_DCB }} \
            --cloud=gcp \
            --cloud-region=us-west \
            --cloud-type=e2-standard-16
  run-inference:
    needs: [build, deploy-runner]
    runs-on: [self-hosted, cml]
    environment: dev
    steps:
      - uses: actions/checkout@v2
      - name: Setup Container image name
        run: echo "ML_CONTAINER=gcr.io/${{ secrets.GCP_PROJECT_ID }}/***" >> "${GITHUB_ENV}"
      - name: Docker GCR Login
        uses: docker/login-action@v1
        with: 
          registry: gcr.io
          username: _json_key
          password: ${{ secrets.GCP_GCR_KEY }}
      - name: Load container from GCR
        run: docker pull ${{ env.ML_CONTAINER }}:latest
      - name: Run Inference
        run: |
          docker run ****

Failed workflow:

name: Build and run Inference - PROD
on:
  schedule: # Saturday at 1400 UTC
    - cron: '0 14 * * 6'
  push:
    branches: [prod]
jobs:
  build:
    environment: prod
    runs-on: ubuntu-latest
    steps:
      - uses: dacbd/gcr-build-push@v1
        with:
          tags: latest
          project-id: ${{ secrets.GCP_PROJECT_ID }}
          GCR-key: ${{ secrets.GCP_GCR_KEY }}
          container-name: ***
  deploy-runner:
    runs-on: ubuntu-latest
    environment: prod
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: Deploy runner on GCP
        env:
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
        run: |
          cml-runner \
            --single \
            --idle-timeout=360 \
            --token=${{ secrets.PAT_DCB }} \
            --cloud=gcp \
            --cloud-region=us-west \
            --cloud-type=e2-standard-16
  run-inference:
    needs: [build, deploy-runner]
    runs-on: [self-hosted, cml]
    environment: prod
    steps:
      - uses: actions/checkout@v2
      - name: Setup Container image name
        run: echo "ML_CONTAINER=gcr.io/${{ secrets.GCP_PROJECT_ID }}/***" >> "${GITHUB_ENV}"
      - name: Docker GCR Login
        uses: docker/login-action@v1
        with: 
          registry: gcr.io
          username: _json_key
          password: ${{ secrets.GCP_GCR_KEY }}
      - name: Load container from GCR
        run: docker pull ${{ env.ML_CONTAINER }}
      - name: Run Inference
        run: |
          docker run ***

Successful cml.service log

daniel_barnes@cml-z53bz3oz1t:~$ sudo journalctl -u cml.service -f
-- Logs begin at Tue 2021-11-09 20:03:04 UTC. --
Nov 09 20:05:47 cml-z53bz3oz1t systemd[1]: Started cml.service.
Nov 09 20:05:52 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Preparing workdir /tmp/tmp.7PolL2jgVJ/.cml/cml-z53bz3oz1t..."}
Nov 09 20:05:52 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Launching github runner"}
Nov 09 20:06:02 cml-z53bz3oz1t cml.sh[17039]: {"level":"warn","message":"SpotNotifier can not be started."}
Nov 09 20:06:03 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:03.618Z","level":"info","message":"runner status","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:06:03 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:03.619Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:06:04 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:04.057Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/yyyy/xxxx","status":"ready"}
Nov 09 20:06:16 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:15.944Z","job":4157205540,"level":"info","message":"runner status Running job: run-inference","repo":"https://github.com/yyyy/xxxx","status":"job_started"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.355Z","job":"","level":"info","message":"runner status Job run-inference completed with result: Succeeded","repo":"https://github.com/yyyy/xxxx","status":"job_ended","success":true}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.595Z","level":"info","message":"runner status √ Removed .credentials","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.596Z","level":"info","message":"runner status √ Removed .runner","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"runner status","reason":"proc_exit","status":"terminated"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Nov 09 20:18:41 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Unregistering runner cml-z53bz3oz1t..."}
Nov 09 20:18:41 cml-z53bz3oz1t cml.sh[17039]: {"level":"error","message":"\tFailed: Cannot destructure property 'id' of '(intermediate value)' as it is undefined."}
Nov 09 20:18:44 cml-z53bz3oz1t systemd[1]: cml.service: Succeeded.

Failed cml.service log

daniel_barnes@cml-uawryesbyo:~$ sudo journalctl -u cml.service -f
-- Logs begin at Tue 2021-11-09 19:23:49 UTC. --
Nov 09 19:26:34 cml-uawryesbyo systemd[1]: Started cml.service.
Nov 09 19:26:39 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Preparing workdir /tmp/tmp.TAz3gTOugR/.cml/cml-uawryesbyo..."}
Nov 09 19:26:39 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Launching github runner"}
Nov 09 19:26:49 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"SpotNotifier can not be started."}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.163Z","level":"info","message":"runner status","repo":"https://github.com/yyyy/xxxx"}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.163Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/yyyy/xxxx"}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.583Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/yyyy/xxxx","status":"ready"}
Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Failed parsing log: Reduce of empty array with no initial value"}
Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Original log bytes, as Base64: 2021-11-09 19:27:01Z: Running job: run-inference\n"}
Nov 09 19:32:51 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"runner status","reason":"timeout:360","status":"terminated"}
Nov 09 19:32:51 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Nov 09 19:33:11 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Unregistering runner cml-uawryesbyo..."}
Nov 09 19:33:12 cml-uawryesbyo cml.sh[17135]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-uawryesbyo\" is still running a job\""}
Nov 09 19:33:22 cml-uawryesbyo systemd[1]: cml.service: Succeeded.

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions