ci: Migrate CI to hosted Cirrus Runners #32989
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks
For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32989.

Reviews
See the guideline for information on the review process. If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
Concept ACK. This will also need to go back to
Testing a backport to 29.x here: https://github.com/testing-cirrus-runners/bitcoin2/actions/runs/16320536543. I think the best course of action could be to look for a little more conceptual review here, and after that squash the "ci: port x" commits in this changeset down to a single one, to make backporting to the multiple supported branches easier.
Concept ACK
I don't think this is true. A pull request that modifies a core header (like serialize.h) will now always start from a cold cache. The current persistent workers have a high ccache hit rate for pulls that are (force) pushed for minor fixups (https://0xb10c.github.io/bitcoin-core-ci-stats/graph/ccache/). Also, before CI runs, pull requests are rebased/merged with master, so the age of a pull request alone shouldn't affect cache hit rate. However, the trade-offs here are probably worth it; we can go forward and try to optimize the ccache hit rate later. Concept ACK.
This seems a bit scary. Are you saying that a proprietary third party outside of our control can now push directly to the repo? My assumption was that the tokens would be added to CI in this repo and CI would have write access to the registry, not the other way round. Why would the registry need write access here? Edit: We may reconsider #31850 and drop container image caching, and just accept the intermittent network IO errors or network speed issues.
Yes, force pushes for minor fixups are the trade-off we have in the current implementation. As you say, we can set the ccache to save on pull requests too (in the future) if necessary.
Sorry for not being clearer! The robot account gets read/write access to the Quay.io (docker) repo, not this code repository!
Yes, I would love to get #31850 working in any case, as it would simply avoid long rebuilds in the worst cases; most docker images rebuild in < 2 minutes, except MSAN...
I guess you want review here and then address it as it comes in? Once review is finished, the app will be installed and reviewers can also look at a "real" run in this repo?
looked at c0ad2b6~23 🎬
Signature:
untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: looked at c0ad2b6aa8e8c31c9f9c9ea2b35ca86f7985c490~23 🎬
gylJtm++jv0E+65SRoLFPC+ef+fwpVJftiMQ+ziB1uRZAF2MwE7TW3JaEv8iJIpZsRmForPR0jik8/6QvUs+BQ==
Concept ACK.
> we qualify for an open source discount of 50%.
> We would be dependent on Cirrus infra...
We shouldn't be surprised when Cirrus suddenly changes its modus operandi, including its advertised open-source discount or general availability.
Certainly, it is good to be wary of that. I think this is equally true for all cloud providers though. It's my belief that if we complete this migration we are reasonably well protected against this risk for the following reasons:
We seem to have a good working relationship so far with Cirrus; @m3dwards has a responsive and helpful contact there. Of course, we are in the tendering stage so there is perhaps extra impetus to be helpful to us, but I don't see any reason why a historical precedent of limiting free runners (which were allegedly being abused for crypto mining) should make them appear any riskier to paid/premium customers than any other provider.
Agree that this may happen with any third party (including GitHub itself). If we want to switch back to the self-hosted runners, it should be as trivial as
Force-pushed from c0ad2b6 to c126475.
Pushed c126475 with a CI run on the master branch at https://github.com/testing-cirrus-runners/bitcoin2/actions/runs/16368410249
Concept ACK! I think finishing the migration would close #31965
looked at c126475~20 💇
Signature:
untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: looked at c126475ed7a17ec9030066056e31846c7124dcf~20 💇
kep8ZK4UJEamaLijXtMFwgjSmf1fhSJuF49dbZ/NHDe/5jmIZR2EzJa0ewjjGov4n3xWZMN5f3LKRekrMZ6HAw==
Force-pushed from fe0906f to b4e85f5.
Thanks for the review on this so far.

Whilst we had tested the docker registry caching on PRs successfully, because I was opening them myself (and was the owner of the parent repo) repo-level variables were available to me that are not available to 3rd-party pull requests. The short of this is that the docker registry cache setup could not pull from the registry on pull requests, so we didn't think it was suitable for our purposes. We have switched to the `gha` cache backend instead.

A push to master can be found here: https://github.com/testing-cirrus-runners/bitcoin2/actions/runs/16529929752
And a pull request (from a 3rd party account) here: https://github.com/testing-cirrus-runners/bitcoin2/actions/runs/16531270443?pr=3

A new commit has been added, 0bd758e, to fix a (new?) issue we experienced with the asan job where the runner host appeared to update its image and kernel. The cached docker image for the ASAN job then had the incorrect

Marking as ready for review now, as I think this is conceptually ready.
Co-authored-by: Max Edwards <youwontforgetthis@gmail.com>
Removed as unused.
Previously, jobs were running on a large multi-core server where a default of 10 jobs made sense (or may even have been on the low side). Using hosted runners with fixed (and lower) numbers of vCPUs, we should adapt compilation to match the number of CPUs we have dynamically. The detection is cross-platform, covering macOS and Linux only.
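A minimal sketch of that idea (variable names are illustrative, not necessarily those used in the actual commit):

```sh
# Derive the build parallelism from the runner's CPU count instead of a
# hard-coded default of 10 jobs.
if [ "$(uname)" = "Darwin" ]; then
  NPROC=$(sysctl -n hw.logicalcpu)   # macOS
else
  NPROC=$(nproc)                     # Linux
fi
# Hand the result to the build, e.g. as the make/ctest parallelism flag.
export MAKEJOBS="-j${NPROC}"
```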
Print the ccache hit-rate for the job using a GitHub annotation if it was below 75%.
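Roughly, the check could look like the following sketch (the exact threshold handling and how HIT_RATE is computed are assumptions, not lifted from the commit):

```sh
# Emit a GitHub Actions annotation when the ccache hit rate is low.
# HIT_RATE is assumed to have been computed from the ccache statistics earlier.
THRESHOLD=75
if [ "${HIT_RATE%%.*}" -lt "${THRESHOLD}" ]; then
  # "::warning::" is the GitHub workflow command that renders an annotation
  # on the run's summary page.
  echo "::warning::ccache hit rate ${HIT_RATE}% is below ${THRESHOLD}%"
fi
```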
Docker currently warns that we are missing a default value. Set this to `scratch`, which will error if an appropriate image tag is not passed in, to silence the warning.
ci/lint_run.sh: Only used in .cirrus.yml. Refer to test/lint/README.md on how to run locally.
ci/lint_run_all.sh: Only used in .cirrus.yml for stale re-runs of old pull request tasks.
Force-pushed from 8a7eb6e to 3c5da69.
re-ACK 3c5da69 🏗
ACK 3c5da69
ACK 3c5da69. Not particularly familiar with GitHub Actions or Cirrus, but it seems like all of the tasks that we were running on Cirrus previously are correctly ported over and working.
@@ -26,6 +26,7 @@ if [ -z "$DANGER_RUN_CI_ON_HOST" ]; then
--file "${BASE_READ_ONLY_DIR}/ci/test_imagefile" \
--build-arg "CI_IMAGE_NAME_TAG=${CI_IMAGE_NAME_TAG}" \
--build-arg "FILE_ENV=${FILE_ENV}" \
--build-arg "BASE_ROOT_DIR=${BASE_ROOT_DIR}" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit in 9c2b96e: Actually, on second thought, this seems a bit fragile. It seems this would break the use-case of using several different local folders to run the CI, because the CI images would have different, non-deterministic paths embedded?
I presume the correct fix would be to download the sdk to a hard-coded path in the container image, and then adjust the pre-existing rsync call to copy it at runtime in the container, if needed.
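A hypothetical sketch of that approach (paths and variable names are illustrative, not taken from the repo):

```sh
# Bake the SDK into a fixed, deterministic path when the CI image is built,
# then copy it into the job's working tree at container runtime only if needed.
SDK_IMAGE_DIR="/ci_image_sdks"                   # fixed path inside the image
SDK_RUNTIME_DIR="${BASE_ROOT_DIR}/depends/SDKs"  # per-run location
if [ -d "${SDK_IMAGE_DIR}" ]; then
  mkdir -p "${SDK_RUNTIME_DIR}"
  rsync -a "${SDK_IMAGE_DIR}/" "${SDK_RUNTIME_DIR}/"
fi
```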
27.x looks close to EOL in one month, according to https://bitcoincore.org/en/lifecycle/, so I guess it could just die with the old CI?
Yea. I think it's fine to leave
@@ -284,7 +301,8 @@ jobs:
windows-cross:
name: 'Linux->Windows cross, no tests'
runs-on: ubuntu-latest
needs: runners
runs-on: ${{ needs.runners.outputs.use-cirrus-runners == 'true' && 'ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04-md' || 'ubuntu-24.04' }} |
nit: Any reason to use `md` here? Most of the time 7/8 CPUs will be idle (to download the caches or to upload the artifact). Also, the similar mac-cross build is using `sm`.
Suggested diff:
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 6876b8328d..2efe17a04e 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -302,7 +302,7 @@ jobs:
windows-cross:
name: 'Linux->Windows cross, no tests'
needs: runners
- runs-on: ${{ needs.runners.outputs.use-cirrus-runners == 'true' && 'ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04-md' || 'ubuntu-24.04' }}
+ runs-on: ${{ needs.runners.outputs.use-cirrus-runners == 'true' && 'ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04-sm' || 'ubuntu-24.04' }}
if: ${{ vars.SKIP_BRANCH_PUSH != 'true' || github.event_name == 'pull_request' }}
env:
@@ -433,7 +433,7 @@ jobs:
file-env: './ci/test/00_setup_env_arm.sh'
- name: 'ASan + LSan + UBSan + integer, no depends, USDT'
- cirrus-runner: 'ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04-lg' # has to match container in ci/test/00_setup_env_native_asan.sh for tracing tools
+ cirrus-runner: 'ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04-md' # has to match container in ci/test/00_setup_env_native_asan.sh for tracing tools
fallback-runner: 'ubuntu-24.04'
timeout-minutes: 120
file-env: './ci/test/00_setup_env_native_asan.sh'
@@ -445,7 +445,7 @@ jobs:
file-env: './ci/test/00_setup_env_mac_cross.sh'
- name: 'No wallet, libbitcoinkernel'
- cirrus-runner: 'ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04-md'
+ cirrus-runner: 'ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04-sm'
fallback-runner: 'ubuntu-24.04'
timeout-minutes: 120
file-env: './ci/test/00_setup_env_native_nowallet_libbitcoinkernel.sh'
@@ -481,7 +481,7 @@ jobs:
file-env: './ci/test/00_setup_env_native_tidy.sh'
- name: 'TSan, depends, no gui'
- cirrus-runner: 'ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04-lg'
+ cirrus-runner: 'ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04-md'
fallback-runner: 'ubuntu-24.04'
timeout-minutes: 120
file-env: './ci/test/00_setup_env_native_tsan.sh'
@@ -528,7 +528,7 @@ jobs:
lint:
name: 'lint'
needs: runners
- runs-on: ${{ needs.runners.outputs.use-cirrus-runners == 'true' && 'ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04-sm' || 'ubuntu-24.04' }}
+ runs-on: ${{ needs.runners.outputs.use-cirrus-runners == 'true' && 'ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04-xs' || 'ubuntu-24.04' }}
if: ${{ vars.SKIP_BRANCH_PUSH != 'true' || github.event_name == 'pull_request' }}
timeout-minutes: 20
env:
This will also reduce the machine size for the asan task, which similarly spends 20 (!) minutes in a single test on a single CPU (tool_signet_miner). (Using a larger runner won't help speed this test up here.) I'd say maybe the test should be adjusted to be faster.
Also, the linters are single-threaded, so are good with the smallest instance.
And tsan and nowallet seem to be fine with a smaller instance as well.
The patch overall should free up 1.5 instances to be used in pull requests in parallel. Thoughts?
Runner size wasn't extensively tested for performance; your logic for going with smaller runners seems sound to me.
This changeset migrates all current self-hosted CI jobs over to hosted Cirrus Runners.
These runners cost a flat rate of $150/month, and we qualify for an open source discount of 50%, so they are $75/month/runner.
One "runner" should more accurately be thought of in terms of the number of vCPUs you are purchasing (https://cirrus-runners.app/pricing/), or in terms of "concurrency", where 1 runner gets you 1.0 concurrency.
e.g. a Linux x86 runner gets you 16 vCPUs (1.0 concurrency) and 64 GB RAM, to be provisioned as you choose amongst one or more jobs.
Cirrus Runners currently only support Linux (x86 and Arm64) and macOS (Arm64).
This changeset does not move the existing GitHub Actions native macOS runners away from GitHub's infrastructure. This could be a follow-up optimisation.
Runs from this changeset using Cirrus Runners can be found at: https://github.com/testing-cirrus-runners/bitcoin2/actions which shows an uncached run on master (CI#1), an outside pull request (CI#3) and an updated push to master (CI#4).
These workflows were run on 10 runners, and we would recommend purchasing a similar number for our CI in this repo to achieve the speed and concurrency we expect.
We include some optional performance commits, but these could be split out and made into followups or dropped entirely.
Benefits
Maintenance
As we are not self-hosting, nobody needs to maintain servers, disks etc.
Bus factor
Currently we have a very small number of people with the know-how working on server setup and maintenance. This setup fixes that so that "anyone" familiar with GitHub-style CI systems can work on it.
Scaling
These do not "auto-scale" or have "unlimited concurrency" like some solutions, but if we want more workers/CPUs to increase parallelism, or to increase the runner size of certain jobs for a speed-up, we can simply buy more concurrency using the web interface.
Speed
Runtimes approximate current runtimes pretty well, with some jobs being faster.
Caching improvements on pull request (re-runs) are left as future optimisations from the current changeset (see below).
GitHub workflow syntax
With a migration to the more commonly used GitHub workflow syntax, migration to other providers in the future is often as simple as a one-line change (and installing a new GitHub app to the repo).
If we decide to self-host again, then we can also self-host GitHub runners (using https://github.com/actions/runner) and keep the new GH-style CI syntax.
Reporting
GitHub workflows provide nicer built-in reporting directly on the "Checks" page of a PR. This includes more-detailed action reporting and a host of pretty nice integrated features, such as Workflow Commands for creating annotations that can print messages during runs. See for example at the bottom of this window, where we report the `ccache` hit rate if it was below 90%: https://github.com/testing-cirrus-runners/bitcoin/actions/runs/16163449125?pr=1
These could be added conditionally into our CI scripts to report interesting or other information.
Costs
Financial
Relative to competitors, Cirrus Runners are cheap for the hosted-CI world. However, they are likely more expensive than our current setup, or a well-configured (new) self-hosted setup.
If we started with 10 runners to be shared amongst all migrated jobs, this would total $750/mo = $9,000/yr.
Note that we are not trying to compete here on cost directly.
Dependencies
We would be dependent on Cirrus infra.
Forks
- Which runner a job uses is selected via the `runs-on:` directive.
- GitHub does not allow the `env` (or github) context to be used in this field in particular, for some reason.
- Forks will therefore need to change the `runs-on:` field in the ci.yml file if they want to use Cirrus Runners too.

All jobs work on forks, but will run (slowly) on GitHub's native free hosted runners instead of Cirrus runners. They will also suffer from poor cache hit-rates, but there's nothing that can be done about that, and the situation is an improvement on today.
Migration process
The main org should also, in addition to pulling code changes:
- Allow `docker/setup-buildx-action@v3` and `docker/login-action@v3` to be run in this repo.

Caching
For the number of CI jobs we have, cache usage on GitHub would be an issue, as GH only provides 10 GB of cache space per repo. However, Cirrus provides 10 GB per runner, which scales better with the number of runners.
The `cirruslabs/action/[restore|save]` action we use here redirects this to Cirrus' own cache, which is both faster and larger.
In the case that a user is running CI on a fork, the Cirrus cache falls back transparently to the GitHub default cache without error.
ccache, depends-sources, built-depends
- Cached with the `cirruslabs/actions/cache` action.
- `push`: restores and saves caches.
- `pull_request`: restores but does not save caches.

This means a new pull request should hit a pretty relevant cache. Old pull requests which are not being rebased on master may suffer from a lower cache hit-rate. If we save caches on all pull request runs, we run the risk of evicting recent (and more relevant) cache blobs. It may be possible in a future optimisation to widen this to save on pull request runs too, but it will also depend on how many runners we provision and what cache churn rates are like in the main repo.
Docker build layer caching
- Uses the `gha` cache backend (a usage sketch follows below).
- Separate from the `ccache`, `depends-sources` and `depends-built` caches.
- The `gha` cache allows `--cache-from` to be used from pull requests, which does not work using a registry cache type (technically we could use a public read-only token to get this working, but that feels wrong).
- This backend does network i/o and so is marginally slower than our current disk i/o cache.
But what about x?
We have tested many other providers, including Runs-on, Buildjet, WarpBuild, and GitHub hosted runners (and investigated even more), but they all fall short in one way or another.
One of them, for example, needs overly broad permissions (`Administration: Read|Write`) for our use-case.

TODO:
To complete the migration from self-hosted to hosted for this repo, the backport branches `27.x`, `28.x` and `29.x` would also need their CI ported, but these are left for followups to this change (and pending review/changes here first).

Work and experimentation undertaken with @m3dwards.