[CI] Better utilization of CI resources and other CI improvements #5891

@hcho3

The runaway cloud cost of our Jenkins CI server has been a pressing issue (#5176). The server is hosted on AWS, which charges by the hour. #5884 finally created a mechanism to enforce a daily budget via throttling.

We now have a dashboard page to keep track of daily spending: https://xgboost-ci.net/dashboard/

Now it is time to find savings and make sure that we spend our limited CI resources where they matter.

State of the CI: The free credits from AWS ran out this month, so we now have to start drawing from the Open Collective account, which currently has 10531.16 USD. If we limit ourselves to spending 33 USD per day, the balance will last 287 days.
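As a quick sanity check on the runway figure, here is a minimal sketch of the arithmetic. The function name and the `unavailable_fraction` parameter are hypothetical and not part of the issue. Dividing the full balance by the daily cap gives about 319 days. The 287-day figure would be consistent with roughly 10% of the balance being unavailable (for example, fees), but that is only a guess and is not stated above.

```python
import math


def runway_days(balance_usd, daily_cap_usd, unavailable_fraction=0.0):
    """Return the number of whole days the balance lasts at a fixed daily cap.

    `unavailable_fraction` is a hypothetical knob for any share of the
    balance that cannot be spent on CI (an assumption, not from the issue).
    """
    usable = balance_usd * (1.0 - unavailable_fraction)
    return math.floor(usable / daily_cap_usd)


# Figures from the paragraph above: 10531.16 USD balance, 33 USD per day.
print(runway_days(10531.16, 33))        # 319
print(runway_days(10531.16, 33, 0.10))  # 287 (assumed 10% unavailable)
```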

  • Make all tests conditional on the presence of a GitHub comment (e.g. "run tests"). Right now, tests run automatically, and there are many cases where automatically starting tests is wasteful.
  • Skip tests for all draft pull requests.
  • Allow "rolling over" left-over allowance from the previous day. For example, if the daily budget is 33 USD and we spent only 10 USD yesterday, we should be able to spend up to 56 USD today (33 USD + 23 USD rollover). The spending pattern is quite spiky:
    [Screenshot: spending pattern]
  • Migrate some tests to free services. I suspect this may have limited impact, since any tests using GPUs need to run in Jenkins.
  • Pinpoint which tests cost the most. I was surprised to learn that Windows jobs account for more than 50% of the expenses, even though we run more tests on Linux:
    [Screenshot: breakdown of expenses]
  • Related: write a summary report of the AWS expenses for the last 6 months.
  • Speed up C++ builds. It takes ~10 min on Linux and ~15 min on Windows to build XGBoost with GPU support.
  • Try to get more funding, which is easier said than done.
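
The rollover item in the list above can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, only the previous day's unspent allowance carries over (it does not accumulate across several days), and overspending yesterday does not reduce today's budget. The actual throttling mechanism from #5884 may work differently.

```python
def allowance_today(daily_budget_usd, spent_yesterday_usd):
    """Return today's spending limit with a one-day rollover.

    Assumptions (not from the issue): leftover is never negative, and
    only yesterday's leftover is carried over.
    """
    leftover = max(0, daily_budget_usd - spent_yesterday_usd)
    return daily_budget_usd + leftover


# Example from the list: 33 USD budget, 10 USD spent yesterday.
print(allowance_today(33, 10))  # 56
```

Capping the carry-over at one day keeps the worst-case daily spend at twice the budget, which seems to fit the spiky spending pattern while still bounding the total.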

Other CI improvements, outside of Jenkins

  • Migrate from AppVeyor to GitHub Actions. AppVeyor tests are often a bottleneck because AppVeyor runs only one test at a time.
  • Remove CPU-only tests from the Jenkins CI pipeline. This is especially important for Windows targets, since Windows instances tend to cost more.
