Summary
This ticket lays out a design for a new feature in Rush, tentatively called "cooperative builds" or "cobuilds".
Imagine a CI pipeline defined as follows:
```yaml
jobs:
  - job: primary
    steps:
      - git clone
      - rush install
      - rush build
      - publish artifacts
  - job: extra1
    steps:
      - git clone
      - rush install
      - rush build
  - job: extra2
    steps:
      - git clone
      - rush install
      - rush build
```
This pipeline runs on 3 separate VMs in parallel, each one ending up with a fully built copy of the entire monorepo, and then only one (the primary) publishes artifacts (such as test results, code coverage, client binaries, tarballs, whatever else the pipeline might produce).
If you have enabled the build cache and phased builds, then each of the 3 jobs can already "communicate" with each other -- if `extra2` has built project A, and `primary` needs project A, it can pull the build that was just cached by `extra2` without rebuilding it. The only thing that's missing is some way to prevent all 3 jobs from building the same things in parallel -- a locking mechanism.
Details
During a build, when Rush is deciding what operation to perform next out of the graph, add some new steps:
- First, pick the best operation to perform (as normal).
- Then, check if that script is a no-op -- if it is, skip it (as normal).
- Then, check if it can be pulled from the cache -- if it can, continue (as normal).
- Finally, if you are actually going to start building it, instead, ask `CobuildLock.getLock(context, cacheKey)`. If this returns `true`, go ahead with the build -- if it returns `false`, add the operation to a list of pending operations being built by a cobuild, then return to the top and attempt to pick the next-best operation to perform.
- If there is nothing for you to build, cycle through the list of pending operations and attempt to get the lock again.
- If there is nothing for you to build, and you cannot get the lock on any pending operation to build it, pause for 1-5 seconds and start over.
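The loop above could be sketched roughly as follows. This is a hypothetical illustration, not the actual Rush scheduler: `tryGetLock` stands in for `CobuildLock.getLock`, `run` stands in for executing one operation, and the pause duration is a parameter only so the sketch is easy to exercise.

```ts
type Operation = { name: string; cacheKey: string };

// Hypothetical sketch of the cobuild-aware scheduling loop described above.
async function runWithCobuildLocks(
  ready: Operation[],
  tryGetLock: (cacheKey: string) => Promise<boolean>,
  run: (op: Operation) => Promise<void>,
  pauseMs: number = 2000
): Promise<void> {
  let pending: Operation[] = [];
  while (ready.length > 0 || pending.length > 0) {
    // Try the next-best ready operations first, then previously deferred ones.
    const candidates = [...ready, ...pending];
    let ran = false;
    for (const op of candidates) {
      if (await tryGetLock(op.cacheKey)) {
        ready = ready.filter((o) => o !== op);
        pending = pending.filter((o) => o !== op);
        // Normally: build the operation, renewing the lock every 10 seconds,
        // then push the result to the build cache.
        await run(op);
        ran = true;
        break;
      } else if (ready.includes(op)) {
        // Another cobuild owns this operation; defer it and try the next one.
        ready = ready.filter((o) => o !== op);
        pending.push(op);
      }
    }
    if (!ran && ready.length === 0 && pending.length > 0) {
      // Nothing buildable and no lock obtainable: pause briefly and retry.
      await new Promise((resolve) => setTimeout(resolve, pauseMs));
    }
  }
}
```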
While performing an operation, if cobuilds are enabled, there is one additional step to add:
- While building any operation, every 10 seconds, call `CobuildLock.renewLock(context, cacheKey)`. This ensures your lock on the operation remains fresh. Locks will automatically expire after some period of time (perhaps 30 seconds) to account for a catastrophic failure on one of the cobuild VMs, at which point another build will be able to get the lock and perform that operation.
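The renewal step could be wrapped around an operation like this hypothetical helper, where `build` and `renewLock` are stand-ins for the real build execution and `CobuildLock.renewLock`:

```ts
// Hypothetical helper: renew the lock on a timer while the build runs, then
// stop renewing so the lock expires naturally (~30s) if this VM crashes.
async function buildWithLockRenewal(
  build: () => Promise<void>,
  renewLock: () => Promise<void>,
  renewIntervalMs: number = 10_000
): Promise<void> {
  const timer = setInterval(() => { void renewLock(); }, renewIntervalMs);
  try {
    await build();
  } finally {
    // Whether the build succeeded or threw, stop refreshing the lock.
    clearInterval(timer);
  }
}
```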
With these two changes, you can make efficient use of the VMs from all 3 cobuilds, almost as well as if you were building on a single machine with 3x the cores.
Implementing CobuildLock
The tentatively named `CobuildLock` requires two API methods:
- `getLock(context, cacheKey)` takes some unique context string (like a BuildID provided by your CI runner, potentially) and a cache key, which could be the same cache key used in the Build Cache. It should return `true` if you were able to obtain the lock, or `false` if it is owned by someone else.
- `renewLock(context, cacheKey)` refreshes the expiration time of an existing lock, and should be called periodically by the owner of the lock so that someone else does not start performing the same operation.
Note that `context` needs to be provided by the monorepo maintainer, so it is likely read from an environment variable, such as `RUSH_COBUILD_CONTEXT`. If you wanted to ensure that only the builds from a given BuildID cooperated, you could set `RUSH_COBUILD_CONTEXT=$(BuildID)`. If you were more permissive, and wanted all PR and CI builds building anything with the same cache key to cooperate, then you could set it to a static value, or an empty string.
The internal implementation of the `CobuildLock` might vary (maybe even using plugins in the future), and probably requires its own dedicated config file in `common/config/rush`. One natural option for the backing store is `redis`: Azure Cache, Google Memorystore, and Amazon ElastiCache are all cloud-hosted Redis services, and developers can also run one locally or on a separate server of their choosing with trivial effort, so it's extremely versatile.

With `redis`, the implementations of these functions might look like this:
```ts
async function getLock(context: string, cacheKey: string): Promise<boolean> {
  // Atomically increment the lock key and set its expiration in one round trip.
  const results = await redisClient.multi()
    .incr(`cobuild-lock:${context}:${cacheKey}`)
    .expire(`cobuild-lock:${context}:${cacheKey}`, 30)
    .exec();
  // If the increment produced 1, we are the first claimant and own the lock.
  return results[0] === 1;
}

async function renewLock(context: string, cacheKey: string): Promise<void> {
  await redisClient.expire(`cobuild-lock:${context}:${cacheKey}`, 30);
}
```
Alternatives
Build Farms
Note that a Rush "cobuild" has the same goals as, but is not the same as, a build farm (for example, dedicated Bazel-esque build farm machines). The idea behind this "cobuilds" design is that you can take an existing, working Rush build, and just add another parallel leg and a simple locking mechanism to get additional parallelism "for free", without investing in more complicated infrastructure. It probably cannot compete in efficiency with a build farm if Rush does eventually support Bazel-esque build farms (which is currently being investigated as well).
Redis
Redis seems like a natural fit because it offers (1) built-in atomic increments designed to be used as locks/semaphores and (2) built-in expiration management, so there's very little extra work for Rush to do. It would be interesting to see if there are other options that would be even easier for monorepo maintainers to configure (possibly at the cost of a more complicated internal implementation, as Rush might have to do more micromanagement of the lock).
Configuration Notes
To be used effectively, this feature needs a way to be turned on and off via environment variables, and also (if Redis is the store used) a way to provide Redis credentials via environment variables.
It's possible that you could turn on cobuilds exclusively via environment variables, without any need for a JSON config file, for example:
```
RUSH_COBUILD_ENABLED=1
RUSH_COBUILD_REDIS_HOST=rediss://<ip>
RUSH_COBUILD_REDIS_PASS=<pass>
RUSH_COBUILD_CONTEXT=$BUILD_ID
```
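For illustration, reading those variables into a settings object might look like the following sketch. The interface and function names here are hypothetical, not part of Rush:

```ts
// Hypothetical sketch: translate the environment variables above into a
// strongly typed settings object. None of these names exist in Rush today.
interface ICobuildSettings {
  enabled: boolean;
  redisUrl: string | undefined;
  redisPassword: string | undefined;
  context: string;
}

function readCobuildSettings(env: Record<string, string | undefined>): ICobuildSettings {
  return {
    enabled: env.RUSH_COBUILD_ENABLED === '1',
    redisUrl: env.RUSH_COBUILD_REDIS_HOST,
    redisPassword: env.RUSH_COBUILD_REDIS_PASS,
    // An absent context falls back to the permissive empty string.
    context: env.RUSH_COBUILD_CONTEXT ?? ''
  };
}
```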
Handling Failures
The biggest outstanding design issue is the approach for handling failures.
Normally, if an operation fails, we do not save a cache entry. Without special handling, this could result in a possible worst case scenario:
- Job `extra1` gets the lock on project A, spends 3 minutes building it, fails.
- Job `extra2` gets the lock after the 30-second expiration, spends 3 minutes building it, fails.
- Job `primary` finally gets the lock, spends 3 minutes building it, fails.

All told, what should be a 3 minute failure (perhaps it's a long Jest test run that contains several broken unit tests) instead takes 3:00 + 0:30 + 3:00 + 0:30 + 3:00 = 10 minutes.
The only workable solution is to allow a failing build to be cached and retrieved... but without impacting the normal behavior of the build cache, where (in general) "re-running" a failed build will always reattempt portions that have failed in the past.
Possible approaches:
- (Approach 1) Cache failed builds in a separate cache key, `<context>:<key>:failed`. This ensures we'd never conflict with regular successful cached builds, so it minimizes the additional logic required. The downside is that we need 2 cache checks for each project when selecting a project to build. Note that what is stored for a failed build needs some additional data (like a log of the failure), so we can assemble a normal build failure within Rush.
- (Approach 2) Cache failed builds in a separate cache key, but also make cobuild communication more robust. Instead of relying on the build cache for transferring builds plus a separate locking mechanism, we expand the Redis interaction so that Rush can periodically check on the status of projects that are being cobuilt, and the result of that check is a pointer to a cache key -- so the 4 possible answers might be `project is still building`, `lock expired`, `success - get the build at <key>`, and `failed - get the build at <context>:<key>:failed`. This streamlines the checking behavior but will require more code.
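Approach 1's two-lookup check could be sketched like this. The names are hypothetical, and cache existence checks are reduced to a set of known keys so the sketch is self-contained:

```ts
// Hypothetical sketch of Approach 1: check for a successful entry under the
// normal key, then for a failure entry under <context>:<key>:failed.
type CobuildCacheCheck =
  | { result: 'success'; cacheKey: string }
  | { result: 'failed'; cacheKey: string }
  | { result: 'miss' };

function checkCobuildCache(
  knownCacheKeys: Set<string>,
  context: string,
  cacheKey: string
): CobuildCacheCheck {
  if (knownCacheKeys.has(cacheKey)) {
    return { result: 'success', cacheKey };
  }
  const failedKey = `${context}:${cacheKey}:failed`;
  if (knownCacheKeys.has(failedKey)) {
    // The failed entry would also carry a log of the failure so that Rush
    // can replay a normal build failure.
    return { result: 'failed', cacheKey: failedKey };
  }
  return { result: 'miss' };
}
```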
Note that both of these approaches make an idea floated above -- using a blank context -- rather dangerous: it would make a transient failure the permanent result of a particular project build until a change is made. To avoid this we'd need to either always use a context (like the Build ID), or give failures some type of shelf life (perhaps the Redis result key would have a shorter expiry, like 30 minutes, if it represents a failure, ensuring that transient failures eventually disappear even if all builds are being performed as cobuilds).
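The shelf-life idea could look like this hypothetical companion to the lock API. The store is injected so the sketch is self-contained; with node-redis the write would be `redisClient.set(key, value, { EX: 1800 })`:

```ts
// Hypothetical sketch: failure results get a short expiry (30 minutes), so a
// transient failure does not become the permanent result of a project build.
interface IResultStore {
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function markFailed(
  store: IResultStore,
  context: string,
  cacheKey: string,
  failedCacheKey: string
): Promise<void> {
  await store.set(
    `cobuild:${context}:${cacheKey}:result`,
    `failed;${failedCacheKey}`,
    30 * 60 // much shorter than a success entry, which can live indefinitely
  );
}
```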
Also note that there is an opportunity here to update the build cache format. If what we save in a build cache entry is not just a tarball, but rather a success/failure flag, a console log, and the tarball, that would reduce the custom logic we need to write for the cobuilds feature.
Handling rebuilds
A `rush rebuild` or `rush retest` command will save operations to the build cache once completed, but does not retrieve them (since the "incremental" flag is disabled). However, we still need a cobuild to be able to retrieve a finished build from another cobuild in this case.
The solution in Approach 2 above should work for this as well -- the result of the Redis check can be `success - get the build at <key>`, and this would prompt the build to do a cache pull even though normally that operation would not do one.
A potential implementation of this more complicated interface would look like this (NOTE: totally untested and likely needs some logic edits, but shows some building blocks):
```ts
type LockState =
  | { state: 'obtained' }
  | { state: 'pending' }
  | { state: 'completed'; cacheKey: string };

async function getLockOrState(context: string, cacheKey: string): Promise<LockState> {
  const incrResult = await redisClient.incr(`cobuild:${context}:${cacheKey}:lock`);
  if (incrResult === 1) {
    // We are the first claimant: take the lock and start its expiration clock.
    await renewLock(context, cacheKey);
    return { state: 'obtained' };
  }
  const result = await redisClient.get(`cobuild:${context}:${cacheKey}:result`);
  if (result && result.startsWith('completed')) {
    return { state: 'completed', cacheKey: result.split(';')[1] };
  } else {
    return { state: 'pending' };
  }
}

async function renewLock(context: string, cacheKey: string): Promise<void> {
  await redisClient.expire(`cobuild:${context}:${cacheKey}:lock`, 30);
}

async function markPending(context: string, cacheKey: string): Promise<void> {
  await renewLock(context, cacheKey);
  await redisClient.set(`cobuild:${context}:${cacheKey}:result`, 'pending');
}

async function markCompleted(context: string, cacheKey: string, buildCacheKey: string): Promise<void> {
  await redisClient.set(`cobuild:${context}:${cacheKey}:result`, 'completed;' + buildCacheKey);
}
```