Summary
This ticket lays out a design for a new feature in Rush, tentatively called "cooperative builds" or "cobuilds".
Imagine a CI pipeline defined as follows:
```yaml
jobs:
  - job: primary
    steps:
      - git clone
      - rush install
      - rush build
      - publish artifacts
  - job: extra1
    steps:
      - git clone
      - rush install
      - rush build
  - job: extra2
    steps:
      - git clone
      - rush install
      - rush build
```
This pipeline runs on 3 separate VMs in parallel, each one ending up with a fully built copy of the entire monorepo, and then only one (the primary) publishes artifacts (such as test results, code coverage, client binaries, tarballs, whatever else the pipeline might produce).
If you have enabled the build cache and phased builds, then each of the 3 jobs can already "communicate" with each other -- if `extra2` has built project A, and `primary` needs project A, it can pull the build that was just cached by `extra2` without rebuilding it. The only thing that's missing is some way to prevent all 3 jobs from building the same things in parallel -- a locking mechanism.
Details
During a build, when Rush is deciding what operation to perform next out of the graph, add some new steps:
- First, pick the best operation to perform (as normal).
- Then, check if that script is a no-op -- if it is, skip it (as normal).
- Then, check if it can be pulled from the cache -- if it can, continue (as normal).
- Finally, if you are actually going to start building it, instead, ask `CobuildLock.getLock(context, cacheKey)`. If this returns `true`, go ahead with the build -- if it returns `false`, add the operation to a list of pending operations being built by a cobuild, then return to the top and attempt to pick the next-best operation to perform.
- If there is nothing for you to build, cycle through the list of pending operations and attempt to get the lock again.
- If there is nothing for you to build, and you cannot get the lock on any pending operation to build it, pause for 1-5 seconds and start over.
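The loop above could be sketched roughly as follows. This is a hypothetical illustration, not the actual Rush scheduler: `tryGetLock` stands in for `CobuildLock.getLock`, `run` stands in for executing one operation, and the pause duration is a parameter only so the sketch is easy to exercise.

```ts
type Operation = { name: string; cacheKey: string };

// Hypothetical sketch of the cobuild-aware scheduling loop described above.
async function runWithCobuildLocks(
  ready: Operation[],
  tryGetLock: (cacheKey: string) => Promise<boolean>,
  run: (op: Operation) => Promise<void>,
  pauseMs: number = 2000
): Promise<void> {
  let pending: Operation[] = [];
  while (ready.length > 0 || pending.length > 0) {
    // Try the next-best ready operations first, then previously deferred ones.
    const candidates = [...ready, ...pending];
    let ran = false;
    for (const op of candidates) {
      if (await tryGetLock(op.cacheKey)) {
        ready = ready.filter((o) => o !== op);
        pending = pending.filter((o) => o !== op);
        // Normally: build the operation, renewing the lock every 10 seconds,
        // then push the result to the build cache.
        await run(op);
        ran = true;
        break;
      } else if (ready.includes(op)) {
        // Another cobuild owns this operation; defer it and try the next one.
        ready = ready.filter((o) => o !== op);
        pending.push(op);
      }
    }
    if (!ran && ready.length === 0 && pending.length > 0) {
      // Nothing buildable and no lock obtainable: pause briefly and retry.
      await new Promise((resolve) => setTimeout(resolve, pauseMs));
    }
  }
}
```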
While performing an operation, if cobuilds are enabled, there is one additional step to add:
- While building any operation, every 10 seconds, call `CobuildLock.renewLock(context, cacheKey)`. This ensures your lock on the operation remains fresh. Locks will automatically expire after some period of time (perhaps 30 seconds) to account for a catastrophic failure on one of the cobuild VMs, at which point another build will be able to get the lock and perform that operation.
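The renewal step could be wrapped around an operation like this hypothetical helper, where `build` and `renewLock` are stand-ins for the real build execution and `CobuildLock.renewLock`:

```ts
// Hypothetical helper: renew the lock on a timer while the build runs, then
// stop renewing so the lock expires naturally (~30s) if this VM crashes.
async function buildWithLockRenewal(
  build: () => Promise<void>,
  renewLock: () => Promise<void>,
  renewIntervalMs: number = 10_000
): Promise<void> {
  const timer = setInterval(() => { void renewLock(); }, renewIntervalMs);
  try {
    await build();
  } finally {
    // Whether the build succeeded or threw, stop refreshing the lock.
    clearInterval(timer);
  }
}
```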
With these two changes, you can make efficient use of the VMs from all 3 cobuilds, almost as well as if you were building on a single machine with 3x the cores.
Implementing CobuildLock
The tentatively named `CobuildLock` requires two API methods:
- `getLock(context, cacheKey)` takes some unique context string (like a BuildID provided by your CI runner, potentially) and a cache key, which could be the same cache key used in the Build Cache. It should return `true` if you were able to obtain the lock, or `false` if it is owned by someone else.
- `renewLock(context, cacheKey)` refreshes the expiration time of an existing lock, and should be called periodically by the owner of the lock so that someone else does not start performing the same operation.
Note that `context` needs to be provided by the monorepo maintainer, so it is likely read from an environment variable, such as `RUSH_COBUILD_CONTEXT`. If you wanted to ensure that only the builds from a given BuildID cooperated, you could set `RUSH_COBUILD_CONTEXT=$(BuildID)`. If you were more permissive, and wanted all PR and CI builds building anything with the same cache key to cooperate, then you could set it to a static value, or an empty string.
The internal implementation of the `CobuildLock` might vary (maybe even using plugins in the future), and probably requires its own dedicated config file in `common/config/rush`. One natural option for the backing store is `redis`: Azure Cache, Google Memorystore, and Amazon ElastiCache are all cloud-hosted Redis services, and developers can also run one locally or on a separate server of their choosing with trivial effort, so it's extremely versatile.

With `redis`, the implementations of these functions might look like this:
```ts
async function getLock(context: string, cacheKey: string): Promise<boolean> {
  // Atomically increment the lock key and set its expiration in one round trip.
  const results = await redisClient.multi()
    .incr(`cobuild-lock:${context}:${cacheKey}`)
    .expire(`cobuild-lock:${context}:${cacheKey}`, 30)
    .exec();
  // If the increment produced 1, we are the first claimant and own the lock.
  return results[0] === 1;
}

async function renewLock(context: string, cacheKey: string): Promise<void> {
  await redisClient.expire(`cobuild-lock:${context}:${cacheKey}`, 30);
}
```
Alternatives
Build Farms
Note that a Rush "cobuild" has the same goals as, but is not the same as, a build farm (for example, dedicated Bazel-esque build farm machines). The idea behind this "cobuilds" design is that you can take an existing, working Rush build, and just add another parallel leg and a simple locking mechanism to get additional parallelism "for free", without investing in more complicated infrastructure. It probably cannot compete in efficiency with a build farm if Rush does eventually support Bazel-esque build farms (which is currently being investigated as well).
Redis
Redis seems like a natural fit because it offers (1) built-in atomic increments designed to be used as locks/semaphores and (2) built-in expiration management, so there's very little extra work for Rush to do. It would be interesting to see if there are other options that would be even easier for monorepo maintainers to configure (possibly at the cost of a more complicated internal implementation, as Rush might have to do more micromanagement of the lock).
Configuration Notes
To be used effectively, this feature needs a way to be turned on and off via environment variables, and also (if Redis is the store used) a way to provide Redis credentials via environment variables.
It's possible that you could turn on cobuilds exclusively via environment variables, without any need for a JSON config file, for example:
```
RUSH_COBUILD_ENABLED=1
RUSH_COBUILD_REDIS_HOST=rediss://<ip>
RUSH_COBUILD_REDIS_PASS=<pass>
RUSH_COBUILD_CONTEXT=$BUILD_ID
```
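For illustration, reading those variables into a settings object might look like the following sketch. The interface and function names here are hypothetical, not part of Rush:

```ts
// Hypothetical sketch: translate the environment variables above into a
// strongly typed settings object. None of these names exist in Rush today.
interface ICobuildSettings {
  enabled: boolean;
  redisUrl: string | undefined;
  redisPassword: string | undefined;
  context: string;
}

function readCobuildSettings(env: Record<string, string | undefined>): ICobuildSettings {
  return {
    enabled: env.RUSH_COBUILD_ENABLED === '1',
    redisUrl: env.RUSH_COBUILD_REDIS_HOST,
    redisPassword: env.RUSH_COBUILD_REDIS_PASS,
    // An absent context falls back to the permissive empty string.
    context: env.RUSH_COBUILD_CONTEXT ?? ''
  };
}
```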
Handling Failures
The biggest outstanding design issue is the approach for handling failures.
Normally, if an operation fails, we do not save a cache entry. Without special handling, this could result in a possible worst case scenario:
- Job `extra1` gets the lock on project A, spends 3 minutes building it, fails.
- Job `extra2` gets the lock after the 30-second expiration, spends 3 minutes building it, fails.
- Job `primary` finally gets the lock, spends 3 minutes building it, fails.

All told, what should be a 3 minute failure (perhaps it's a long Jest test run that contains several broken unit tests) instead takes 3:00 + 0:30 + 3:00 + 0:30 + 3:00 = 10 minutes.
The only workable solution is to allow a failing build to be cached and retrieved... but without impacting the normal behavior of the build cache, where (in general) "re-running" a failed build will always reattempt portions that have failed in the past.
Possible approaches:
- (Approach 1) Cache failed builds in a separate cache key, `<context>:<key>:failed`. This ensures we'd never conflict with regular successful cached builds, so it minimizes the additional logic required. The downside is that we need 2 cache checks for each project when selecting a project to build. Note that what is stored for a failed build needs some additional data (like a log of the failure), so we can assemble a normal build failure within Rush.
- (Approach 2) Cache failed builds in a separate cache key, but also make cobuild communication more robust. Instead of relying on the build cache for transferring builds plus a separate locking mechanism, we expand the Redis interaction so that Rush can periodically check on the status of projects that are being cobuilt, and the result of that check is a pointer to a cache key -- so the 4 possible answers might be `project is still building`, `lock expired`, `success - get the build at <key>`, and `failed - get the build at <context>:<key>:failed`. This streamlines the checking behavior but will require more code.
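Approach 1's two-lookup check could be sketched like this. The names are hypothetical, and cache existence checks are reduced to a set of known keys so the sketch is self-contained:

```ts
// Hypothetical sketch of Approach 1: check for a successful entry under the
// normal key, then for a failure entry under <context>:<key>:failed.
type CobuildCacheCheck =
  | { result: 'success'; cacheKey: string }
  | { result: 'failed'; cacheKey: string }
  | { result: 'miss' };

function checkCobuildCache(
  knownCacheKeys: Set<string>,
  context: string,
  cacheKey: string
): CobuildCacheCheck {
  if (knownCacheKeys.has(cacheKey)) {
    return { result: 'success', cacheKey };
  }
  const failedKey = `${context}:${cacheKey}:failed`;
  if (knownCacheKeys.has(failedKey)) {
    // The failed entry would also carry a log of the failure so that Rush
    // can replay a normal build failure.
    return { result: 'failed', cacheKey: failedKey };
  }
  return { result: 'miss' };
}
```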
Note that both of these approaches make an idea floated above -- using a blank context -- rather dangerous: it would make a transient failure the permanent result of a particular project build until a change is made. To avoid this we'd need to either always use a context (like the Build ID), or give failures some type of shelf life (perhaps the Redis result key would have a shorter expiry, like 30 minutes, if it represents a failure, ensuring that transient failures eventually disappear even if all builds are being performed as cobuilds).
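The shelf-life idea could look like this hypothetical companion to the lock API. The store is injected so the sketch is self-contained; with node-redis the write would be `redisClient.set(key, value, { EX: 1800 })`:

```ts
// Hypothetical sketch: failure results get a short expiry (30 minutes), so a
// transient failure does not become the permanent result of a project build.
interface IResultStore {
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function markFailed(
  store: IResultStore,
  context: string,
  cacheKey: string,
  failedCacheKey: string
): Promise<void> {
  await store.set(
    `cobuild:${context}:${cacheKey}:result`,
    `failed;${failedCacheKey}`,
    30 * 60 // much shorter than a success entry, which can live indefinitely
  );
}
```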
Also note that there is an opportunity here to update the build cache format. If what we save in a build cache entry is not just a tarball, but rather a success/failure flag, a console log, and the tarball, that would reduce the custom logic we need to write for the cobuilds feature.
Handling rebuilds
A `rush rebuild` or `rush retest` command will save operations to the build cache once completed, but does not retrieve them (since the "incremental" flag is disabled). However, we still need a cobuild to be able to retrieve a finished build from another cobuild in this case.
The solution in Approach 2 above should work for this as well -- the result of the Redis check can be `success - get the build at <key>`, and this would prompt the build to do a cache pull even though normally that operation would not do one.
A potential implementation of this more complicated interface would look like this (NOTE: totally untested and likely needs some logic edits, but shows some building blocks):
```ts
type LockState =
  | { state: 'obtained' }
  | { state: 'pending' }
  | { state: 'completed'; cacheKey: string };

async function getLockOrState(context: string, cacheKey: string): Promise<LockState> {
  const incrResult = await redisClient.incr(`cobuild:${context}:${cacheKey}:lock`);
  if (incrResult === 1) {
    // We are the first claimant: take the lock and start its expiration clock.
    await renewLock(context, cacheKey);
    return { state: 'obtained' };
  }
  const result = await redisClient.get(`cobuild:${context}:${cacheKey}:result`);
  if (result && result.startsWith('completed')) {
    return { state: 'completed', cacheKey: result.split(';')[1] };
  } else {
    return { state: 'pending' };
  }
}

async function renewLock(context: string, cacheKey: string): Promise<void> {
  await redisClient.expire(`cobuild:${context}:${cacheKey}:lock`, 30);
}

async function markPending(context: string, cacheKey: string): Promise<void> {
  await renewLock(context, cacheKey);
  await redisClient.set(`cobuild:${context}:${cacheKey}:result`, 'pending');
}

async function markCompleted(context: string, cacheKey: string, buildCacheKey: string): Promise<void> {
  await redisClient.set(`cobuild:${context}:${cacheKey}:result`, 'completed;' + buildCacheKey);
}
```