🚀 Feature
Allow users to specify regions where CUDA memory allocations are satisfied from a private pool.
Motivation
CUDA graph capture is our main motivation. But it seems like a handy thing in general; there may be other uses I don't anticipate.
CUDA graph capture performs a dry run of a region of execution, freezing all CUDA work (and virtual addresses used during that work) into a "graph." The graph may be "replayed" like a single giant kernel, with greatly reduced CPU overhead as well as modestly improved GPU performance.
Because capture bakes in memory addresses, the memory used during capture must be available for the graph to use during replay. But Pytorch's current caching allocator assigns and frees memory eagerly and dynamically, so when a graph is replayed, those memory addresses may be in use by other tensors. One way to guarantee a graph's baked in addresses are always safe to reuse is to satisfy allocation requests from a graph-private memory pool during capture.
Pitch
Strawman API
The simplest API that comes to mind is something like

```python
pool = torch.cuda.MemPool()  # MemPool would be a simple Python object, its only data member would be an integer uuid.
with torch.cuda.mempool(pool):
    # all tensors created in this region have their allocations satisfied from the private pool.
    # Capture a graph here
```
During capture, temporary internal allocations can be assigned, released back to the cache, and reassigned as usual, as long as the high-water mark of memory blocks used during capture
1. isn't in use by any other tensors, and
2. has survived without being cudaFreed
when the graph is later replayed.
Condition 1 can hold because the pool is private.
To ensure condition 2 (if the same private pool is also used for some later ops or captures), the pool could optionally be told to error on internal calls to cudaFree (which might mistakenly free addresses out from under graphs), i.e. pool.set_error_on_free().
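For concreteness, here is a minimal sketch of what the Python side of this strawman might look like. Everything below the torch namespace is hypothetical: _cuda_setActiveMemPool and _cuda_setPoolErrorOnFree stand in for whatever C++ hooks the allocator would actually expose.

```python
import itertools
from contextlib import contextmanager

import torch

_pool_ids = itertools.count()

class MemPool:
    # The only real state is an integer uuid identifying the private pool.
    def __init__(self):
        self.uuid = next(_pool_ids)

    def set_error_on_free(self):
        # Hypothetical hook: tell the allocator to raise instead of cudaFree-ing
        # blocks that belong to this pool.
        torch._C._cuda_setPoolErrorOnFree(self.uuid)  # hypothetical binding

@contextmanager
def mempool(pool):
    # Hypothetical hook: route all allocations in this region to `pool`.
    torch._C._cuda_setActiveMemPool(pool.uuid)  # hypothetical binding
    try:
        yield pool
    finally:
        torch._C._cuda_setActiveMemPool(None)  # back to the default pool
```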
The simplest implementation that comes to mind (serving the above API) would be a per-pool-id list of THCCachingAllocators (or DeviceCachingAllocators).
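Very loosely, the dispatch that implies looks like the Python sketch below; the real THCCachingAllocator/DeviceCachingAllocator are C++ classes, and the names here are only illustrative.

```python
class DeviceCachingAllocator:
    # Stand-in for the real per-device caching allocator.
    def malloc(self, size, stream):
        ...

DEFAULT_POOL = 0

# One caching allocator per (device, pool_id); pool_id 0 is the default pool,
# everything else is a private pool created via torch.cuda.MemPool().
allocators = {}

def raw_alloc(device, pool_id, size, stream):
    key = (device, pool_id)
    if key not in allocators:
        allocators[key] = DeviceCachingAllocator()
    return allocators[key].malloc(size, stream)
```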
@arslan-zulfiqar made some modifications to the current allocator in a local fork to implement region-private memory pools. We've used his build to run graph-captured BERT and Mask R-CNN, and it works, including for complex cases like running some segments of the model graphed and some segments eagerly (which is essential if some segments are uncapturable, for example, because they contain data-dependent control flow). So we've demonstrated that region-specific private pools can make it safe to capture with the current allocator.
Restrictions of the Strawman API
Avoiding memory corruption and race conditions during replay imposes some restrictions.
Two graphs captured with memory from the same pool should not be replayed concurrently in parallel streams.
If several graphs are captured with memory from the same pool, and some graphs use memory/results populated by earlier graphs:

```python
with torch.cuda.mempool(pool):
    # capture graphA
    # capture graphB, which uses (and frees) some tensors created in A
    # capture graphC, which can use any memory freed during B's capture. For fun let's say it also consumes some tensors created in A.
```
These are safe:

```python
graphA.replay()
graphB.replay()
graphC.replay()
```

and so is

```python
graphA.replay()
graphC.replay()
```
This isn't:

```python
graphA.replay()
graphC.replay()  # C has no problem with its own numerics, but may overwrite some of the memory A populated on behalf of B.
graphB.replay()  # Danger of bad numerics. Data expected from A may have been overwritten by C.
```
A conservative but general rule to ensure safe replay is that graphs captured with memory from the same pool must be replayed in the same order they were captured. Still, it's a lot for users to think about.
(Something to look forward to: memory allocated by cudaMallocAsync during graph capture is not vulnerable to any of the above gotchas; see 4. cudaMallocAsync below.)
Alternatives
1 (API alternative): reduce API surface by having graphs request the pool
Initiating region-private allocation behavior could be folded into torch.cuda._Graph.capture_begin(), so the user wouldn't need with torch.cuda.mempool(pool).
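With that folding, user code might shrink to something like the sketch below. torch.cuda._Graph and its capture_begin()/capture_end() are the primitive bindings mentioned under Additional context; the idea that capture_begin() also creates and activates a private pool is the hypothetical part.

```python
import torch

x = torch.ones(8, device="cuda")
g = torch.cuda._Graph()
s = torch.cuda.Stream()

with torch.cuda.stream(s):  # capture has to run on a non-default stream anyway
    g.capture_begin()       # hypothetically also creates/activates a private pool
    y = x * 2               # allocations made during capture land in that pool
    g.capture_end()

g.replay()                  # y's storage lives in the graph's private pool
```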
2 (API and implementation alternative): ability to request a unique stream
The current allocator silos allocations per-stream. So running a graph-capture on a side stream (which the capture API requires anyway) effectively gives you a private pool, as long as you can be sure nothing else uses the stream. Right now you don't have that certainty: Pytorch uses a pool of streams under the hood, so there's no guarantee your side stream won't alias another side stream requested elsewhere in the script. However, it wouldn't be hard to expose an API that constructs and returns a unique (not from the stream pool) stream that no other stream will alias. Then you could say
```python
stream = torch.cuda.Stream(unique=True)
with torch.cuda.stream(stream):
    # capture a graph here, memory is siloed for you
```
3 (Implementation alternative): Silo allocations by pool id within one Allocator object
Instead of a list of per-pool-id THCCachingAllocators (or DeviceCachingAllocators), we could silo allocations per pool id by minimally extending the current stream-siloing logic. Specifically, we could add

```cpp
if (a->pool_id != b->pool_id) {
  return a->pool_id < b->pool_id;
}
```

here to make pool_id the most significant field of the comparator used when finding a block suitable for a particular stream and pool id.
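In other words, free blocks would effectively be ordered by a key tuple whose leading element is the pool id, something like this Python rendering (field names are illustrative, not the real C++ members):

```python
def block_sort_key(block):
    # pool_id first, so a search constrained to one (stream, pool) can never
    # land on a block that belongs to a different private pool; the existing
    # stream/size ordering is preserved below it.
    return (block.pool_id, block.stream, block.size)
```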
I don't like this idea as much as lists of per-pool-id Allocators. BlockComparator is used in many BlockPool (std::set) lookups. Each lookup is log N in the number of blocks the pool contains, so several smaller, distinct Allocators with distinct BlockPools seem a bit better for performance than one Allocator with all blocks from all streams and pool ids in the same BlockPool.
4. cudaMallocAsync
CUDA 11.2 released a built-in fast asynchronous allocator (cudaMallocAsync). Some feature gaps unrelated to CUDA graphs prevent its immediate integration into Pytorch, but it should be upstreamable by 11.3. Also, in an upcoming CUDA release (likely 11.4, but we're not 100% sure) it will be capture-safe out of the box, without manual per-region pools.
Therefore, private memory pools are in some sense throwaway work: once cudaMallocAsync has been upstreamed (I expect we'll add it as an alternative backend to the current allocator), memory allocations in any region will be safe to capture, as long as they're satisfied by cudaMallocAsync under the hood.
However, it's not throwaway work if people want to choose between the cudaMallocAsync and current allocator backends in the future, and want to use graphs with both. They'll still need private pools to use the current allocator.
You could say cudaMallocAsync is an "implementation alternative", but it also removes the API need for manual pools around graph capture, and alleviates all of the Restrictions of the Strawman API above. Capturable cudaMallocAsync will be great once we have it, very nearly fire and forget, but we won't have it for several months.
5. ????
As I said earlier, CUDA graphs are the only major motivation for private pools I can think of right now. I'm not even sure private memory pools are the best approach to letting graphs interact safely with the allocator. Further suggestions are welcome.
Additional context
#48875 added primitive graph capture and replay bindings
#15623 another user requested cuda graph bindings
I'll post a strawman PR implementing the "Pitch". Writing it is not the hard part.
cc @ngimel