🚀 Feature
Allow users to specify regions where CUDA memory allocations are satisfied from a private pool.
Motivation
CUDA graph capture is our main motivation. But it seems like a handy thing in general; there may be other uses I don't anticipate.
CUDA graph capture performs a dry run of a region of execution, freezing all CUDA work (and virtual addresses used during that work) into a "graph." The graph may be "replayed" like a single giant kernel, with greatly reduced CPU overhead as well as modestly improved GPU performance.
Because capture bakes in memory addresses, the memory used during capture must be available for the graph to use during replay. But Pytorch's current caching allocator assigns and frees memory eagerly and dynamically, so when a graph is replayed, those memory addresses may be in use by other tensors. One way to guarantee a graph's baked in addresses are always safe to reuse is to satisfy allocation requests from a graph-private memory pool during capture.
Pitch
Strawman API
The simplest API that comes to mind is something like

```python
pool = torch.cuda.MemPool()  # MemPool would be a simple Python object, its only data member would be an integer uuid.
with torch.cuda.mempool(pool):
    # all tensors created in this region have their allocations satisfied from the private pool.
    # Capture a graph here
```
During capture, temporary internal allocations can be assigned, released back to the cache, and reassigned as usual, as long as the high-water mark of memory blocks used during capture
1. isn't in use by any other tensors, and
2. has survived without being cudaFreed
when the graph is later replayed.
Condition 1 can hold because the pool is private.
To ensure condition 2 (if the same private pool is also used for some later ops or captures), the pool could optionally be told to error on internal calls to cudaFree (which might mistakenly free addresses out from under graphs), i.e. pool.set_error_on_free().
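For concreteness, here is a minimal sketch of what the Python side of this strawman might look like. Everything below the torch namespace is hypothetical: _cuda_setActiveMemPool and _cuda_setPoolErrorOnFree stand in for whatever C++ hooks the allocator would actually expose.

```python
import itertools
from contextlib import contextmanager

import torch

_pool_ids = itertools.count()

class MemPool:
    # The only real state is an integer uuid identifying the private pool.
    def __init__(self):
        self.uuid = next(_pool_ids)

    def set_error_on_free(self):
        # Hypothetical hook: tell the allocator to raise instead of cudaFree-ing
        # blocks that belong to this pool.
        torch._C._cuda_setPoolErrorOnFree(self.uuid)  # hypothetical binding

@contextmanager
def mempool(pool):
    # Hypothetical hook: route all allocations in this region to `pool`.
    torch._C._cuda_setActiveMemPool(pool.uuid)  # hypothetical binding
    try:
        yield pool
    finally:
        torch._C._cuda_setActiveMemPool(None)  # back to the default pool
```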
The simplest implementation that comes to mind (serving the above API) would be a per-pool-id list of THCCachingAllocators (or DeviceCachingAllocators).
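Very loosely, the dispatch that implies looks like the Python sketch below; the real THCCachingAllocator/DeviceCachingAllocator are C++ classes, and the names here are only illustrative.

```python
class DeviceCachingAllocator:
    # Stand-in for the real per-device caching allocator.
    def malloc(self, size, stream):
        ...

DEFAULT_POOL = 0

# One caching allocator per (device, pool_id); pool_id 0 is the default pool,
# everything else is a private pool created via torch.cuda.MemPool().
allocators = {}

def raw_alloc(device, pool_id, size, stream):
    key = (device, pool_id)
    if key not in allocators:
        allocators[key] = DeviceCachingAllocator()
    return allocators[key].malloc(size, stream)
```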
@arslan-zulfiqar made some modifications to the current allocator in a local fork to implement region-private memory pools. We've used his build to run graph-captured BERT and Mask R-CNN, and it works, including for complex cases like running some segments of the model graphed and some segments eagerly (which is essential if some segments are uncapturable, for example, because they contain data-dependent control flow). So we've demonstrated that region-specific private pools can make it safe to capture with the current allocator.
Restrictions of the Strawman API
Avoiding memory corruption and race conditions during replay imposes some restrictions.
Two graphs captured with memory from the same pool should not be replayed concurrently in parallel streams.
If several graphs are captured with memory from the same pool, and some graphs use memory/results populated by earlier graphs:

```python
with torch.cuda.mempool(pool):
    # capture graphA
    # capture graphB, which uses (and frees) some tensors created in A
    # capture graphC, which can use any memory freed during B's capture. For fun let's say it also consumes some tensors created in A.
```
These are safe:

```python
graphA.replay()
graphB.replay()
graphC.replay()
```

and so is

```python
graphA.replay()
graphC.replay()
```
This isn't:

```python
graphA.replay()
graphC.replay()  # C has no problem with its own numerics, but may overwrite some of the memory A populated on behalf of B.
graphB.replay()  # Danger of bad numerics. Data expected from A may have been overwritten by C.
```
A conservative but general rule to ensure safe replay is that graphs captured with memory from the same pool must be replayed in the same order they were captured. Still, it's a lot for users to think about.
(Something to look forward to: memory allocated by cudaMallocAsync during graph capture is not vulnerable to any of the above gotchas; see 4. cudaMallocAsync below.)
Alternatives
1 (API alternative): reduce API surface by having graphs request the pool
Initiating region-private allocation behavior could be folded into torch.cuda._Graph.capture_begin(), so the user wouldn't need with torch.cuda.mempool(pool).
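With that folding, user code might shrink to something like the sketch below. torch.cuda._Graph and its capture_begin()/capture_end() are the primitive bindings mentioned under Additional context; the idea that capture_begin() also creates and activates a private pool is the hypothetical part.

```python
import torch

x = torch.ones(8, device="cuda")
g = torch.cuda._Graph()
s = torch.cuda.Stream()

with torch.cuda.stream(s):  # capture has to run on a non-default stream anyway
    g.capture_begin()       # hypothetically also creates/activates a private pool
    y = x * 2               # allocations made during capture land in that pool
    g.capture_end()

g.replay()                  # y's storage lives in the graph's private pool
```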
2 (API and implementation alternative): ability to request a unique stream
The current allocator silos allocations per-stream. So running a graph-capture on a side stream (which the capture API requires anyway) effectively gives you a private pool, as long as you can be sure nothing else uses the stream. Right now you don't have that certainty: Pytorch uses a pool of streams under the hood, so there's no guarantee your side stream won't alias another side stream requested elsewhere in the script. However, it wouldn't be hard to expose an API that constructs and returns a unique (not from the stream pool) stream that no other stream will alias. Then you could say
```python
stream = torch.cuda.Stream(unique=True)
with torch.cuda.stream(stream):
    # capture a graph here, memory is siloed for you
```
3 (Implementation alternative): Silo allocations by pool id within one Allocator object
Instead of a list of per-pool-id THCCachingAllocators (or DeviceCachingAllocators), we could silo allocations per pool id by minimally extending the current stream-siloing logic. Specifically, we could add

```cpp
if (a->pool_id != b->pool_id) {
  return a->pool_id < b->pool_id;
}
```

here to make pool_id the most significant field of the comparator used when finding a block suitable for a particular stream and pool id.
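In other words, free blocks would effectively be ordered by a key tuple whose leading element is the pool id, something like this Python rendering (field names are illustrative, not the real C++ members):

```python
def block_sort_key(block):
    # pool_id first, so a search constrained to one (stream, pool) can never
    # land on a block that belongs to a different private pool; the existing
    # stream/size ordering is preserved below it.
    return (block.pool_id, block.stream, block.size)
```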
I don't like this idea as much as lists of per-pool-id Allocators. BlockComparator is used in many BlockPool (std::set) lookups. Each lookup is log N in the number of blocks the pool contains, so several smaller, distinct Allocators with distinct BlockPools seem a bit better for performance than one Allocator with all blocks from all streams and pool ids in the same BlockPool.
4. cudaMallocAsync
CUDA 11.2 released a built-in fast asynchronous allocator (cudaMallocAsync). Some feature gaps unrelated to CUDA graphs prevent its immediate integration into Pytorch, but it should be upstreamable by 11.3. Also, in an upcoming CUDA release (likely 11.4, but we're not 100% sure) it will be capture-safe out of the box, without manual per-region pools.
Therefore, private memory pools are in some sense throwaway work: once cudaMallocAsync has been upstreamed (I expect we'll add it as an alternative backend to the current allocator), memory allocations in any region will be safe to capture, as long as they're satisfied by cudaMallocAsync under the hood.
However, it's not throwaway work if people want to choose between the cudaMallocAsync and current allocator backends in the future, and want to use graphs with both. They'll still need private pools to use the current allocator.
You could say cudaMallocAsync is an "implementation alternative", but it also removes the API need for manual pools around graph capture, and alleviates all of the Restrictions of the Strawman API above. Capturable cudaMallocAsync will be great once we have it, very nearly fire and forget, but we won't have it for several months.
5. ????
As I said earlier, CUDA graphs are the only major motivation for private pools I can think of right now. I'm not even sure private memory pools are the best approach to letting graphs interact safely with the allocator. Further suggestions are welcome.
Additional context
#48875 added primitive graph capture and replay bindings
#15623 another user requested cuda graph bindings
I'll post a strawman PR implementing the "Pitch". Writing it is not the hard part.
cc @ngimel