Add Pluggable CUDA allocator backend #86786
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86786. Note: links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 96ee1dd. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Maybe related: #1529 (comment)
Also, one option may be enabling tracing dynamically via this interface.
I like the idea of having a virtual class interface to allocators. It is cleaner and more idiomatic than having a switch with x-macros. Having both is confusing though. Is it possible to just replace the entire switch statement with virtual classes for {CUDACachingAllocator, cudaMallocAsyncAllocator, something else}?
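For illustration only, a rough sketch of what such a virtual allocator interface could look like (names here are hypothetical, not the actual PR code):

```c++
#include <cstddef>
#include <cuda_runtime_api.h>

// Hypothetical sketch: a common virtual base that each backend implements,
// replacing the switch/x-macro dispatch between allocator implementations.
struct CUDAAllocatorInterface {
  virtual ~CUDAAllocatorInterface() = default;
  virtual void* raw_alloc(size_t nbytes) = 0;
  virtual void raw_delete(void* ptr) = 0;
};

// One possible backend, backed directly by cudaMalloc/cudaFree.
struct NaiveCudaAllocator : CUDAAllocatorInterface {
  void* raw_alloc(size_t nbytes) override {
    void* ptr = nullptr;
    cudaMalloc(&ptr, nbytes);
    return ptr;
  }
  void raw_delete(void* ptr) override {
    cudaFree(ptr);
  }
};
```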
I'm really excited to see this finally happening! Since you mention RMM compatibility as a goal, I think that the deallocator interface should also take a stream. This would ensure compatibility not only with RMM memory resources, but also with CUDA's own stream-ordered allocator (`cudaMallocAsync`/`cudaFreeAsync`). Passing a stream to the deallocator is necessary for efficient stream-ordered reuse: the stream passed to deallocate tells the allocator on which stream a freed block is available for immediate reuse. Without the stream, a stream-ordered allocator must synchronize before reusing the block.

Second, ideally the deallocator interface should also take the size of the allocation, to match deallocation interfaces (such as RMM's) that receive the size. Third, ideally the deallocator would also take the alignment of the allocation.

Implementations that don't need the stream and size on deallocation are of course free to ignore these parameters, but requiring them enables a lot of efficiency and flexibility in allocator implementations. To summarize, I'm proposing changing the current alloc/free function types to something like the sketch below.
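(A hedged reconstruction of the proposed change, based on the `alloc_fn_`/`free_fn_` types shown later in this PR; the exact proposed signatures are illustrative, not quoted:)

```c++
#include <cstddef>
#include <functional>
#include <cuda_runtime_api.h>

// Current function types, as they appear elsewhere in this PR:
std::function<void*(size_t size, int device, cudaStream_t stream)> alloc_fn_;
std::function<void(void* ptr)> free_fn_;

// Proposed: the deallocator also receives the size and stream of the allocation
// (and ideally its alignment), so a stream-ordered allocator can reuse the block
// on that stream without synchronizing.
std::function<void(void* ptr, size_t size, cudaStream_t stream)> proposed_free_fn_;
```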
Also, if PyTorch wants to enable non-asynchronous memory allocators, you may want to provide the option of registering allocators that do not take streams.
@zdevito Yeah, I fully agree with this. It seems that the switch & macros approach was designed to choose the allocator statically and to eliminate possibly undesired dynamic behavior such as virtual function call overhead. I think that the overhead of moving everything to an interface with virtual methods should be negligible, but I am not sure if this would be acceptable to the core maintainers given the direction that #82682 took.
Force-pushed from e7c69f7 to 5501640
Understood. I am going to take a shot at moving the current code over to a virtual method approach in a way that won't sacrifice performance. Then it should be easy to integrate the pluggable allocators you have here on top of that patch.
Force-pushed from 5501640 to 8ce8c71
This is much cleaner now. This makes it possible to switch the allocator via a Python API, so there is no need to do complicated linking against libtorch, which is pretty useful. But it additionally requires setting an environment variable at runtime to opt in, which is another source of friction, similar to having to link against libtorch. I was hoping it would be possible to just swap the existing allocator out when this Python API is called, rather than gating the pluggable allocator behind an environment option (or, equivalently, waiting to initialize the allocator until the first time it is required). For instance, we could allow the allocator to be swapped as long as no allocations have yet occurred. The benefit would be that it would allow Python-only ways to configure the allocator (including between native and cudaMallocAsync), and only the Module.cpp wrapper function would need to know about the pluggable allocator option.
Force-pushed from 8ce8c71 to 1f1b69a
@zdevito, yeah, that design made no sense at all. I just moved all the pluggable machinery to Python and left c10 untouched. Now it looks much better and the design makes more sense.
Yeah, this is a good improvement. I have a couple specific things inline to ensure correctness, but generally things now seem pretty self-contained, and the next steps really depend on what you have in mind for the first uses of such an API.
```c++
void changeCurrentAllocator(std::shared_ptr<c10::cuda::CUDACachingAllocator::CUDAAllocator> allocator) {
  c10::cuda::CUDACachingAllocator::allocator.store(allocator.get());
```
If the allocator no longer has references in Python, then the object is going to be destructed but still have a dangling reference here. Maybe add a static variable that holds on to the shared pointer here, in addition to setting the allocator.

Also, it is unsafe to change the allocator after some allocations have happened, because a `raw_alloc`'d pointer might later get passed to the `raw_free` of a different allocator. Use of the `raw_` APIs is rare, but they are used in the codebase. What to do depends more on the use cases you have in mind for this API. Ideally, we'd try to get rid of the raw_deletes and instead have them use the DataPtr interface, so that even if the allocator is different at the point of free, the DataPtr still directs to the correct allocator.
A simpler choice is to try to figure out if we have allocated any memory yet, and just refuse to change the allocator if we have.
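A minimal sketch of the static-holder suggestion (not the PR's actual code; the `allocator` atomic and `CUDAAllocator` type are taken from the snippet above):

```c++
#include <memory>
#include <c10/cuda/CUDACachingAllocator.h>

// Sketch: keep the Python-provided allocator alive by stashing the shared_ptr
// in a static, so the raw pointer installed below cannot dangle.
void changeCurrentAllocator(
    std::shared_ptr<c10::cuda::CUDACachingAllocator::CUDAAllocator> allocator) {
  static std::shared_ptr<c10::cuda::CUDACachingAllocator::CUDAAllocator> holder;
  holder = std::move(allocator);  // extends lifetime beyond the Python reference
  c10::cuda::CUDACachingAllocator::allocator.store(holder.get());
}
```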
I went with your approach and forbade allocator changes after initialization. However, I would like to remove the `raw_` APIs eventually so we can dynamically change allocators in different parts of the application (e.g. use managed memory in one place and then fall back to regular cudaMalloc elsewhere).
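For instance, a managed-memory variant of this PR's example allocator might look like the following (a sketch, not part of the PR):

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>

// Sketch: same extern "C" shape as the PR's example allocator, but backed by
// cudaMallocManaged so allocations can be oversubscribed beyond GPU memory.
extern "C" {
void* managed_malloc(ssize_t size, int device, cudaStream_t stream) {
  void* ptr = nullptr;
  cudaMallocManaged(&ptr, size);
  return ptr;
}

void managed_free(void* ptr) {
  cudaFree(ptr);
}
}
```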
```c++
 protected:
  std::function<void*(size_t, int, cudaStream_t)> alloc_fn_;
  std::function<void(void*)> free_fn_;
  // TODO Populate all these functions
```
I think for the first iteration of this, we can just support the minimal API that will make most of PyTorch work (and whatever parts you are interested in modifying). This can throw errors or return default values for things like:
- CUDA graphs, mempool stuff: just error; it's unlikely a custom allocator works with them unless it was built specifically for that.
- Statistics, snapshotting: it is OK to have them return 0s; stats for other allocators are not the same anyway.
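For illustration, a hedged sketch of that stub behavior (the method names here are hypothetical, not necessarily the real allocator interface):

```c++
#include <cstdint>
#include <c10/util/Exception.h>

// Hypothetical stubs for a first-cut pluggable allocator: graph/pool entry
// points hard-error, while statistics simply report zeros.
struct PluggableAllocatorStubs {
  void attachToCudaGraphPool() {  // hypothetical CUDA graphs / mempool hook
    TORCH_CHECK(
        false, "CUDAPluggableAllocator does not support CUDA graphs / memory pools");
  }
  uint64_t currentAllocatedBytes() const {  // hypothetical statistics hook
    return 0;  // stats are allocator-specific, so zeros are acceptable here
  }
};
```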
Added support for providing these functions (the RAPIDS folks are probably interested) but left it as an undocumented feature for now.
Force-pushed from 1f1b69a to 6e70ef9
docs/source/notes/cuda.rst
```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>
// Compile with nvcc alloc.cc -o alloc.so -shared --compiler-options '-fPIC
```
Or with `g++ alloc.cc -o alloc.so -I/usr/local/cuda/include -shared -fPIC` (there's no CUDA device code in this example, so nvcc is not required).

Also, this is missing a `'`:
```diff
- // Compile with nvcc alloc.cc -o alloc.so -shared --compiler-options '-fPIC
+ // Compile with nvcc alloc.cc -o alloc.so -shared --compiler-options '-fPIC'
```
Yeah, I was too lazy to write the `-I` flag :)
A naive question: would it be useful for allocator methods to accept a dummy int64 correlation id, so that it's easier to do tracing / stack analysis / operation grouping? Or are these tasks well solved by stack-sample analysis alone, without resorting to such hacks for operation tracking? Some use cases for customized allocators: tracing, pre-allocation, arena allocation/deallocation.
Force-pushed from edf9a3d to ab790d3
sorry @zdevito, I think we may need to move the …
This looks good -- I only have a couple minor nits inline about naming or documentation. I saw the comment about hipify but I am not an expert on how it works, so I am not sure what is happening there. It is likely fixable so if you send more details I can forward it to someone who understands more.
```c++
void changeCurrentAllocator(
    std::shared_ptr<c10::cuda::CUDACachingAllocator::CUDAAllocator> allocator) {
  TORCH_CHECK(
      !getCurrentAllocator()->initialized(),
```
There is a race here where we can still end up with:
- Thread 1: getCurrentAllocator()
- Thread 2: allocate some memory with raw_alloc
- Thread 1: store new allocator
- Thread 2: raw_delete (calls wrong one and crashes)
However, I don't think this scenario is common. We should document that changing the allocator after allocations have occurred may result in undefined behavior. The best resolution to this longer term is to remove the raw_alloc/raw_delete interfaces.
The pluggable allocator holds a history of active memory allocations, so it will hard-error when it receives a pointer that it hasn't allocated. This list is needed to pass the `free` function the size and stream of the allocation, as requested by the NVIDIA folks.
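A minimal sketch of that kind of bookkeeping, with assumed names rather than the PR's actual data structures:

```c++
#include <cstddef>
#include <mutex>
#include <unordered_map>
#include <cuda_runtime_api.h>
#include <c10/util/Exception.h>

// Sketch: track size and stream for every live allocation so free() can be
// called with them, and hard-error on pointers this allocator never returned.
struct AllocationRecord {
  size_t size;
  cudaStream_t stream;
};

static std::mutex records_mutex;
static std::unordered_map<void*, AllocationRecord> live_allocations;

void record_allocation(void* ptr, size_t size, cudaStream_t stream) {
  std::lock_guard<std::mutex> guard(records_mutex);
  live_allocations[ptr] = {size, stream};
}

AllocationRecord take_allocation(void* ptr) {
  std::lock_guard<std::mutex> guard(records_mutex);
  auto it = live_allocations.find(ptr);
  TORCH_CHECK(it != live_allocations.end(), "pointer was not allocated by this allocator");
  AllocationRecord record = it->second;
  live_allocations.erase(it);
  return record;
}
```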
Force-pushed from dabf810 to 7a16c13
I think all should be green and ready now, thanks!
@zdevito how does this look to you now? :)
Very exciting. I'm sure there are small incremental improvements we can add moving forward, but this is really good already. I didn't look into the allocator itself in great detail because Zach did.
```python
import torch

# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
```
nit: could we have an example here that uses `torch.utils.cpp_extension.load_inline()`? That might be simpler for users.
Force-pushed from 7a16c13 to dfc538b
Force-pushed from dfc538b to 96ee1dd
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes pytorch#43144

This uses the Backend system added by [82682](pytorch#82682) to change allocators dynamically during code execution. This will allow us to use RMM, use CUDA managed memory for portions of the code that do not fit in GPU memory, write static memory allocators to reduce fragmentation while training models, and improve interoperability with external DL compilers/libraries.

For example, we could have the following allocator in C++:

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
  void* ptr;
  std::cout << "alloc " << size << std::endl;
  cudaMalloc(&ptr, size);
  return ptr;
}

void my_free(void* ptr) {
  std::cout << "free" << std::endl;
  cudaFree(ptr);
}
}
```

Compile it as a shared library:

```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```

And use it from PyTorch as follows:

```python
import torch

# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```

Things to discuss:
- How to test this, needs compiling external code ...

Pull Request resolved: pytorch#86786
Approved by: https://github.com/albanD
…ol initialized()` (pytorch#89687)
cc @jakirkham @ptrblck @albanD @ngimel @leofang @harrism @mcarilli