
Conversation

emcastillo
Collaborator

@emcastillo emcastillo commented Oct 12, 2022

Fixes #43144

This uses the Backend system added by #82682 to change allocators dynamically during code execution. It will allow us to use RMM, to use CUDA managed memory for portions of the code that do not fit in GPU memory, to write static memory allocators that reduce fragmentation while training models, and to improve interoperability with external DL compilers/libraries.

For example, we could have the following allocator in C++:

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
// A trivial allocator: the device and stream arguments are ignored here.
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void* ptr;
   std::cout << "alloc " << size << std::endl;
   cudaMalloc(&ptr, size);
   return ptr;
}

void my_free(void* ptr) {
   std::cout << "free" << std::endl;
   cudaFree(ptr);
}
}
```

Compile it as a shared library:

```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```

And use it from PyTorch as follows:

```python
import torch

# Do not initialize the caching allocator before swapping, i.e. don't run
# `b = torch.zeros(10, device='cuda')` yet.
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```

Things to discuss:

  • How to test this; it needs compiling external code ...

cc @jakirkham @ptrblck @albanD @ngimel @leofang @harrism @mcarilli

@pytorch-bot

pytorch-bot bot commented Oct 12, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86786


✅ No Failures

As of commit 96ee1dd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@vadimkantorov
Contributor

Maybe related: #1529 (comment)

@vadimkantorov
Contributor

Also, one option may be enabling tracing dynamically via this interface.

@zdevito
Contributor

zdevito commented Oct 12, 2022

I like the idea of having a virtual class interface to allocators. It is cleaner and more idiomatic than having a switch with x-macros. Having both is confusing, though. Is it possible to just replace the entire switch-statement approach with virtual classes for {CUDACachingAllocator, cudaMallocAsyncAllocator, something else}?
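For concreteness, roughly the shape I have in mind (a hedged sketch with made-up names, not the actual c10 classes):

```c++
#include <atomic>
#include <cstddef>
#include <cuda_runtime_api.h>

// Made-up names for illustration only.
struct CUDAAllocatorInterface {
  virtual ~CUDAAllocatorInterface() = default;
  virtual void* allocate(size_t size, int device, cudaStream_t stream) = 0;
  virtual void free(void* ptr) = 0;
};

// One pointer selects the active backend at runtime, replacing the
// compile-time switch/x-macro dispatch over {caching, mallocAsync, ...}.
std::atomic<CUDAAllocatorInterface*> g_active_allocator{nullptr};
```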

@harrism

harrism commented Oct 12, 2022

I'm really excited to see this finally happening! Since you mention RMM compatibility as a goal, I think that the deallocator interface should also take a stream. This would ensure compatibility not only with RMM memory resources, but also CUDA's own stream-ordered allocator (cudaMallocAsync()/cudaFreeAsync() -- blog post), and the upcoming cuda::memory_resource**.

Passing a stream to the deallocator is necessary for efficient stream-ordered reuse. The stream passed to deallocate tells the allocator on which stream a freed block is available for immediate reuse. Without providing the stream, a stream-ordered allocator must synchronize before reusing the block.

Second, ideally the deallocator interface should also take the size of the allocation. This is to match the std::pmr::memory_resource interface (and the upcoming cuda::mr::async_resource** interface). This is especially important for GPU allocators. CPU allocators typically store the size and other metadata in a small region within the allocation itself. But doing so in a device allocator requires reading the data back from device to host on every deallocation, which is expensive. If the user doesn't pass the size to the deallocator, then many implementations will need to store a map of allocated pointers to sizes and look the size up on each deallocation.

Third, ideally the deallocator would also take the alignment of the allocation, to match the std::pmr::memory_resource and cuda::mr** interface.

Implementations that don't need the stream and size on deallocation are of course free to ignore these parameters, but requiring them enables a lot of efficiency and flexibility in allocator implementations.

To summarize, I'm proposing changing this:

```c++
std::shared_ptr<CudaAllocator> createCustomAllocator(
    std::function<void*(size_t, int, cudaStream_t)> alloc_fn,
    std::function<void(void*)> free_fn)
```

to this:

```c++
std::shared_ptr<CudaAllocator> createCustomAllocator(
    std::function<void*(size_t /*size*/, size_t /*alignment*/, cudaStream_t)> alloc_fn,
    std::function<void(void*, size_t /*size*/, size_t /*alignment*/, cudaStream_t)> free_fn)
```
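For instance, a plugin written against the proposed free signature could forward straight to CUDA's stream-ordered allocator (a hedged sketch; alignment is ignored on the assumption that the default CUDA alignment suffices):

```c++
#include <cuda_runtime_api.h>
#include <cstddef>

extern "C" {
void* async_malloc(size_t size, size_t /*alignment*/, cudaStream_t stream) {
  void* ptr = nullptr;
  cudaMallocAsync(&ptr, size, stream);  // stream-ordered allocation
  return ptr;
}

void async_free(void* ptr, size_t /*size*/, size_t /*alignment*/, cudaStream_t stream) {
  // The stream tells the pool on which stream this block becomes reusable,
  // so no device-wide synchronization is needed before handing it out again.
  cudaFreeAsync(ptr, stream);
}
}
```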

Also, if PyTorch wants to enable non-asynchronous memory allocators, you may want to provide the option of registering allocators that do not take streams.

** cuda::mr::async_resource is an NVIDIA effort to provide a standards-based interface to device (and host) memory resources and asynchronous allocation/deallocation as part of libcudac++. I don't have a good public link to the spec, but there is an open PR. Here is the concept definition for cuda::mr::resource:

```c++
template <class Resource>
concept resource = equality_comparable<Resource>
                && requires(Resource& res, void* ptr, size_t size, size_t alignment) {
    { res.allocate(size, alignment) } -> same_as<void*>;
    { res.deallocate(ptr, size, alignment) } -> same_as<void>;
};
```

And cuda::mr::async_resource:

```c++
template <class Resource>
concept async_resource = resource<Resource>
                      && requires(Resource& res, void* ptr, size_t size, size_t alignment, cuda_stream_ref stream) {
    { res.allocate_async(size, alignment, stream) } -> same_as<void*>;
    { res.deallocate_async(ptr, size, alignment, stream) } -> same_as<void>;
};
```
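Any type with these members models the concept; for example, a toy non-pooling resource could look like this (a sketch only, not taken from any real cuda::mr implementation):

```c++
#include <cuda_runtime_api.h>
#include <cstddef>

// Toy resource: forwards directly to the CUDA runtime and treats all
// instances as interchangeable (hence the trivial operator==).
struct cuda_malloc_resource {
  void* allocate(std::size_t size, std::size_t /*alignment*/) {
    void* ptr = nullptr;
    cudaMalloc(&ptr, size);  // cudaMalloc's default alignment assumed sufficient
    return ptr;
  }
  void deallocate(void* ptr, std::size_t /*size*/, std::size_t /*alignment*/) {
    cudaFree(ptr);
  }
  bool operator==(const cuda_malloc_resource&) const { return true; }
};
```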

We also plan to update RMM to be compatible with cuda::mr.

@emcastillo
Collaborator Author

> I like the idea of having a virtual class interface to allocators. It is cleaner and more idiomatic than having a switch with x-macros. Having both is confusing, though. Is it possible to just replace the entire switch-statement approach with virtual classes for {CUDACachingAllocator, cudaMallocAsyncAllocator, something else}?

@zdevito Yeah, I fully agree with this. It seems that the switch-and-macros approach was designed to use the allocator statically and avoid undesired dynamic behavior such as virtual-function overhead. I think the overhead of moving everything to an interface with virtual methods should be negligible, but I am not sure whether that would be acceptable to the core maintainers given the direction that #82682 took.

@emcastillo emcastillo force-pushed the new-custom-allocator branch from e7c69f7 to 5501640 on October 18, 2022 07:25
@zdevito
Contributor

zdevito commented Oct 18, 2022

> I think the overhead of moving everything to an interface with virtual methods should be negligible, but I am not sure whether that would be acceptable to the core maintainers given the direction that #82682 took.

Understood. I am going to take a shot at moving the current code over to a virtual method approach in a way that won't sacrifice performance. Then it should be easy to integrate the pluggable allocators you have here on top of that patch.

@emcastillo emcastillo force-pushed the new-custom-allocator branch from 5501640 to 8ce8c71 on October 20, 2022 06:02
@emcastillo
Collaborator Author

@zdevito rebased on top of #87251 and it's much cleaner now :)

Contributor

@zdevito zdevito left a comment

This is much cleaner now. It makes it possible to switch the allocator via a Python API, so there is no need to do complicated linking against libtorch, which is a pretty useful thing. But it additionally requires setting the environment variable at runtime to opt into it, which is another source of friction, similar to having to link libtorch.

I was hoping it would be possible to just swap the existing allocator out when this Python API is called, rather than exposing the pluggable allocator through an environment option (or, equivalently, to wait to initialize the allocator until the first time it is required). For instance, we could allow the allocator to be swapped as long as no allocations have occurred yet. The benefit is that this would allow Python-only ways to configure the allocator (including choosing between native and cudaMallocAsync), and only the Module.cpp wrapper function would need to know about the pluggable allocator option.

@emcastillo emcastillo force-pushed the new-custom-allocator branch from 8ce8c71 to 1f1b69a on October 24, 2022 02:47
@emcastillo
Collaborator Author

emcastillo commented Oct 24, 2022

@zdevito Yeah, that design made no sense at all. I just moved all the pluggable machinery to Python and left c10 untouched. Now it looks much better and the design makes more sense.
If everyone is OK with this direction, I will finish the unimplemented methods and work on Mark's suggestions.

Contributor

@zdevito zdevito left a comment

Yeah, this is a good improvement. I have a couple of specific things inline to ensure correctness, but generally things now seem pretty self-contained, and the next steps really depend on what you have in mind for the first uses of such an API.

```c++
}

void changeCurrentAllocator(
    std::shared_ptr<c10::cuda::CUDACachingAllocator::CUDAAllocator> allocator) {
  c10::cuda::CUDACachingAllocator::allocator.store(allocator.get());
```
Contributor

If the allocator no longer has references in Python, then the object is going to be destructed but still have a dangling reference here. Maybe add a static variable that holds on to the shared pointer here, in addition to setting the allocator.

Also, it is unsafe to change the allocator after some allocations have happened, because a raw_alloc'd pointer might later get passed to the raw_free of a different allocator. Use of the raw_ APIs is rare, but they are used in the codebase. What to do depends on the use cases you have in mind for this API. Ideally, we'd get rid of the raw_deletes and have them use the DataPtr interface instead, so that even if the allocator is different at the point of free, the DataPtr still directs to the correct allocator.

A simpler choice is to try to figure out if we have allocated any memory yet, and just refuse to change the allocator if we have.

Collaborator Author

I went with your approach and forbade allocator changes after initialization. However, I would like to remove the raw_ APIs so that we can dynamically change allocators in different parts of the application (e.g., use managed memory in one place and then fall back to regular cudaMalloc elsewhere).

```c++
 protected:
  std::function<void*(size_t, int, cudaStream_t)> alloc_fn_;
  std::function<void(void*)> free_fn_;
  // TODO Populate all these functions
```
Contributor

I think for the first iteration of this, we can just support the minimal API that will make most of PyTorch work (and whatever parts you are interested in modifying). This can throw errors or return default values (a rough sketch follows below) for things like:

  • CUDA graphs, mempool stuff: just error; it's unlikely a custom allocator works with them unless built specifically for that.
  • Statistics, snapshotting: it is ok to have them return 0s; stats for other allocators are not the same anyway.
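For illustration, the stubbed-out pieces could look roughly like this (hypothetical method names only, not the real CUDAAllocator virtual interface):

```c++
#include <c10/util/Exception.h>
#include <cstdint>

// Hypothetical sketch only: names and shapes are placeholders.
struct PluggableAllocatorSketch {
  // CUDA graph / memory pool hooks: refuse loudly.
  void notifyCaptureBegin() {
    TORCH_CHECK(false, "CUDA graphs are not supported by this pluggable allocator");
  }
  // Statistics / snapshotting: return neutral defaults instead of failing.
  struct Stats { int64_t allocated_bytes = 0; int64_t reserved_bytes = 0; };
  Stats getDeviceStats(int /*device*/) { return Stats{}; }
};
```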

Collaborator Author

Added support for plugging in these functions (the RAPIDS folks are probably interested), but left it as an undocumented feature for now.

@emcastillo emcastillo force-pushed the new-custom-allocator branch from 1f1b69a to 6e70ef9 on November 8, 2022 06:43
@emcastillo emcastillo changed the title from "[WIP] Add Pluggable CUDA allocator backend" to "Add Pluggable CUDA allocator backend" on Nov 9, 2022
@emcastillo
Collaborator Author

This should be ready to review!
@zdevito @ngimel @albanD @harrism PTAL :)

#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>
// Compile with nvcc alloc.cc -o alloc.so -shared --compiler-options '-fPIC

Or with g++ alloc.cc -o alloc.so -I/usr/local/cuda/include -shared -fPIC (There's no CUDA device code in this example, so NVCC is not required)

Also, missing a ':

Suggested change:

```diff
- // Compile with nvcc alloc.cc -o alloc.so -shared --compiler-options '-fPIC
+ // Compile with nvcc alloc.cc -o alloc.so -shared --compiler-options '-fPIC'
```

Collaborator Author

Yeah, I was too lazy to write the -I flag :)

@vadimkantorov
Contributor

vadimkantorov commented Nov 9, 2022

A naive question: would it be useful for the allocator methods to accept a dummy int64 correlation id, so that it's easier to do tracing / stack analysis / operation grouping? Or are these tasks well solved by stack-sample analysis alone, without resorting to such hacks for operation tracking?

Some use cases for customized allocators: tracing; pre-allocation; arena allocation/deallocation.
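For example, the arena case could be served by a plugin as simple as this (a deliberately simplistic sketch using the same extern "C" signature as the example in the PR description; hypothetical names, no per-block reuse, everything is released only when the process exits):

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <atomic>
#include <cstddef>

// One big slab, bump-allocated; frees are no-ops.
namespace {
void* g_base = nullptr;
std::atomic<size_t> g_offset{0};
constexpr size_t kArenaBytes = size_t{1} << 30;  // 1 GiB slab (arbitrary)
}

extern "C" {
void* arena_malloc(ssize_t size, int /*device*/, cudaStream_t /*stream*/) {
  if (g_base == nullptr) {          // note: not thread-safe; illustration only
    cudaMalloc(&g_base, kArenaBytes);
  }
  size_t aligned = (static_cast<size_t>(size) + 255) & ~size_t{255};
  size_t offset = g_offset.fetch_add(aligned);
  if (offset + aligned > kArenaBytes) {
    return nullptr;                 // arena exhausted
  }
  return static_cast<char*>(g_base) + offset;
}

void arena_free(void* /*ptr*/) {
  // Bump arenas release everything at once, so individual frees do nothing.
}
}
```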

@emcastillo
Collaborator Author

Sorry @zdevito, I think we may need to move the CUDAPluggableAllocator definition from torch/csrc/cuda to c10/cuda for the hipify machinery to work correctly.

@albanD albanD added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Nov 10, 2022
Contributor

@zdevito zdevito left a comment

This looks good -- I only have a couple of minor nits inline about naming and documentation. I saw the comment about hipify, but I am not an expert on how it works, so I am not sure what is happening there. It is likely fixable, so if you send more details I can forward them to someone who understands it better.

```c++
void changeCurrentAllocator(
    std::shared_ptr<c10::cuda::CUDACachingAllocator::CUDAAllocator> allocator) {
  TORCH_CHECK(
      !getCurrentAllocator()->initialized(),
```
Contributor

There is a race here where we can still end up with

  • Thread 1: getCurrentAllocator()
  • Thread 2: allocate some memory with raw_alloc
  • Thread 1: store new allocator
  • Thread 2: raw_delete (calls wrong one and crashes)

However, I don't think this scenario is common. We should document that changing the allocator after allocations have occurred may result in undefined behavior. Longer term, the best resolution is to remove the raw_alloc/raw_delete interfaces.

Collaborator Author

The pluggable allocator keeps a record of active memory allocations, so it will hard-error when it receives a pointer that it hasn't allocated. This record is needed to pass the free function the size and stream of the allocation, as requested by the NVIDIA folks.
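Roughly this kind of bookkeeping (a simplified sketch, not the exact code in this PR):

```c++
#include <cuda_runtime_api.h>
#include <cstddef>
#include <mutex>
#include <stdexcept>
#include <unordered_map>

// Track every live allocation so size/stream can be forwarded to free_fn.
struct AllocationInfo {
  size_t size;
  int device;
  cudaStream_t stream;
};

std::mutex g_mutex;
std::unordered_map<void*, AllocationInfo> g_live_allocations;

void record_allocation(void* ptr, size_t size, int device, cudaStream_t stream) {
  std::lock_guard<std::mutex> lock(g_mutex);
  g_live_allocations[ptr] = {size, device, stream};
}

AllocationInfo pop_allocation(void* ptr) {
  std::lock_guard<std::mutex> lock(g_mutex);
  auto it = g_live_allocations.find(ptr);
  if (it == g_live_allocations.end()) {
    throw std::runtime_error("pointer was not allocated by this allocator");
  }
  AllocationInfo info = it->second;
  g_live_allocations.erase(it);
  return info;  // size/stream can now be passed to the user's free_fn
}
```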

@emcastillo emcastillo force-pushed the new-custom-allocator branch 2 times, most recently from dabf810 to 7a16c13 on November 11, 2022 07:49
@emcastillo
Collaborator Author

I think all should be green and ready now, thanks!

@emcastillo
Collaborator Author

@zdevito how does this look to you now? :)

Collaborator

@albanD albanD left a comment

Very exciting. I'm sure there are small incremental improvements we can add moving forward, but this is really good already. I didn't look into the allocator itself in great detail because Zach did.

```python
import torch

# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
```
Collaborator

nit: could we have an example here that uses torch.utils.cpp_extension.load_inline()? That might be simpler for users.
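Something along these lines, perhaps (an untested sketch; it assumes that load_inline with is_python_module=False writes the shared library as <name>.so into the given build_directory, and that the extern "C" symbols stay visible to dlopen/dlsym -- both are assumptions, not verified behavior):

```python
import os
import torch
from torch.utils import cpp_extension

alloc_source = r"""
#include <sys/types.h>
#include <cuda_runtime_api.h>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
  void* ptr = nullptr;
  cudaMalloc(&ptr, size);
  return ptr;
}
void my_free(void* ptr) { cudaFree(ptr); }
}
"""

build_dir = os.path.abspath("alloc_build")
os.makedirs(build_dir, exist_ok=True)

# Build the source into a plain shared library (no Python bindings).
cpp_extension.load_inline(
    name="my_alloc",
    cpp_sources=alloc_source,
    with_cuda=True,           # add CUDA include paths and link cudart
    is_python_module=False,   # just build and load a shared library
    build_directory=build_dir,
)

# Assumed location of the built library: <build_directory>/<name>.so
so_path = os.path.join(build_dir, "my_alloc.so")

new_alloc = torch.cuda.memory.CUDAPluggableAllocator(so_path, "my_malloc", "my_free")
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device="cuda")
```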

@emcastillo
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 23, 2022
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

jeanschmidt added a commit that referenced this pull request Nov 28, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022