
Conversation

emcastillo
Collaborator

@emcastillo emcastillo commented Oct 12, 2022

Fixes #43144

This uses the Backend system added by #82682 to change allocators dynamically during code execution. It will allow us to use RMM, to use CUDA managed memory for portions of the code that do not fit in GPU memory, to write static memory allocators that reduce fragmentation while training models, and to improve interoperability with external DL compilers/libraries.

For example, we could have the following allocator in C++:

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
// A trivial allocator: the device and stream arguments are ignored here.
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void* ptr;
   std::cout << "alloc " << size << std::endl;
   cudaMalloc(&ptr, size);
   return ptr;
}

void my_free(void* ptr) {
   std::cout << "free" << std::endl;
   cudaFree(ptr);
}
}
```

Compile it as a shared library:

```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```

And use it from PyTorch as follows:

```python
import torch

# Do not initialize the caching allocator before swapping, i.e. don't run
# `b = torch.zeros(10, device='cuda')` yet.
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```

Things to discuss:

  • How to test this; it needs compiling external code ...

cc @jakirkham @ptrblck @albanD @ngimel @leofang @harrism @mcarilli

@pytorch-bot

pytorch-bot bot commented Oct 12, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86786


✅ No Failures

As of commit 96ee1dd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@vadimkantorov
Contributor

Maybe related: #1529 (comment)

@vadimkantorov
Contributor

Also, one option may be enabling tracing dynamically via this interface.

@zdevito
Contributor

zdevito commented Oct 12, 2022

I like the idea of having a virtual class interface to allocators. It is cleaner and more idiomatic than having a switch with x-macros. Having both is confusing, though. Is it possible to just replace the entire switch-statement approach with virtual classes for {CUDACachingAllocator, cudaMallocAsyncAllocator, something else}?
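For concreteness, roughly the shape I have in mind (a hedged sketch with made-up names, not the actual c10 classes):

```c++
#include <atomic>
#include <cstddef>
#include <cuda_runtime_api.h>

// Made-up names for illustration only.
struct CUDAAllocatorInterface {
  virtual ~CUDAAllocatorInterface() = default;
  virtual void* allocate(size_t size, int device, cudaStream_t stream) = 0;
  virtual void free(void* ptr) = 0;
};

// One pointer selects the active backend at runtime, replacing the
// compile-time switch/x-macro dispatch over {caching, mallocAsync, ...}.
std::atomic<CUDAAllocatorInterface*> g_active_allocator{nullptr};
```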

@harrism

harrism commented Oct 12, 2022

I'm really excited to see this finally happening! Since you mention RMM compatibility as a goal, I think that the deallocator interface should also take a stream. This would ensure compatibility not only with RMM memory resources, but also CUDA's own stream-ordered allocator (cudaMallocAsync()/cudaFreeAsync() -- blog post), and the upcoming cuda::memory_resource**.

Passing a stream to the deallocator is necessary for efficient stream-ordered reuse. The stream passed to deallocate tells the allocator on which stream a freed block is available for immediate reuse. Without providing the stream, a stream-ordered allocator must synchronize before reusing the block.

Second, ideally the deallocator interface should also take the size of the allocation. This is to match the std::pmr::memory_resource interface (and the upcoming cuda::mr::async_resource** interface). This is especially important for GPU allocators. CPU allocators typically store the size and other metadata in a small region within the allocation itself. But doing so in a device allocator requires reading the data back from device to host on every deallocation, which is expensive. If the user doesn't pass the size to the deallocator, then many implementations will need to store a map of allocated pointers to sizes and look the size up on each deallocation.

Third, ideally the deallocator would also take the alignment of the allocation, to match the std::pmr::memory_resource and cuda::mr** interface.

Implementations that don't need the stream and size on deallocation are of course free to ignore these parameters, but requiring them enables a lot of efficiency and flexibility in allocator implementations.

To summarize, I'm proposing changing this:

```c++
std::shared_ptr<CudaAllocator> createCustomAllocator(
    std::function<void*(size_t, int, cudaStream_t)> alloc_fn,
    std::function<void(void*)> free_fn)
```

to this:

```c++
std::shared_ptr<CudaAllocator> createCustomAllocator(
    std::function<void*(size_t /*size*/, size_t /*alignment*/, cudaStream_t)> alloc_fn,
    std::function<void(void*, size_t /*size*/, size_t /*alignment*/, cudaStream_t)> free_fn)
```
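For instance, a plugin written against the proposed free signature could forward straight to CUDA's stream-ordered allocator (a hedged sketch; alignment is ignored on the assumption that the default CUDA alignment suffices):

```c++
#include <cuda_runtime_api.h>
#include <cstddef>

extern "C" {
void* async_malloc(size_t size, size_t /*alignment*/, cudaStream_t stream) {
  void* ptr = nullptr;
  cudaMallocAsync(&ptr, size, stream);  // stream-ordered allocation
  return ptr;
}

void async_free(void* ptr, size_t /*size*/, size_t /*alignment*/, cudaStream_t stream) {
  // The stream tells the pool on which stream this block becomes reusable,
  // so no device-wide synchronization is needed before handing it out again.
  cudaFreeAsync(ptr, stream);
}
}
```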

Also, if PyTorch wants to enable non-asynchronous memory allocators, you may want to provide the option of registering allocators that do not take streams.

** cuda::mr::async_resource is an NVIDIA effort to provide a standards-based interface to device (and host) memory resources and asynchronous allocation/deallocation as part of libcudac++. I don't have a good public link to the spec, but there is an open PR. Here is the concept definition for cuda::mr::resource:

```c++
template <class Resource>
concept resource = equality_comparable<Resource>
                && requires(Resource& res, void* ptr, size_t size, size_t alignment) {
    { res.allocate(size, alignment) } -> same_as<void*>;
    { res.deallocate(ptr, size, alignment) } -> same_as<void>;
};
```

And cuda::mr::async_resource:

```c++
template <class Resource>
concept async_resource = resource<Resource>
                      && requires(Resource& res, void* ptr, size_t size, size_t alignment, cuda_stream_ref stream) {
    { res.allocate_async(size, alignment, stream) } -> same_as<void*>;
    { res.deallocate_async(ptr, size, alignment, stream) } -> same_as<void>;
};
```
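Any type with these members models the concept; for example, a toy non-pooling resource could look like this (a sketch only, not taken from any real cuda::mr implementation):

```c++
#include <cuda_runtime_api.h>
#include <cstddef>

// Toy resource: forwards directly to the CUDA runtime and treats all
// instances as interchangeable (hence the trivial operator==).
struct cuda_malloc_resource {
  void* allocate(std::size_t size, std::size_t /*alignment*/) {
    void* ptr = nullptr;
    cudaMalloc(&ptr, size);  // cudaMalloc's default alignment assumed sufficient
    return ptr;
  }
  void deallocate(void* ptr, std::size_t /*size*/, std::size_t /*alignment*/) {
    cudaFree(ptr);
  }
  bool operator==(const cuda_malloc_resource&) const { return true; }
};
```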

We also plan to update RMM to be compatible with cuda::mr.

@emcastillo
Collaborator Author

> I like the idea of having a virtual class interface to allocators. It is cleaner and more idiomatic than having a switch with x-macros. Having both is confusing, though. Is it possible to just replace the entire switch-statement approach with virtual classes for {CUDACachingAllocator, cudaMallocAsyncAllocator, something else}?

@zdevito Yeah, I fully agree with this. It seems that the switch-and-macros approach was designed to use the allocator statically and avoid undesired dynamic behavior such as virtual-function overhead. I think the overhead of moving everything to an interface with virtual methods should be negligible, but I am not sure whether that would be acceptable to the core maintainers given the direction that #82682 took.

@emcastillo emcastillo force-pushed the new-custom-allocator branch from e7c69f7 to 5501640 on October 18, 2022 07:25
@zdevito
Contributor

zdevito commented Oct 18, 2022

> I think the overhead of moving everything to an interface with virtual methods should be negligible, but I am not sure whether that would be acceptable to the core maintainers given the direction that #82682 took.

Understood. I am going to take a shot at moving the current code over to a virtual method approach in a way that won't sacrifice performance. Then it should be easy to integrate the pluggable allocators you have here on top of that patch.

@emcastillo emcastillo force-pushed the new-custom-allocator branch from 5501640 to 8ce8c71 on October 20, 2022 06:02
@emcastillo
Collaborator Author

@zdevito rebased on top of #87251 and it's much cleaner now :)

Contributor

@zdevito zdevito left a comment

This is much cleaner now. It makes it possible to switch the allocator via a Python API, so there is no need to do complicated linking against libtorch, which is a pretty useful thing. But it additionally requires setting the environment variable at runtime to opt into it, which is another source of friction, similar to having to link libtorch.

I was hoping it would be possible to just swap the existing allocator out when this Python API is called, rather than exposing the pluggable allocator through an environment option (or, equivalently, to wait to initialize the allocator until the first time it is required). For instance, we could allow the allocator to be swapped as long as no allocations have occurred yet. The benefit is that this would allow Python-only ways to configure the allocator (including choosing between native and cudaMallocAsync), and only the Module.cpp wrapper function would need to know about the pluggable allocator option.

@emcastillo emcastillo force-pushed the new-custom-allocator branch from 8ce8c71 to 1f1b69a on October 24, 2022 02:47
@emcastillo
Collaborator Author

emcastillo commented Oct 24, 2022

@zdevito Yeah, that design made no sense at all. I just moved all the pluggable machinery to Python and left c10 untouched. Now it looks much better and the design makes more sense.
If everyone is OK with this direction, I will finish the unimplemented methods and work on Mark's suggestions.

Contributor

@zdevito zdevito left a comment

Yeah, this is a good improvement. I have a couple of specific things inline to ensure correctness, but generally things now seem pretty self-contained, and the next steps really depend on what you have in mind for the first uses of such an API.

```c++
}

void changeCurrentAllocator(
    std::shared_ptr<c10::cuda::CUDACachingAllocator::CUDAAllocator> allocator) {
  c10::cuda::CUDACachingAllocator::allocator.store(allocator.get());
```
Contributor

If the allocator no longer has references in Python, then the object is going to be destructed but still have a dangling reference here. Maybe add a static variable that holds on to the shared pointer here, in addition to setting the allocator.

Also, it is unsafe to change the allocator after some allocations have happened, because a raw_alloc'd pointer might later get passed to the raw_free of a different allocator. Use of the raw_ APIs is rare, but they are used in the codebase. What to do depends on the use cases you have in mind for this API. Ideally, we'd get rid of the raw_deletes and have them use the DataPtr interface instead, so that even if the allocator is different at the point of free, the DataPtr still directs to the correct allocator.

A simpler choice is to try to figure out if we have allocated any memory yet, and just refuse to change the allocator if we have.

Collaborator Author

I went with your approach and forbade allocator changes after initialization. However, I would like to remove the raw_ APIs so that we can dynamically change allocators in different parts of the application (e.g., use managed memory in one place and then fall back to regular cudaMalloc elsewhere).

```c++
 protected:
  std::function<void*(size_t, int, cudaStream_t)> alloc_fn_;
  std::function<void(void*)> free_fn_;
  // TODO Populate all these functions
```
Contributor

I think for the first iteration of this, we can just support the minimal API that will make most of PyTorch work (and whatever parts you are interested in modifying). This can throw errors or return default values (a rough sketch follows below) for things like:

  • CUDA graphs, mempool stuff: just error; it's unlikely a custom allocator works with them unless built specifically for that.
  • Statistics, snapshotting: it is ok to have them return 0s; stats for other allocators are not the same anyway.
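For illustration, the stubbed-out pieces could look roughly like this (hypothetical method names only, not the real CUDAAllocator virtual interface):

```c++
#include <c10/util/Exception.h>
#include <cstdint>

// Hypothetical sketch only: names and shapes are placeholders.
struct PluggableAllocatorSketch {
  // CUDA graph / memory pool hooks: refuse loudly.
  void notifyCaptureBegin() {
    TORCH_CHECK(false, "CUDA graphs are not supported by this pluggable allocator");
  }
  // Statistics / snapshotting: return neutral defaults instead of failing.
  struct Stats { int64_t allocated_bytes = 0; int64_t reserved_bytes = 0; };
  Stats getDeviceStats(int /*device*/) { return Stats{}; }
};
```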

Collaborator Author

Added support for plugging in these functions (the RAPIDS folks are probably interested), but left it as an undocumented feature for now.

@emcastillo emcastillo force-pushed the new-custom-allocator branch from 1f1b69a to 6e70ef9 on November 8, 2022 06:43
@emcastillo emcastillo changed the title from "[WIP] Add Pluggable CUDA allocator backend" to "Add Pluggable CUDA allocator backend" on Nov 9, 2022
@emcastillo
Collaborator Author

This should be ready to review!
@zdevito @ngimel @albanD @harrism PTAL :)

#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>
// Compile with nvcc alloc.cc -o alloc.so -shared --compiler-options '-fPIC

Or with g++ alloc.cc -o alloc.so -I/usr/local/cuda/include -shared -fPIC (There's no CUDA device code in this example, so NVCC is not required)

Also, missing a ':

Suggested change:

```diff
- // Compile with nvcc alloc.cc -o alloc.so -shared --compiler-options '-fPIC
+ // Compile with nvcc alloc.cc -o alloc.so -shared --compiler-options '-fPIC'
```

Collaborator Author

Yeah, I was too lazy to write the -I flag :)

@vadimkantorov
Contributor

vadimkantorov commented Nov 9, 2022

A naive question: would it be useful for the allocator methods to accept a dummy int64 correlation id, so that it's easier to do tracing / stack analysis / operation grouping? Or are these tasks well solved by stack-sample analysis alone, without resorting to such hacks for operation tracking?

Some use cases for customized allocators: tracing; pre-allocation; arena allocation/deallocation.
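For example, the arena case could be served by a plugin as simple as this (a deliberately simplistic sketch using the same extern "C" signature as the example in the PR description; hypothetical names, no per-block reuse, everything is released only when the process exits):

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <atomic>
#include <cstddef>

// One big slab, bump-allocated; frees are no-ops.
namespace {
void* g_base = nullptr;
std::atomic<size_t> g_offset{0};
constexpr size_t kArenaBytes = size_t{1} << 30;  // 1 GiB slab (arbitrary)
}

extern "C" {
void* arena_malloc(ssize_t size, int /*device*/, cudaStream_t /*stream*/) {
  if (g_base == nullptr) {          // note: not thread-safe; illustration only
    cudaMalloc(&g_base, kArenaBytes);
  }
  size_t aligned = (static_cast<size_t>(size) + 255) & ~size_t{255};
  size_t offset = g_offset.fetch_add(aligned);
  if (offset + aligned > kArenaBytes) {
    return nullptr;                 // arena exhausted
  }
  return static_cast<char*>(g_base) + offset;
}

void arena_free(void* /*ptr*/) {
  // Bump arenas release everything at once, so individual frees do nothing.
}
}
```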

@emcastillo
Collaborator Author

Sorry @zdevito, I think we may need to move the CUDAPluggableAllocator definition from torch/csrc/cuda to c10/cuda for the hipify machinery to work correctly.

@albanD albanD added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Nov 10, 2022
Contributor

@zdevito zdevito left a comment

This looks good -- I only have a couple of minor nits inline about naming and documentation. I saw the comment about hipify, but I am not an expert on how it works, so I am not sure what is happening there. It is likely fixable, so if you send more details I can forward them to someone who understands it better.

```c++
void changeCurrentAllocator(
    std::shared_ptr<c10::cuda::CUDACachingAllocator::CUDAAllocator> allocator) {
  TORCH_CHECK(
      !getCurrentAllocator()->initialized(),
```
Contributor

There is a race here where we can still end up with

  • Thread 1: getCurrentAllocator()
  • Thread 2: allocate some memory with raw_alloc
  • Thread 1: store new allocator
  • Thread 2: raw_delete (calls wrong one and crashes)

However, I don't think this scenario is common. We should document that changing the allocator after allocations have occurred may result in undefined behavior. Longer term, the best resolution is to remove the raw_alloc/raw_delete interfaces.

Collaborator Author

The pluggable allocator keeps a record of active memory allocations, so it will hard-error when it receives a pointer that it hasn't allocated. This record is needed to pass the free function the size and stream of the allocation, as requested by the NVIDIA folks.
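Roughly this kind of bookkeeping (a simplified sketch, not the exact code in this PR):

```c++
#include <cuda_runtime_api.h>
#include <cstddef>
#include <mutex>
#include <stdexcept>
#include <unordered_map>

// Track every live allocation so size/stream can be forwarded to free_fn.
struct AllocationInfo {
  size_t size;
  int device;
  cudaStream_t stream;
};

std::mutex g_mutex;
std::unordered_map<void*, AllocationInfo> g_live_allocations;

void record_allocation(void* ptr, size_t size, int device, cudaStream_t stream) {
  std::lock_guard<std::mutex> lock(g_mutex);
  g_live_allocations[ptr] = {size, device, stream};
}

AllocationInfo pop_allocation(void* ptr) {
  std::lock_guard<std::mutex> lock(g_mutex);
  auto it = g_live_allocations.find(ptr);
  if (it == g_live_allocations.end()) {
    throw std::runtime_error("pointer was not allocated by this allocator");
  }
  AllocationInfo info = it->second;
  g_live_allocations.erase(it);
  return info;  // size/stream can now be passed to the user's free_fn
}
```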

@emcastillo emcastillo force-pushed the new-custom-allocator branch 2 times, most recently from dabf810 to 7a16c13 on November 11, 2022 07:49
@emcastillo
Collaborator Author

I think all should be green and ready now, thanks!

@emcastillo
Collaborator Author

@zdevito how does this look to you now? :)

Collaborator

@albanD albanD left a comment

Very exciting. I'm sure there are small incremental improvements we can add moving forward, but this is really good already. I didn't look into the allocator itself in great detail because Zach did.

```python
import torch

# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
```
Collaborator

nit: could we have an example here that uses torch.utils.cpp_extension.load_inline()? That might be simpler for users.
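Something along these lines, perhaps (an untested sketch; it assumes that load_inline with is_python_module=False writes the shared library as <name>.so into the given build_directory, and that the extern "C" symbols stay visible to dlopen/dlsym -- both are assumptions, not verified behavior):

```python
import os
import torch
from torch.utils import cpp_extension

alloc_source = r"""
#include <sys/types.h>
#include <cuda_runtime_api.h>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
  void* ptr = nullptr;
  cudaMalloc(&ptr, size);
  return ptr;
}
void my_free(void* ptr) { cudaFree(ptr); }
}
"""

build_dir = os.path.abspath("alloc_build")
os.makedirs(build_dir, exist_ok=True)

# Build the source into a plain shared library (no Python bindings).
cpp_extension.load_inline(
    name="my_alloc",
    cpp_sources=alloc_source,
    with_cuda=True,           # add CUDA include paths and link cudart
    is_python_module=False,   # just build and load a shared library
    build_directory=build_dir,
)

# Assumed location of the built library: <build_directory>/<name>.so
so_path = os.path.join(build_dir, "my_alloc.so")

new_alloc = torch.cuda.memory.CUDAPluggableAllocator(so_path, "my_malloc", "my_free")
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device="cuda")
```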

@emcastillo
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 23, 2022
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

jeanschmidt added a commit that referenced this pull request Nov 28, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022