Make context handling in GPU runtimes more consistent and robust. #5474

zvookin · 2020-11-24T08:35:36Z

Move to using common code for kernel compilation caching for CUDA, OpenCL, Metal, and D3D12 GPU runtimes. New caching endeavors to be robust in not using a kernel compiled for one context on another and uses a hash table to avoid small allocations across multiple pages of VM. OpenCL was particularly broken in that code using two contexts was almost guaranteed to fail. This PR also opens the door to allowing better client control of caching, such as setting a size limit or allowing eviction of specific kernels, and is pretty close to allowing runtime overloads of the kernel compilation itself to allow persistent caching across process invocations for GPU APIs that allow this. (The compile_kernel function in multiple files needs to be promoted to a client visible runtime overload for each GPU API.)
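
To make the caching idea above concrete, here is a minimal sketch of a compilation cache keyed on both the context and a kernel id, stored in a single open-addressed hash table. It is not the code from this PR: `CompilationCache`, `ContextHandle`, `ModuleHandle`, `lookup_or_compile`, and `release_context` are hypothetical names chosen for illustration, and the sketch does no locking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins, not the PR's types.
using ContextHandle = const void *;  // e.g. a CUDA or OpenCL context handle
using ModuleHandle = int;            // e.g. a compiled module or program

class CompilationCache {
    struct Entry {
        bool used = false;
        uint32_t kernel_id = 0;
        ContextHandle context = nullptr;
        ModuleHandle module = 0;
    };

    std::vector<Entry> table_;  // one contiguous block; size is a power of two
    size_t count_ = 0;

    static size_t hash(ContextHandle ctx, uint32_t id) {
        uint64_t h = reinterpret_cast<uintptr_t>(ctx);
        h ^= static_cast<uint64_t>(id) * 0x9E3779B97F4A7C15ull;
        h ^= h >> 32;
        return static_cast<size_t>(h);
    }

    // Linear probing: returns the slot holding (ctx, id), or the first free slot.
    static Entry &probe(std::vector<Entry> &table, ContextHandle ctx, uint32_t id) {
        const size_t mask = table.size() - 1;
        size_t i = hash(ctx, id) & mask;
        while (table[i].used &&
               !(table[i].context == ctx && table[i].kernel_id == id)) {
            i = (i + 1) & mask;
        }
        return table[i];
    }

    // Double the table and reinsert every live entry.
    void grow() {
        std::vector<Entry> bigger(table_.size() * 2);
        for (const Entry &e : table_) {
            if (e.used) probe(bigger, e.context, e.kernel_id) = e;
        }
        table_.swap(bigger);
    }

public:
    CompilationCache() : table_(16) {}

    // Returns the module for (ctx, id), calling compile(ctx, id) only on a miss.
    template<typename CompileFn>
    ModuleHandle lookup_or_compile(ContextHandle ctx, uint32_t id, CompileFn compile) {
        if ((count_ + 1) * 2 > table_.size()) grow();  // keep load factor <= 1/2
        Entry &e = probe(table_, ctx, id);
        if (!e.used) {
            e.used = true;
            e.kernel_id = id;
            e.context = ctx;
            e.module = compile(ctx, id);
            count_++;
        }
        return e.module;
    }

    // Drop all entries for a context that is going away by rebuilding the table.
    void release_context(ContextHandle ctx) {
        std::vector<Entry> kept(table_.size());
        size_t n = 0;
        for (const Entry &e : table_) {
            if (e.used && e.context != ctx) {
                probe(kept, e.context, e.kernel_id) = e;
                n++;
            }
        }
        table_.swap(kept);
        count_ = n;
    }
};
```

Because the context is part of the key, a module compiled for one context is never returned for another, and a lookup touches one contiguous table instead of many small allocations. A real runtime would also need locking and would have to release the underlying GPU objects when entries are dropped.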

Tests are added to cover many kernels and more than one context. A test using multiple contexts across multiple threads both tests things that didn't necessarily work before and provides an example for a common use case.
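
The multi-context, multi-thread scenario can be sketched without any GPU API. In the sketch below every name (`FakeContext`, `FakeModule`, `SharedCache`, `run_multi_context_test`) is hypothetical, and a `std::map` behind a mutex stands in for the runtime's cache. The actual test in this PR would create CUDA or OpenCL contexts instead.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-ins; a real test would create GPU contexts here.
struct FakeContext {
    int id;
};
struct FakeModule {
    const FakeContext *owner;
    uint32_t kernel_id;
};

class SharedCache {
    std::mutex mutex_;
    std::map<std::pair<const FakeContext *, uint32_t>, FakeModule> modules_;
    int compile_count_ = 0;

public:
    // Returns the module for (ctx, kernel_id), "compiling" it on first use.
    FakeModule get(const FakeContext *ctx, uint32_t kernel_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto key = std::make_pair(ctx, kernel_id);
        auto it = modules_.find(key);
        if (it == modules_.end()) {
            compile_count_++;
            it = modules_.emplace(key, FakeModule{ctx, kernel_id}).first;
        }
        return it->second;
    }

    int compile_count() {
        std::lock_guard<std::mutex> lock(mutex_);
        return compile_count_;
    }
};

// Each worker thread owns one context and requests the same kernels repeatedly.
// Returns how many lookups produced a module for the wrong context or kernel.
int run_multi_context_test(SharedCache &cache, int num_threads,
                           int num_kernels, int repeats) {
    std::vector<FakeContext> contexts(num_threads);
    std::vector<int> mismatches(num_threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++) {
        contexts[t].id = t;
        workers.emplace_back([&, t]() {
            for (int r = 0; r < repeats; r++) {
                for (int k = 0; k < num_kernels; k++) {
                    FakeModule m = cache.get(&contexts[t], static_cast<uint32_t>(k));
                    if (m.owner != &contexts[t] ||
                        m.kernel_id != static_cast<uint32_t>(k)) {
                        mismatches[t]++;
                    }
                }
            }
        });
    }
    for (auto &w : workers) w.join();
    int total = 0;
    for (int m : mismatches) total += m;
    return total;
}
```

With 4 threads and 10 kernels, a correct cache compiles each kernel once per context (40 compilations in total) no matter how often the lookups repeat, and no thread ever sees a module that belongs to another thread's context.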

Two small fixes to CUDA prevent a crash in a very rare error case and make device release work if the CUDA library is linked directly into the app. (The latter would have shown up as a crash due to allocation caching for static linking as the code to release allocations when freeing a context did not run.)

OpenGL and OpenGLCompute were not addressed in this PR due to both time limitations and because there are more significant issues in these runtimes around this area. OpenGL is basically a Superfund site at this point and should be deleted. OpenGLCompute may or may not be worth preserving, though similar work is needed re: how kernels are communicated to the runtime and compiled.


the initial support and adds tests. Currently the gpu_multi test only has context creation code for CUDA and OpenCL. This should be added for other GPU runtimes, but some coverage is provided via using the default context for these APIs. Fixes a bug in CUDA runtime where some error message text in cuda_do_multidimensional_copy was not initialized. Fixes a bug in CUDA runtime where device release code did not run if CUDA libraries are directly linked into the executable. (This would have caused crashes due to the device allocation caching among other issues.)


steven-johnson · 2020-11-24T18:50:19Z

> OpenGL is basically a Superfund site at this point and should be deleted. OpenGLCompute may or may not be worth preserving

See #5475

steven-johnson

Looks good from a quick skim -- gonna wait for buildbots to look clean(er) before reviewing more.


steven-johnson

LGTM with nits


zvookin · 2020-12-02T22:42:37Z

> super-duper-style-nit: seems like `constexpr bool HAS_MULTIPLE_CONTEXTS = true;` (etc) would be preferable?

Possibly will be used to control conditional compilation at some point. Really it should probably nerf the test entirely outside of GPU APIs it can make contexts for. But it does get a little coverage on e.g. Metal and Direct3d so...
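
For illustration, the two approaches can be combined. The sketch below assumes the flag is selected by preprocessor macros named `TEST_CUDA` and `TEST_OPENCL` (hypothetical names, as is `contexts_to_test`) and then exposed as a typed `constexpr bool`.

```cpp
#include <cassert>

// Hypothetical macro names; the real test may select the API differently.
#if defined(TEST_CUDA) || defined(TEST_OPENCL)
constexpr bool HAS_MULTIPLE_CONTEXTS = true;
#else
constexpr bool HAS_MULTIPLE_CONTEXTS = false;
#endif

// Number of contexts the test should exercise for a given capability.
constexpr int contexts_to_test(bool has_multiple) {
    return has_multiple ? 2 : 1;
}

int planned_context_count() {
    return contexts_to_test(HAS_MULTIPLE_CONTEXTS);
}
```

A `constexpr bool` is type-checked and can be used in ordinary expressions, but code guarded only by an `if` on it must still compile. An `#if` remains necessary around code that calls API functions whose headers are not available in a given build, which is the conditional-compilation case mentioned above.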


Z Stern added 8 commits November 18, 2020 16:56

Initial commit of common code for caching GPU program

18461ae

(shader/kernel/etc.) compilations.

Merge branch 'master' into gpu_context_consistency

b5540e6

New GPU compilation cache.

05cf15e

Add cmake support.

48cf2be

Add initial commits explaining what tests do.

Address clang format lint issues.

f9ba360

Merge branch 'master' into gpu_context_consistency

f7c7a65

More formatting fixes.

1b63357

shoaibkamil reviewed Nov 24, 2020

src/runtime/metal.cpp (outdated, resolved)

Z Stern added 3 commits November 24, 2020 09:31

Fix spelling error.

a406376

Clang format fix.

d5e60af

Rename gpu_multi_generator.cpp to match the name of the test that uses

61cabe3

it to stick closer to naming pattern and work with CMake rules code.

steven-johnson mentioned this pull request Nov 24, 2020

Future of OpenGL/OpenGLCompute needs deciding #5475

Closed

steven-johnson reviewed Nov 24, 2020

Makefile (outdated, resolved)

src/runtime/d3d12compute.cpp (outdated, resolved)

src/runtime/gpu_context_common.h (resolved)

test/correctness/gpu_many_kernels.cpp (outdated, resolved)

Z Stern added 4 commits November 25, 2020 14:37

Was using the wrong kind of context value. Worked fine, but was

ffdcb35

passing more than it needed to.

Merge branch 'master' into gpu_context_consistency

895e680

Merge branch 'master' into gpu_context_consistency

a57bf25

Address review comments, CMake mistake.

4948453

steven-johnson reviewed Nov 30, 2020

test/correctness/CMakeLists.txt (outdated, resolved)

steven-johnson reviewed Nov 30, 2020

src/runtime/gpu_context_common.h (outdated, resolved)

steven-johnson reviewed Nov 30, 2020

src/runtime/gpu_context_common.h (outdated, resolved)

Z Stern added 2 commits November 30, 2020 17:42

Typo fix.

db39f4f

Address review feedback. Fix cmake semantic comment.

7cfb51e

steven-johnson approved these changes Dec 2, 2020

Makefile (resolved)

src/runtime/cuda.cpp (outdated, resolved)

src/runtime/gpu_context_common.h (outdated, resolved)

Z Stern added 5 commits December 2, 2020 12:02

Merge branch 'master' into gpu_context_consistency

82afaf0

Address review feedback.

0a57c4b

Revert "Address review feedback."

0ccd159

This reverts commit 0a57c4b.

Revert clang-format damage. Address review feedback.

9c1a4b8

Make test work without GPU support.

b2d0bca

zvookin merged commit f47c5c9 into master Dec 2, 2020

zvookin deleted the gpu_context_consistency branch December 2, 2020 22:40

steven-johnson added a commit that referenced this pull request Dec 3, 2020

Revert "Make context handling in GPU runtimes more consistent and rob…

e74663c

…ust. (#5474)" This reverts commit f47c5c9.

steven-johnson mentioned this pull request Dec 3, 2020

Revert "Make context handling in GPU runtimes more consistent and robust." #5515

Merged

steven-johnson added a commit that referenced this pull request Dec 3, 2020

Revert "Make context handling in GPU runtimes more consistent and rob…

2ddd0b0

…ust. (#5474)" (#5515) This reverts commit f47c5c9.

zvookin pushed a commit that referenced this pull request Dec 10, 2020

Revert "Revert "Make context handling in GPU runtimes more consistent…

d6f6053

… and robust. (#5474)" (#5515)" This reverts commit 2ddd0b0.

zvookin mentioned this pull request Dec 11, 2020

Make GPU kernel compilation caching consistent across GPU backends. #5546

Merged

alexreinking added this to the v11.0.0 milestone Dec 11, 2020
