Reduce cuda api overhead for kernel launch #774

RSchwan · 2025-06-04T22:23:19Z

Description

This PR short-circuits an unnecessary CUDA API call when launching a kernel. This can improve performance when the kernel run time is tiny, since the launch overhead then dominates.

This is the profile before:

and after:

Before your PR is "Ready for review"

All commits are signed-off to indicate that your contribution adheres to the Developer Certificate of Origin requirements
Necessary tests have been added
Documentation is up-to-date
Auto-generated files modified by compiling Warp and building the documentation have been updated (e.g. stubs.py, functions.rst)
Code passes formatting and linting checks with pre-commit run -a

Signed-off-by: Roland Schwan <roland.schwan@mikrounix.com>

christophercrouzet · 2025-06-05T19:59:47Z

Thanks @RSchwan, this looks good to me.

shi-eric · 2025-06-05T20:19:39Z

I mentioned this pull request to @nvlukasz over dinner and he pointed out that this won't work because we need to care about the cases in which another framework initiates the graph capture. Apparently he's considered this topic for quite some time.

christophercrouzet · 2025-06-05T21:01:04Z

That's exactly why I added @nvlukasz as a reviewer 😁 Thanks!

RSchwan · 2025-06-05T22:27:27Z

But is it actually a problem in this particular case? In case runtime.captures has no elements, the calls to runtime.core.cuda_stream_is_capturing(stream.cuda_stream) and runtime.core.cuda_stream_get_capture_id(stream.cuda_stream) are not used since runtime.captures.get(capture_id) will always return None. Hence, I'm just short-circuiting useless API calls (in a logical sense).
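To make the argument concrete, here is a minimal sketch. It is not Warp's actual launch code: `FakeCore`, `find_capture_full`, and `find_capture_short_circuit` are hypothetical names used only for illustration, and the two query methods merely imitate the `runtime.core` calls mentioned above while counting how often they are invoked. It assumes the lookup result is only ever consumed through a dictionary `get` on the registered captures.

```python
class FakeCore:
    """Hypothetical stand-in for runtime.core; counts simulated CUDA API calls."""

    def __init__(self, capturing=False, capture_id=0):
        self.capturing = capturing
        self.capture_id = capture_id
        self.api_calls = 0

    def cuda_stream_is_capturing(self, stream):
        self.api_calls += 1
        return self.capturing

    def cuda_stream_get_capture_id(self, stream):
        self.api_calls += 1
        return self.capture_id


def find_capture_full(core, captures, stream):
    """Always query the stream, then look the capture id up in `captures`."""
    if core.cuda_stream_is_capturing(stream):
        capture_id = core.cuda_stream_get_capture_id(stream)
        return captures.get(capture_id)
    return None


def find_capture_short_circuit(core, captures, stream):
    """Skip the stream queries entirely when no captures are registered."""
    if not captures:
        return None
    return find_capture_full(core, captures, stream)


# Simulate another framework capturing on the stream while the
# registry of known captures is empty.
core_a = FakeCore(capturing=True, capture_id=7)
core_b = FakeCore(capturing=True, capture_id=7)
result_full = find_capture_full(core_a, {}, stream=None)
result_short = find_capture_short_circuit(core_b, {}, stream=None)

# A non-empty registry still goes through the full lookup.
core_c = FakeCore(capturing=True, capture_id=7)
result_registered = find_capture_short_circuit(core_c, {7: "graph"}, stream=None)
```

In this sketch both variants return `None` for an empty registry even though the stream is being captured externally; the only difference is that the short-circuit version makes no simulated API calls.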

shi-eric · 2025-06-06T00:05:12Z

> But is it actually a problem in this particular case? In case runtime.captures has no elements, the calls to runtime.core.cuda_stream_is_capturing(stream.cuda_stream) and runtime.core.cuda_stream_get_capture_id(stream.cuda_stream) are not used since runtime.captures.get(capture_id) will always return None. Hence, I'm just short-circuiting useless API calls (in a logical sense).

Good point! What you said makes sense to me, I'll ask @nvlukasz to take another look to confirm.

shi-eric · 2025-06-06T16:30:05Z

Sorry for the confusion, everything's good and merged now. Thanks again @RSchwan!

reduce cuda api overhead for launch

b7fb17f

Signed-off-by: Roland Schwan <roland.schwan@mikrounix.com>

RSchwan changed the title ~~reduce cuda api overhead for launch~~ Reduce cuda api overhead for kernel launch Jun 4, 2025

shi-eric requested a review from christophercrouzet June 5, 2025 00:37

christophercrouzet requested a review from nvlukasz June 5, 2025 19:56

shi-eric merged commit 21e5a0b into NVIDIA:main Jun 6, 2025
13 checks passed
