Skip to content

Conversation

RSchwan
Copy link
Contributor

@RSchwan RSchwan commented Jun 4, 2025

Description

This PR short circuits an unnecessary cuda api call when launching a kernel. This can improve performance in the case the kernel run time is tiny.

This is the profile before:
image
and after:
image

Before your PR is "Ready for review"

  • All commits are signed-off to indicate that your contribution adheres to the Developer Certificate of Origin requirements
  • Necessary tests have been added
  • Documentation is up-to-date
  • Auto-generated files modified by compiling Warp and building the documentation have been updated (e.g. stubs.py, functions.rst)
  • Code passes formatting and linting checks with pre-commit run -a

Signed-off-by: Roland Schwan <roland.schwan@mikrounix.com>
@RSchwan RSchwan changed the title reduce cuda api overhead for launch Reduce cuda api overhead for kernel launch Jun 4, 2025
@christophercrouzet
Copy link
Member

Thanks @RSchwan, this looks good to me.

@shi-eric
Copy link
Contributor

shi-eric commented Jun 5, 2025

I mentioned this pull request to @nvlukasz over dinner and he pointed out that this won't work because we need to care about the cases in which another framework initiates the graph capture. Apparently he's considered this topic for quite some time.

@christophercrouzet
Copy link
Member

That's exactly why I added @nvlukasz as a reviewer 😁 Thanks!

@RSchwan
Copy link
Contributor Author

RSchwan commented Jun 5, 2025

But is it actually a problem in this particular case? In case runtime.captures has no elements, the calls to runtime.core.cuda_stream_is_capturing(stream.cuda_stream) and runtime.core.cuda_stream_get_capture_id(stream.cuda_stream) are not used since runtime.captures.get(capture_id) will always return None. Hence, I'm just short-circuiting useless API calls (in a logical sense).

@shi-eric
Copy link
Contributor

shi-eric commented Jun 6, 2025

But is it actually a problem in this particular case? In case runtime.captures has no elements, the calls to runtime.core.cuda_stream_is_capturing(stream.cuda_stream) and runtime.core.cuda_stream_get_capture_id(stream.cuda_stream) are not used since runtime.captures.get(capture_id) will always return None. Hence, I'm just short-circuiting useless API calls (in a logical sense).

Good point! What you said makes sense to me, I'll ask @nvlukasz to take another look to confirm.

@shi-eric shi-eric merged commit 21e5a0b into NVIDIA:main Jun 6, 2025
13 checks passed
@shi-eric
Copy link
Contributor

shi-eric commented Jun 6, 2025

Sorry for the confusion, everything's good and merged now. Thanks again @RSchwan!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants