Closed
Description
Bug Description
@adenzler-nvidia discovered that Warp kernels containing an early return can behave incorrectly inside grid-stride loops (e.g., when running kernels on GPU devices). When a GPU thread processes multiple Warp threads, a return prevents the GPU thread from processing the intended number of elements.
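The failure mode can be sketched in plain Python. This is a simulation only: `kernel_body` and `grid_stride` are hypothetical helpers standing in for the generated CUDA code, not Warp APIs.

```python
def kernel_body(tid, result):
    # Mirrors the reproduction kernel below: skip the first 256 threads.
    if tid < 256:
        return True   # signal that the early return was taken
    result[0] += 1
    return False

def grid_stride(dim, num_physical, result, buggy):
    # Each physical GPU thread walks tids phys, phys + stride, phys + 2*stride, ...
    for phys in range(num_physical):
        tid = phys
        while tid < dim:
            returned = kernel_body(tid, result)
            if buggy and returned:
                # Bug: the `return` in the generated code exits the whole
                # grid-stride loop, not just the current iteration.
                break
            tid += num_physical

ok, bad = [0], [0]
grid_stride(1024, 256, ok, buggy=False)   # correct: ok[0] == 768
grid_stride(1024, 256, bad, buggy=True)   # buggy:   bad[0] == 0
```

With 256 physical threads and `dim=1024`, each physical thread owns four tids; in the buggy case its first tid is always below 256, so the early return kills the whole loop and nothing is ever accumulated.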
```python
import warp as wp

wp.init()
wp.clear_kernel_cache()

@wp.kernel
def conditional_return_or_sum(result: wp.array(dtype=wp.int32)):
    tid = wp.tid()
    if tid < 256:
        return
    wp.atomic_add(result, 0, 1)

with wp.ScopedDevice("cuda:0"):
    result = wp.zeros(1, dtype=wp.int32)
    grid_size = 1024

    # On CUDA devices, this becomes a grid-stride loop
    wp.launch(conditional_return_or_sum, dim=grid_size, inputs=[result], block_dim=256, max_blocks=1)

    print(f"result: {result.numpy()[0]}, expected: {grid_size - 256}")
    wp.synchronize()
```
The above code prints out `result: 0, expected: 768`. Note that we limited the number of blocks to 1 to force each GPU thread to process multiple elements (in this case, each GPU thread processes four work items).
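Until the code generation is fixed, a possible workaround (assuming the kernel logic permits it) is to invert the condition so the kernel body contains no early return. Sketched here as a plain-Python comparison with hypothetical functions, not Warp code:

```python
def with_early_return(tid, result):
    # Original pattern: early return, unsafe inside a grid-stride loop.
    if tid < 256:
        return
    result[0] += 1

def without_early_return(tid, result):
    # Inverted condition: same effect per tid, but no `return` statement.
    if tid >= 256:
        result[0] += 1

a, b = [0], [0]
for tid in range(1024):
    with_early_return(tid, a)
    without_early_return(tid, b)
print(a[0], b[0])  # both 768
```

The two forms are equivalent per thread, but only the second is safe when one GPU thread serves multiple Warp thread IDs.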
System Information
No response
Metadata
Assignees
Labels
bug (Something isn't working)