[BUG] Grid-stride kernels return instead of continue #594

@shi-eric

Description

Bug Description

@adenzler-nvidia discovered that Warp kernels containing an early return may not work correctly when the kernel is compiled into a grid-stride loop (e.g. when running on CUDA devices with a limited block count). When a single GPU thread processes multiple Warp threads, a return prevents that GPU thread from processing the rest of its assigned elements. A minimal reproduction:

import warp as wp

wp.init()

wp.clear_kernel_cache()


@wp.kernel
def conditional_return_or_sum(result: wp.array(dtype=wp.int32)):
    tid = wp.tid()

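    # The first 256 Warp threads return early; the remaining 768 should each increment the counter.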
    if tid < 256:
        return

    wp.atomic_add(result, 0, 1)


with wp.ScopedDevice("cuda:0"):
    result = wp.zeros(1, dtype=wp.int32)

    grid_size = 1024

    # On CUDA devices, this becomes a grid-stride loop
    wp.launch(conditional_return_or_sum, dim=grid_size, inputs=[result], block_dim=256, max_blocks=1)

    print(f"result: {result.numpy()[0]}, expected: {grid_size - 256}")

    wp.synchronize()

The above code prints result: 0, expected: 768. Note that the number of blocks was limited to 1 to force each GPU thread to process multiple elements (1024 elements across 256 GPU threads, i.e. four work items per GPU thread).
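The issue title points at the likely cause: on CUDA devices, limiting the launch with max_blocks makes Warp wrap the kernel body in a grid-stride loop, so each GPU thread iterates over several Warp thread indices. If the kernel's early return is emitted as a plain return inside that loop, the GPU thread stops before reaching its remaining indices. The pure-Python model below (illustrative only, not Warp's actual generated code) reproduces the arithmetic: every one of the 256 GPU threads starts at a tid below 256 and bails out immediately, giving the observed 0, whereas translating the return to a continue yields the expected 768.

# Hypothetical model of the grid-stride loop; names and structure are
# illustrative, not Warp's actual generated code.
GRID_SIZE = 1024        # wp.launch dim
NUM_GPU_THREADS = 256   # block_dim * max_blocks

def gpu_thread_buggy(thread_index, result):
    # Kernel body inlined into the loop: 'return' exits the whole loop,
    # so this GPU thread never processes its remaining tids.
    for tid in range(thread_index, GRID_SIZE, NUM_GPU_THREADS):
        if tid < 256:
            return
        result[0] += 1

def gpu_thread_expected(thread_index, result):
    # Intended semantics: the early return should only skip the current
    # Warp thread, i.e. act like 'continue' in the grid-stride loop.
    for tid in range(thread_index, GRID_SIZE, NUM_GPU_THREADS):
        if tid < 256:
            continue
        result[0] += 1

for fn in (gpu_thread_buggy, gpu_thread_expected):
    result = [0]
    for i in range(NUM_GPU_THREADS):
        fn(i, result)
    print(fn.__name__, result[0])  # buggy: 0, expected: 768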

System Information

No response

Metadata

Assignees

Labels

bug: Something isn't working

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests
