Closed
Description
Bug Description
@adenzler-nvidia discovered that Warp kernels containing an early return can behave incorrectly inside grid-stride loops (e.g., when running kernels on GPU devices). When a GPU thread processes multiple Warp threads, a return prevents the GPU thread from processing the intended number of elements.
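The failure mode can be sketched in plain Python. This is a simulation only: `kernel_body` and `grid_stride` are hypothetical helpers standing in for the generated CUDA code, not Warp APIs.

```python
def kernel_body(tid, result):
    # Mirrors the reproduction kernel below: skip the first 256 threads.
    if tid < 256:
        return True   # signal that the early return was taken
    result[0] += 1
    return False

def grid_stride(dim, num_physical, result, buggy):
    # Each physical GPU thread walks tids phys, phys + stride, phys + 2*stride, ...
    for phys in range(num_physical):
        tid = phys
        while tid < dim:
            returned = kernel_body(tid, result)
            if buggy and returned:
                # Bug: the `return` in the generated code exits the whole
                # grid-stride loop, not just the current iteration.
                break
            tid += num_physical

ok, bad = [0], [0]
grid_stride(1024, 256, ok, buggy=False)   # correct: ok[0] == 768
grid_stride(1024, 256, bad, buggy=True)   # buggy:   bad[0] == 0
```

With 256 physical threads and `dim=1024`, each physical thread owns four tids; in the buggy case its first tid is always below 256, so the early return kills the whole loop and nothing is ever accumulated.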
```python
import warp as wp

wp.init()
wp.clear_kernel_cache()

@wp.kernel
def conditional_return_or_sum(result: wp.array(dtype=wp.int32)):
    tid = wp.tid()
    if tid < 256:
        return
    wp.atomic_add(result, 0, 1)

with wp.ScopedDevice("cuda:0"):
    result = wp.zeros(1, dtype=wp.int32)
    grid_size = 1024

    # On CUDA devices, this becomes a grid-stride loop
    wp.launch(conditional_return_or_sum, dim=grid_size, inputs=[result], block_dim=256, max_blocks=1)

    print(f"result: {result.numpy()[0]}, expected: {grid_size - 256}")
    wp.synchronize()
```
The above code prints out `result: 0, expected: 768`. Note that we limited the number of blocks to 1 to force each GPU thread to process multiple elements (in this case, each GPU thread processes four work items).
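Until the code generation is fixed, a possible workaround (assuming the kernel logic permits it) is to invert the condition so the kernel body contains no early return. Sketched here as a plain-Python comparison with hypothetical functions, not Warp code:

```python
def with_early_return(tid, result):
    # Original pattern: early return, unsafe inside a grid-stride loop.
    if tid < 256:
        return
    result[0] += 1

def without_early_return(tid, result):
    # Inverted condition: same effect per tid, but no `return` statement.
    if tid >= 256:
        result[0] += 1

a, b = [0], [0]
for tid in range(1024):
    with_early_return(tid, a)
    without_early_return(tid, b)
print(a[0], b[0])  # both 768
```

The two forms are equivalent per thread, but only the second is safe when one GPU thread serves multiple Warp thread IDs.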
System Information
No response
Metadata
Assignees
Labels
bug (Something isn't working)