vulkan_radix_sort

Vulkan implementation of radix sort.

Reduce-then-scan GPU radix sort algorithm is implemented (Onesweep is abandoned.)

Recent Changes

v0.1.0
- Use VK_KHR_push_descriptor instead of VK_KHR_buffer_device_address
  - Not to use vulkan-specific language in shader codes.
  - Buffers no longer need to have VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT when created.
  - Buffers offsets must be multiple of minStorageBufferOffsetAlignment (16 in most cases).
  - Users of previous version need to update codes for device creation accordingly.
- Migrate from GLSL to Slang shader language
  - For extensibility to other graphics APIs.
v0.0.0
- First publish.

Requirements

VulkanSDK>=1.3.296.0
- Download from https://vulkan.lunarg.com/ and follow install instruction.
- slangc executable is included in VulkanSDK>=1.3.296.0.
- Requires several features available in 1.1.
- Must support VK_KHR_push_descriptor:
  - Run vulkaninfo and check if VK_KHR_push_descriptor device extension is available.
cmake>=3.15

Build

$ cmake . -B build
$ cmake --build build --config Release -j

Test

$ ./build/Release/bench.exe <N> <type>  # Windows
$ ./build/bench <N> <type>              # Linux
$ ./build/bench 10000000 vulkan

N = number of elements to sort
type = one of cpu,vulkan,cuda

Test Environment

Windows, NVIDIA GeForce RTX 4090.

Benchmark Result

Not precisely benchmarked, but the speed is competitive compared to CUB Reduce-then-Scan radix sort.
- I could infer CUB runs from DeviceRadixSortUpsweepKernel, RadixSortScanBinsKernel, DeviceRadixSortDownsweepKernel
- There seems a way to enable CUB Onesweep in the CUB benchmark code, I will study it some time!
32-bit key-only: my implementation is 10% slower when sorting 33M (2^25) elements.
32-bit Key-value: my implementation is 15-25% faster when sorting 33M (2^25) key-value pairs.

vulkan

> .\build\Release\bench.exe 33554432 vulkan
vk_radix_sort benchmark
================ sort ================
total time: 2.67571ms (12.5404 GItems/s)
================ sort key value ================
total time: 3.42221ms (9.80491 GItems/s)
================ sort key value speed ================
[0] total time: 3.41706ms (9.81969 GItems/s)
[1] total time: 3.43142ms (9.77857 GItems/s)
[2] total time: 3.42298ms (9.80271 GItems/s)
[3] total time: 3.46208ms (9.69199 GItems/s)
[4] total time: 3.42426ms (9.79904 GItems/s)
[5] total time: 3.43725ms (9.762 GItems/s)
[6] total time: 3.42016ms (9.81078 GItems/s)
[7] total time: 3.42016ms (9.81078 GItems/s)
[8] total time: 3.42099ms (9.80839 GItems/s)
[9] total time: 3.41606ms (9.82254 GItems/s)
...

CUDA Version 12.6 CUB Reduce-then-Scan

> .\build\Release\bench.exe 33554432 cuda
vk_radix_sort benchmark
================ sort ================
total time: 2.5047ms (13.3966 GItems/s)
================ sort key value ================
total time: 4.19226ms (8.00391 GItems/s)
================ sort key value speed ================
[0] total time: 4.20352ms (7.98246 GItems/s)
[1] total time: 4.50355ms (7.45066 GItems/s)
[2] total time: 4.21376ms (7.96306 GItems/s)
[3] total time: 4.22298ms (7.94568 GItems/s)
[4] total time: 4.22208ms (7.94737 GItems/s)
[5] total time: 4.2199ms (7.95147 GItems/s)
[6] total time: 4.21274ms (7.965 GItems/s)
[7] total time: 4.20352ms (7.98246 GItems/s)
[8] total time: 4.21376ms (7.96306 GItems/s)
[9] total time: 4.21478ms (7.96113 GItems/s)
...

Use as a Library with CMake

Add subdirectory vulkan_radix_sort

add_subdirectory(path/to/vulkan_radix_sort)

Link to vk_radix_sort in your project (library, binary)

target_link_libraries(my_project PRIVATE Vulkan::Vulkan VulkanMemoryAllocator vk_radix_sort)

Usage

When creating VkDevice, add VK_KHR_PUSH_DESCRIPTOR_EXTENSION_NAME (="VK_KHR_push_descriptor").
Create VkBuffer for keys and values, with VK_BUFFER_USAGE_STORAGE_BUFFER_BIT.

Create VrdxSorter

It creates shared resources: pipeline layouts, pipelines, etc.

VrdxSorter sorter = VK_NULL_HANDLE;
VrdxSorterCreateInfo sorterInfo = {};
sorterInfo.physicalDevice = physicalDevice;
sorterInfo.device = device;
sorterInfo.pipelineCache = pipelineCache;
vrdxCreateSorter(&sorterInfo, &sorter);

Create a temporary storage buffer for sort.

// request storage buffer request
VrdxSorterStorageRequirements requirements;
// for key-only
vrdxGetSorterStorageRequirements(sorter, elementCount, &requirements);
// for key-value
vrdxGetSorterKeyValueStorageRequirements(sorter, elementCount, &requirements);

// create or reuse buffer
VkBufferCreateInfo bufferInfo = {VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO};
bufferInfo.size = requirements.size;
bufferInfo.usage = requirements.usage;
// ...

Record sort commands.

Requirements: buffer offsets must be multiple of minStorageBufferOffsetAlignment (usually 16.)

This command binds pipeline, pipeline layout, and push constants internally.

So, users must not expect previously bound targets retain after the sort command.

Users must add proper execution barriers.

One can use buffer memory barrier, but in general, global barriers are more efficient than per-resource, according to official synchronization examples:

... global memory barrier covers all resources. Generally considered more efficient to do a global memory barrier than per-resource barriers, per-resource barriers should usually be used for queue ownership transfers and image layout transitions - otherwise use global barriers.

The sort command will read from key/value buffers (and elementCount buffer for indirect sort) in compute shader stage, and write to output key/value buffers in later compute shader stage.

The second synchronization scope before sort command must include COMPUTE_SHADER stage (and TRANSFER for indirect sort) and SHADER_READ access (and TRANSFER_READ for indirect sort).

The first synchronization scope after sort command must include COMPUTE_SHADER stage and SHADER_WRITE access.
```
VkQueryPool queryPool;  // VK_NULL_HANDLE, or a valid timestamp query pool with size at least 15.

// sort keys
vrdxCmdSort(commandBuffer, sorter, elementCount,
            keysBuffer, 0,
            storageBuffer, 0,
            queryPool, 0);

// sort keys with values
vrdxCmdSortKeyValue(commandBuffer, sorter, elementCount,
                    keysBuffer, 0,
                    valuesBuffer, 0,
                    storageBuffer, 0,
                    queryPool, 0);

// indirectBuffer contains elementCount, a single uint entry in GPU buffer.
// maxElementCount is required for storage buffer offsets.
// element count in the indirect buffer must not be greater than maxElementCount. Otherwise, undefined behavior.
vrdxCmdSortKeyValueIndirect(commandBuffer, sorter, maxElementCount,
                            indirectBuffer, 0,
                            keysBuffer, 0,
                            valuesBuffer, 0,
                            storageBuffer, 0,
                            queryPool, 0);
```

TODO

Use VkPhysicalDeviceLimits to get compute shader-related limits, such as maxComputeWorkGroupSize or maxComputeSharedMemorySize.
Increase allowed maxElementCount by allocating buffers properly.
Compare with CUB Reduce-then-Scan radix sort
Compare with CUB Onesweep radix sort
Compare with VkRadixSort
Compare with Fuchsia radix sort
Find best WORKGROUP_SIZE and PARTITION_DIVISION for different devices.
Support for SubgroupSize=64.

References

https://github.com/b0nes164/GPUSorting : their CUDA kernel codes were very helpful when trying to catch the idea of how the algorithm works.

Troubleshooting

(NVIDIA GPU, Windows) Slow runtime after a few seconds.
- Reason: NVidia driver adjust GPU/Memory clock. Open Performance Overlay (Alt+R), then you will see GPU/Memory Clock gets down.
- Solution: change performance mode in control panel.

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
bench		bench
include		include
media		media
src		src
third_party		third_party
tools		tools
.clang-format		.clang-format
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

vulkan_radix_sort

Recent Changes

Requirements

Build

Test

Test Environment

Benchmark Result

Use as a Library with CMake

Usage

TODO

References

Troubleshooting

About

Uh oh!

Releases

Packages

Languages

License

jaesung-cs/vulkan_radix_sort

Folders and files

Latest commit

History

Repository files navigation

vulkan_radix_sort

Recent Changes

Requirements

Build

Test

Test Environment

Benchmark Result

Use as a Library with CMake

Usage

TODO

References

Troubleshooting

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages