gpu: add coherent qualifier #6037

xzfc · 2025-02-21T12:42:52Z

This fixes #5834 for NVIDIA cards.

How to reproduce

The steps from #5834 works, but the following command catches the issue faster:

bfb -n 2000000 -d 4 --max-segment-size 200000 --indexing-threshold 1 --hnsw-m 4 --hnsw-ef-construct 8

Qdrant output:

WARN segment::index::hnsw_index::gpu::gpu_links: Incorrect links on level 2 were found. Amount of incorrect links: 1, zeroes: 1    
WARN segment::index::hnsw_index::gpu::gpu_links: Incorrect links on level 3 were found. Amount of incorrect links: 1, zeroes: 1    
WARN segment::index::hnsw_index::gpu::gpu_links: Incorrect links on level 3 were found. Amount of incorrect links: 1, zeroes: 1    
WARN segment::index::hnsw_index::gpu::gpu_links: Incorrect links on level 1 were found. Amount of incorrect links: 1, zeroes: 1

The issue

Code in question:

qdrant/lib/segment/src/index/hnsw_index/gpu/shaders/run_insert_vector.comp

Lines 68 to 123 in b50a702

    
           for (uint i = 0; i < count; i++) { 
        
               uint other_id = GET_LINK(request.id, i); 
        
               bool is_locked = false; 
        
               if (gl_SubgroupInvocationID == 0) { 
        
                   uint other_atomic = atomicExchange(atomics.data[other_id], 1); 
        
                   is_locked = other_atomic == 1; 
        
               } 
        
               if (subgroupAny(is_locked)) { 
        
                   // point is already being processed by another subgroup 
        
                   continue; 
        
               } 
        
               uint other_links_count = LINKS_COUNT(other_id); 
        
               if (other_links_count < LEVEL_M) { 
        
                   if (subgroupElect()) { 
        
                       GET_LINK(other_id, other_links_count) = request.id; 
        
                       LINKS_SET_SIZE(other_id, other_links_count + 1); 
        
                   } 
        
               } else { 
        
                   set_target(other_id); 
        
                   nearest_count = 0; 
        
                   for (uint j = 0; j < other_links_count; j++) { 
        
                       POINT_ID link = GET_LINK(other_id, j); 
        
                       shared_buffer[NEAREST_HEAP_OFFSET + nearest_count] = ScoredPoint( 
        
                           link, 
        
                           similarity(link) 
        
                       ); 
        
                       nearest_count++; 
        
                   } 
        
                   shared_buffer[NEAREST_HEAP_OFFSET + nearest_count] = ScoredPoint( 
        
                       request.id, 
        
                       similarity(request.id) 
        
                   ); 
        
                   nearest_count++; 
        
                   subgroupMemoryBarrierShared(); 
        
                   sort(NEAREST_HEAP_OFFSET, nearest_count); 
        
                   subgroupMemoryBarrierShared(); 
        
                   uint other_new_links_count = run_heuristic(); 
        
                   subgroupMemoryBarrierShared(); 
        
                   for (uint j = gl_SubgroupInvocationID; j < other_new_links_count; j += SUBGROUP_SIZE) { 
        
                       GET_LINK(other_id, j) = shared_buffer[NEAREST_HEAP_OFFSET + j].id; 
        
                   } 
        
                   if (subgroupElect()) { 
        
                       LINKS_SET_SIZE(other_id, other_new_links_count); 
        
                   } 
        
               } 
        
               subgroupMemoryBarrier(); 
        
               if (gl_SubgroupInvocationID == 0) { 
        
                   atomicExchange(atomics.data[other_id], 0); 
        
               }

In a rough pseudo-code the algorithm is the following:

# 1. Lock
if mutexes[other_id].try_lock() == already_locked:
    continue

# 2. Read links count
count = links[other_id].count
for i in 0..count:
    shared_buffer[i] = links[other_id].links[i]

# 3. Update links
new_count = update(shared_buffer)

# 4. Write link values
for i in 0..new_count:
    links[other_id].links[i] = shared_buffer[i]

# 5. Write links count
links[other_id].count = new_count

# 6. Unlock
mutexes[other_id].unlock()

It appears to be correct: it uses proper locking (steps 1 and 6); it writes the count (step 5) only after writing the values (step 4). However, other threads could see the order differently. In the step 2, it could be possible for them to read the new updated count, but old links (which could contain zero-initialized values). What happens here is an "incoherent memory accesses". To make it coherent, we just need add a coherent GLSL qualifier.

And actually, this issue happen far more frequently than reported; but in most cases these zero values are filtered out in the step 3.

IvanPleshkov

Great finding! I also propose to add this qualifier into all shared buffers which are writable.
There are:

layout(set = VISITED_FLAGS_LAYOUT_SET, binding = 1)
buffer VisitedFlagsBuffer {
    uint8_t data[];
} visited_flags;

layout(set = 0, binding = 1) buffer NewEntries {
    uint data[];
} new_entries;

layout(set = 0, binding = 2) buffer Atomics {
    uint data[];
} atomics;

layout(set = 0, binding = 1) buffer SearchResults {
    uint data[];
} search_results;

timvisee

Nice!

Approved based on my understanding from the description, and on what you explained before during one of our dailies.

xzfc · 2025-02-22T02:18:42Z

I also propose to add this qualifier into all shared buffers which are writable.

@IvanPleshkov Update: I've added an explicit qualifier (coherent/readonly/writeonly) to every buffer.

xzfc requested review from timvisee, generall and IvanPleshkov February 21, 2025 12:42

github-actions bot mentioned this pull request Feb 21, 2025

Flaky test tests::snapshot_test::test_snapshot_collection_listener #5218

Open

IvanPleshkov approved these changes Feb 21, 2025

View reviewed changes

timvisee approved these changes Feb 21, 2025

View reviewed changes

gpu: add coherent/readonly/writeonly qualifiers

9104b7f

xzfc force-pushed the gpu-coherent branch from bdf942b to 9104b7f Compare February 22, 2025 02:11

xzfc requested a review from IvanPleshkov February 22, 2025 02:18

github-actions bot mentioned this pull request Feb 22, 2025

Flaky test hnsw_discover_test::hnsw_discover_precision #2973

Open

IvanPleshkov approved these changes Feb 24, 2025

View reviewed changes

xzfc merged commit 2cc7a4d into dev Feb 24, 2025
20 checks passed

xzfc deleted the gpu-coherent branch February 24, 2025 10:11

timvisee pushed a commit that referenced this pull request Mar 21, 2025

gpu: add coherent/readonly/writeonly qualifiers (#6037)

af664aa

timvisee mentioned this pull request Mar 21, 2025

Bump version to 1.13.5 #6223

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

gpu: add coherent qualifier #6037

gpu: add coherent qualifier #6037

Uh oh!

xzfc commented Feb 21, 2025

Uh oh!

IvanPleshkov left a comment

Uh oh!

timvisee left a comment

Uh oh!

xzfc commented Feb 22, 2025

Uh oh!

Uh oh!

Uh oh!

	for (uint i = 0; i < count; i++) {
	uint other_id = GET_LINK(request.id, i);

	bool is_locked = false;
	if (gl_SubgroupInvocationID == 0) {
	uint other_atomic = atomicExchange(atomics.data[other_id], 1);
	is_locked = other_atomic == 1;
	}
	if (subgroupAny(is_locked)) {
	// point is already being processed by another subgroup
	continue;
	}

	uint other_links_count = LINKS_COUNT(other_id);
	if (other_links_count < LEVEL_M) {
	if (subgroupElect()) {
	GET_LINK(other_id, other_links_count) = request.id;
	LINKS_SET_SIZE(other_id, other_links_count + 1);
	}
	} else {
	set_target(other_id);
	nearest_count = 0;
	for (uint j = 0; j < other_links_count; j++) {
	POINT_ID link = GET_LINK(other_id, j);
	shared_buffer[NEAREST_HEAP_OFFSET + nearest_count] = ScoredPoint(
	link,
	similarity(link)
	);
	nearest_count++;
	}
	shared_buffer[NEAREST_HEAP_OFFSET + nearest_count] = ScoredPoint(
	request.id,
	similarity(request.id)
	);
	nearest_count++;

	subgroupMemoryBarrierShared();
	sort(NEAREST_HEAP_OFFSET, nearest_count);

	subgroupMemoryBarrierShared();
	uint other_new_links_count = run_heuristic();

	subgroupMemoryBarrierShared();
	for (uint j = gl_SubgroupInvocationID; j < other_new_links_count; j += SUBGROUP_SIZE) {
	GET_LINK(other_id, j) = shared_buffer[NEAREST_HEAP_OFFSET + j].id;
	}
	if (subgroupElect()) {
	LINKS_SET_SIZE(other_id, other_new_links_count);
	}
	}

	subgroupMemoryBarrier();

	if (gl_SubgroupInvocationID == 0) {
	atomicExchange(atomics.data[other_id], 0);
	}

gpu: add coherent qualifier #6037

gpu: add coherent qualifier #6037

Uh oh!

Conversation

xzfc commented Feb 21, 2025

How to reproduce

The issue

Uh oh!

IvanPleshkov left a comment

Choose a reason for hiding this comment

Uh oh!

timvisee left a comment

Choose a reason for hiding this comment

Uh oh!

xzfc commented Feb 22, 2025

Uh oh!

Uh oh!

Uh oh!