Skip to content

Conversation

xzfc
Copy link
Contributor

@xzfc xzfc commented Feb 21, 2025

This fixes #5834 for NVIDIA cards.

How to reproduce

The steps from #5834 works, but the following command catches the issue faster:

bfb -n 2000000 -d 4 --max-segment-size 200000 --indexing-threshold 1 --hnsw-m 4 --hnsw-ef-construct 8

Qdrant output:

WARN segment::index::hnsw_index::gpu::gpu_links: Incorrect links on level 2 were found. Amount of incorrect links: 1, zeroes: 1    
WARN segment::index::hnsw_index::gpu::gpu_links: Incorrect links on level 3 were found. Amount of incorrect links: 1, zeroes: 1    
WARN segment::index::hnsw_index::gpu::gpu_links: Incorrect links on level 3 were found. Amount of incorrect links: 1, zeroes: 1    
WARN segment::index::hnsw_index::gpu::gpu_links: Incorrect links on level 1 were found. Amount of incorrect links: 1, zeroes: 1  

The issue

Code in question:

for (uint i = 0; i < count; i++) {
uint other_id = GET_LINK(request.id, i);
bool is_locked = false;
if (gl_SubgroupInvocationID == 0) {
uint other_atomic = atomicExchange(atomics.data[other_id], 1);
is_locked = other_atomic == 1;
}
if (subgroupAny(is_locked)) {
// point is already being processed by another subgroup
continue;
}
uint other_links_count = LINKS_COUNT(other_id);
if (other_links_count < LEVEL_M) {
if (subgroupElect()) {
GET_LINK(other_id, other_links_count) = request.id;
LINKS_SET_SIZE(other_id, other_links_count + 1);
}
} else {
set_target(other_id);
nearest_count = 0;
for (uint j = 0; j < other_links_count; j++) {
POINT_ID link = GET_LINK(other_id, j);
shared_buffer[NEAREST_HEAP_OFFSET + nearest_count] = ScoredPoint(
link,
similarity(link)
);
nearest_count++;
}
shared_buffer[NEAREST_HEAP_OFFSET + nearest_count] = ScoredPoint(
request.id,
similarity(request.id)
);
nearest_count++;
subgroupMemoryBarrierShared();
sort(NEAREST_HEAP_OFFSET, nearest_count);
subgroupMemoryBarrierShared();
uint other_new_links_count = run_heuristic();
subgroupMemoryBarrierShared();
for (uint j = gl_SubgroupInvocationID; j < other_new_links_count; j += SUBGROUP_SIZE) {
GET_LINK(other_id, j) = shared_buffer[NEAREST_HEAP_OFFSET + j].id;
}
if (subgroupElect()) {
LINKS_SET_SIZE(other_id, other_new_links_count);
}
}
subgroupMemoryBarrier();
if (gl_SubgroupInvocationID == 0) {
atomicExchange(atomics.data[other_id], 0);
}

In a rough pseudo-code the algorithm is the following:

# 1. Lock
if mutexes[other_id].try_lock() == already_locked:
    continue

# 2. Read links count
count = links[other_id].count
for i in 0..count:
    shared_buffer[i] = links[other_id].links[i]

# 3. Update links
new_count = update(shared_buffer)

# 4. Write link values
for i in 0..new_count:
    links[other_id].links[i] = shared_buffer[i]

# 5. Write links count
links[other_id].count = new_count

# 6. Unlock
mutexes[other_id].unlock()

It appears to be correct: it uses proper locking (steps 1 and 6); it writes the count (step 5) only after writing the values (step 4). However, other threads could see the order differently. In the step 2, it could be possible for them to read the new updated count, but old links (which could contain zero-initialized values). What happens here is an "incoherent memory accesses". To make it coherent, we just need add a coherent GLSL qualifier.

And actually, this issue happen far more frequently than reported; but in most cases these zero values are filtered out in the step 3.

Copy link
Contributor

@IvanPleshkov IvanPleshkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great finding! I also propose to add this qualifier into all shared buffers which are writable.
There are:

layout(set = VISITED_FLAGS_LAYOUT_SET, binding = 1)
buffer VisitedFlagsBuffer {
    uint8_t data[];
} visited_flags;

layout(set = 0, binding = 1) buffer NewEntries {
    uint data[];
} new_entries;

layout(set = 0, binding = 2) buffer Atomics {
    uint data[];
} atomics;

layout(set = 0, binding = 1) buffer SearchResults {
    uint data[];
} search_results;

Copy link
Member

@timvisee timvisee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Approved based on my understanding from the description, and on what you explained before during one of our dailies.

@xzfc
Copy link
Contributor Author

xzfc commented Feb 22, 2025

I also propose to add this qualifier into all shared buffers which are writable.

@IvanPleshkov Update: I've added an explicit qualifier (coherent/readonly/writeonly) to every buffer.

@xzfc xzfc merged commit 2cc7a4d into dev Feb 24, 2025
20 checks passed
@xzfc xzfc deleted the gpu-coherent branch February 24, 2025 10:11
@timvisee timvisee mentioned this pull request Mar 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants