Don't reallocate gpu resources between payload blocks #6462
Conversation
📝 Walkthrough: This set of changes refactors GPU context management in the HNSW index implementation by introducing a new …
Actionable comments posted: 4
🔭 Outside diff range comments (1)
lib/segment/src/index/hnsw_index/gpu/gpu_insert_context.rs (1)

98-106: ⚠️ Potential issue: buffer overflow when more points than `points_capacity` are passed to `init`

`insert_atomics_buffer` and all visited-flags buffers are sized with:

```rust
let points_count = gpu_vector_storage.num_vectors(); // ≈ points_capacity
```

Later, `init(remap)` blindly uploads `remap.len()` elements. If the caller adds extra virtual point IDs that are not yet present in the vector storage (e.g. during batched import), `remap.len()` can exceed `points_capacity`, causing out-of-bounds writes in `GpuVisitedFlags::init → context.copy_gpu_buffer(...)`.

Recommendation:

```diff
 pub fn init(&mut self, remap: &[PointOffsetType]) -> OperationResult<()> {
-    self.gpu_visited_flags.init(remap)?;
+    if remap.len() > self.gpu_visited_flags.capacity() {
+        return Err(OperationError::service_error(
+            "Visited-flags capacity is too small for provided remap",
+        ));
+    }
+    self.gpu_visited_flags.init(remap)?;
```

…and add an assert/early-return inside `GpuVisitedFlags::init` as well.

Also applies to: 134-139
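The capacity-guard pattern recommended above can be sketched in plain Rust. This is a minimal stand-in model, not the actual qdrant types: `VisitedFlags`, its string-based error, and the `u32` element type are all illustrative assumptions.

```rust
// Hypothetical stand-in for a fixed-capacity GPU buffer: `init` refuses
// remaps larger than the capacity instead of writing out of bounds.
struct VisitedFlags {
    capacity: usize,
    flags: Vec<u32>,
}

impl VisitedFlags {
    fn new(points_capacity: usize) -> Self {
        Self {
            capacity: points_capacity,
            flags: vec![0; points_capacity],
        }
    }

    fn capacity(&self) -> usize {
        self.capacity
    }

    /// Fail fast when the caller passes more points than were allocated for.
    fn init(&mut self, remap: &[u32]) -> Result<(), String> {
        if remap.len() > self.capacity {
            return Err("visited-flags capacity is too small for provided remap".to_string());
        }
        self.flags[..remap.len()].copy_from_slice(remap);
        Ok(())
    }
}
```

The check costs one comparison per `init` call and turns a silent out-of-bounds write into a recoverable error.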
🧹 Nitpick comments (7)
lib/segment/src/index/hnsw_index/gpu/gpu_level_builder.rs (1)

150-161: Avoid panicking in test helpers – propagate the error instead

`gpu_search_context.init(...)` is unwrapped here. If the allocation of the remap staging buffer inside `GpuVisitedFlags::init` fails (OOM on CI runners with small GPUs), the whole test binary will abort, making diagnostics harder. Prefer bubbling the `OperationResult` up with `?`, or at least add a helpful `expect()` message.

```diff
-gpu_search_context.init(batched_points.remap()).unwrap();
+gpu_search_context
+    .init(batched_points.remap())
+    .expect("initialising GPU insert context failed");
```

lib/segment/src/index/hnsw_index/gpu/gpu_insert_context.rs (1)
264-270: `init` re-allocates a fresh `gpu::Context` for every call inside `GpuVisitedFlags::init`

`GpuVisitedFlags::init` creates a brand-new `gpu::Context`, which is expensive (command pool, fence, etc.). Since `GpuInsertContext` already owns an operational `Context`, consider passing it down to reuse existing resources:

```rust
pub fn init(&mut self, gpu_ctx: &mut gpu::Context, points_remap: &[PointOffsetType]) -> OperationResult<()>
```

This will cut redundant allocations when `init` is invoked many times across large imports.

lib/segment/src/index/hnsw_index/gpu/gpu_graph_builder.rs (3)
15-15: Prefer `const` over `static` for immutable scalar values

`GPU_MAX_VISITED_FLAGS_FACTOR` is used as a compile-time constant. Declaring it as

```rust
pub const GPU_MAX_VISITED_FLAGS_FACTOR: usize = 32;
```

avoids putting an unnecessary symbol into the data section and prevents accidental mutability through `unsafe { … }`.
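To illustrate the difference: a `const` is inlined at each use site and folded at compile time, while a `static` is a single addressable memory location. A small sketch (the helper function is illustrative, only the constant's name and value come from the review above):

```rust
// `const` is the idiomatic choice for a plain scalar: no symbol in the
// data section, and the value is folded in wherever it is used.
const GPU_MAX_VISITED_FLAGS_FACTOR: usize = 32;

// Hypothetical helper showing the constant used in a size computation.
fn visited_flags_budget(points: usize) -> usize {
    points * GPU_MAX_VISITED_FLAGS_FACTOR
}
```

A `static` would only be warranted if the address of the value mattered (e.g. FFI), which is not the case here.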
19-21: Annotate lifetime `'b` more explicitly (optional)

The extra lifetime on `GpuInsertContext<'b>` is only needed if the context borrows from something with the same lifetime. If the context now owns all internal buffers (post-refactor), you can drop the explicit `'b` to simplify the signature:

```rust
pub fn build_hnsw_on_gpu<'a>(
    gpu_insert_context: &mut GpuInsertContext,
    …
)
```

This is optional but makes call-sites cleaner and eliminates an otherwise unused lifetime parameter.
149-171: Avoid cloning the full `ids` vector on every repeat

`ids.clone()` duplicates a potentially large buffer for each iteration. Instead, pass a slice to `build_hnsw_on_gpu` and let the callee create its own copy only if needed:

```diff
-    build_hnsw_on_gpu(
+    build_hnsw_on_gpu(
         &mut gpu_search_context,
         &test.graph_layers_builder,
         groups_count,
         1,
         cpu_linked_points_count,
-        ids.clone(),
+        &ids,
```

…and change the parameter type in `build_hnsw_on_gpu` to `&[PointOffsetType]`. This cuts memory reallocations in tight benchmarks.
This cuts memory reallocations in tight benchmarks.lib/segment/src/index/hnsw_index/hnsw.rs (2)
700-711: Potential duplication of expensive GPU context creation

`GpuInsertContext::new` is called inside `build_main_graph_on_gpu`, which is itself called from the builder. Because the context can now be reused, consider propagating an existing `GpuInsertContext` instead of creating a fresh one here, to avoid reallocating buffers (~hundreds of MB on large indices).

If lifetime constraints prevent reuse, document why a fresh context is necessary.
764-775: Graceful fallback covers functional errors but not panics

`build_hnsw_on_gpu` errors are converted into a warning and the CPU path is taken, which is great. However, if the GPU code panics (e.g., via `unwrap()` in lower levels), the whole process aborts.

Recommend wrapping the GPU build in `std::panic::catch_unwind` to keep the same graceful-degradation guarantee in release builds.
🔇 Additional comments (3)
lib/segment/src/index/hnsw_index/mod.rs (1)

27-32: Placeholder fits the pattern – looks good

The stub for `GpuInsertContext` mirrors the existing GPU-conditional placeholders and unblocks `no-gpu` builds. No further action required.

lib/segment/src/index/hnsw_index/gpu/gpu_visited_flags.rs (1)
189-218: Danger of integer overflow for large collections

`points_capacity.next_multiple_of(alignment)` is used twice. For very large collections (≥ 2³¹ points), the intermediate multiplication `* std::mem::size_of::<u32>()` below can overflow `usize` on 32-bit targets and silently wrap, leading to undersized buffers.

Add a checked multiplication:

```rust
let byte_len = capacity
    .checked_mul(std::mem::size_of::<u8>())
    .ok_or_else(|| OperationError::service_error("visited-flags size overflow"))?;
```

and propagate the error.
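The checked-sizing pattern is easy to demonstrate in isolation. A minimal sketch, with a string error standing in for `OperationError` and a free function standing in for the in-place computation:

```rust
// Checked sizing: reject capacities whose byte length would wrap,
// instead of silently allocating an undersized buffer.
fn buffer_byte_len(capacity: usize, elem_size: usize) -> Result<usize, String> {
    capacity
        .checked_mul(elem_size)
        .ok_or_else(|| "visited-flags size overflow".to_string())
}
```

`checked_mul` returns `None` on wrap, so the overflow becomes a propagated error rather than a too-small allocation.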
lib/segment/src/index/hnsw_index/gpu/gpu_graph_builder.rs (1)
82-99: `init` is executed unconditionally – verify it is idempotent

`gpu_insert_context.init( … )?` is called on every invocation, even if the same remap is passed repeatedly (see the test that calls the function twice in a loop). Please double-check that `init`:

- Clears or rewrites all previously uploaded data.
- Does not leak GPU resources on re-initialisation.
- Resets internal counters that affect `upload_links` / `download_links`.

If any of these invariants are not true, the second build may produce corrupted links.
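A toy model of the idempotence the checklist above asks for: `init` fully resets state, so two builds in a row see identical starting conditions. All names here are illustrative stand-ins, not the real crate API.

```rust
// Illustrative context: `init` must leave no trace of a previous run.
struct InsertContext {
    remap: Vec<u32>,
    uploaded_links: usize, // counter that a build mutates
}

impl InsertContext {
    fn new() -> Self {
        Self { remap: Vec::new(), uploaded_links: 0 }
    }

    /// Idempotent re-initialisation: overwrite the remap wholesale and
    /// reset every counter a later `upload_links` call depends on.
    fn init(&mut self, remap: &[u32]) {
        self.remap.clear();
        self.remap.extend_from_slice(remap);
        self.uploaded_links = 0;
    }

    fn upload_links(&mut self, count: usize) {
        self.uploaded_links += count;
    }
}

// Two builds with an intervening `init` should observe the same counters.
fn links_after_two_builds(remap: &[u32]) -> (usize, usize) {
    let mut ctx = InsertContext::new();
    ctx.init(remap);
    ctx.upload_links(5);
    let first = ctx.uploaded_links;
    ctx.init(remap); // second build starts from a clean slate
    ctx.upload_links(5);
    (first, ctx.uploaded_links)
}
```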
I've only been able to skim this just yet, sorry if my review remarks seem a bit hasty 😃
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/segment/src/index/hnsw_index/hnsw.rs (1)

701-714: Avoid re-allocating a second `GpuInsertContext` for the main graph

A brand-new `GpuInsertContext` is constructed for the main graph here, while another one is created (and reused) for payload blocks later. That's an extra set of large GPU buffers (~gigabytes on big indexes) that immediately gets dropped after the main build. Reusing the same context throughout the whole `build()` run keeps allocations O(1). One possible approach:

```diff
 // (1) create context once, before main-graph build
-let gpu_vectors = Self::create_gpu_vectors(...)?;
-let mut gpu_insert_context = gpu_vectors.as_ref().map(|v| {
-    GpuInsertContext::new(
-        v,
-        get_gpu_groups_count(),
-        config.m, // main-graph params
-        config.m0,
-        config.ef_construct,
-        false,
-        1..=GPU_MAX_VISITED_FLAGS_FACTOR,
-    )
-}).transpose()?;
+let mut gpu_insert_context = gpu_vectors.as_ref().map(|v| {
+    GpuInsertContext::new(
+        v,
+        get_gpu_groups_count(),
+        config.m.max(payload_m), // max of main/payload
+        config.m0.max(config.payload_m0.unwrap_or(config.m0)),
+        config.ef_construct,
+        false,
+        1..=GPU_MAX_VISITED_FLAGS_FACTOR,
+    )
+}).transpose()?;

 // (2) pass `&mut gpu_insert_context` to BOTH main and payload builders
 Self::build_main_graph_on_gpu(..., gpu_insert_context.as_mut(), ...)?;
 ...
 Self::build_filtered_graph(..., &mut gpu_insert_context, ...)?;
```

This reduces memory pressure and honours the PR goal of not reallocating GPU resources between phases.
🔇 Additional comments (2)
lib/segment/src/index/hnsw_index/hnsw.rs (2)

449-451: VisitedPool is now reused – great performance win

Creating `VisitedPool` only once outside the payload-block loop removes thousands of tiny allocations on large payload sets. 🎉
452-467: Double-check that `GpuInsertContext` gets `init()`-ialised per block

`GpuInsertContext::new(..)` now returns an un-initialised context (per the refactor that added an explicit `init()` call with the remap slice). If `build_graph_on_gpu` relies on the context being initialised, we must ensure `.init(remap)` is invoked before every call; otherwise the builder may read uninitialised GPU buffers and produce corrupted links.

Please verify that one of:

- `build_graph_on_gpu` internally calls `gpu_insert_context.init(..)` each time, or
- the caller performs that call right before handing the context over.

If neither is true, add the init call similar to:

```diff
 // before first use inside build_filtered_graph(...)
-if let Some(ctx) = gpu_insert_context.as_mut() {
-    // ctx is un-initialised here
-}
+if let Some(ctx) = gpu_insert_context.as_mut() {
+    ctx.init(&points_to_index); // map old→new indices
+}
```
Please fix linter
done, GPU CI is also green: https://github.com/qdrant/qdrant/actions/runs/14833216062/job/41638808735
* dont reallocate gpu resources between payload blocks
* are you happy codespell
* fix review remarks
* fix gpu tests
* are you happy clippy
This PR fixes slow HNSW construction on GPU in case of many payload blocks.

GPU indexing already has a common structure `GpuInsertContext` with all allocated GPU resources inside. This PR uses this `GpuInsertContext` as a parameter of indexing (previously).

To reuse `GpuInsertContext` between indexing runs, the structure has an `init` method to clear all buffers.

Test `test_gpu_hnsw_equivalency` checks reusing `GpuInsertContext` between indexing runs.

Measurements on the H&M dataset:

Dev:

Branch: