@IvanPleshkov IvanPleshkov commented Apr 29, 2025

This PR fixes slow HNSW construction on GPU when there are many payload blocks.

GPU indexing already has a common structure, GpuInsertContext, which holds all allocated GPU resources. This PR passes GpuInsertContext as a parameter to the indexing routine (previously a new context was created internally for each build).

To reuse GpuInsertContext between indexing runs, the structure provides an init method that clears all buffers.

The test test_gpu_hnsw_equivalency verifies that reusing GpuInsertContext between indexing runs produces correct results.
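The allocate-once / init-per-block pattern can be sketched with a hypothetical, CPU-only mock (MockInsertContext and its fields are stand-ins for illustration, not the real struct, which manages Vulkan buffers):

```rust
// Hypothetical CPU-only mock of the allocate-once / init-per-block pattern.
// A Vec stands in for the allocated GPU buffers; the point is that `new` is
// called once while `init` is called cheaply before each payload block.
struct MockInsertContext {
    buffer: Vec<u32>, // stands in for the allocated GPU buffers
}

impl MockInsertContext {
    fn new(points_capacity: usize) -> Self {
        // Expensive allocation happens exactly once.
        Self { buffer: vec![0; points_capacity] }
    }

    fn init(&mut self, remap: &[u32]) {
        // Cheap per-block reset: clear old contents, upload the new remap.
        self.buffer.fill(0);
        self.buffer[..remap.len()].copy_from_slice(remap);
    }
}

fn main() {
    let mut ctx = MockInsertContext::new(4);
    ctx.init(&[1, 2]); // first payload block
    assert_eq!(ctx.buffer, vec![1, 2, 0, 0]);
    ctx.init(&[7]); // second block reuses the same allocation
    assert_eq!(ctx.buffer, vec![7, 0, 0, 0]);
    println!("context reused across blocks");
}
```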

Measurements on the H&M dataset:
Dev:

Upload time: 35.9856106
Total import time: 418.8459693

Branch:

Upload time: 38.7880234
Total import time: 225.5910519

@IvanPleshkov IvanPleshkov marked this pull request as ready for review April 30, 2025 09:42
@IvanPleshkov IvanPleshkov requested review from timvisee and xzfc April 30, 2025 09:42

coderabbitai bot commented Apr 30, 2025

📝 Walkthrough

This set of changes refactors GPU context management in the HNSW index implementation by introducing a new GpuInsertContext abstraction that encapsulates the GPU resources required for graph construction and insertion:

  • build_hnsw_on_gpu now accepts a mutable reference to an existing GpuInsertContext instead of creating one internally, and the exact parameter was removed.
  • GPU resource constructors now accept a points capacity rather than remap slices; initialization that depends on remap data moved to a new explicit init method.
  • GpuVisitedFlags was extended with device ownership and staging buffers, separating buffer creation from remap data uploading.
  • Test helpers were updated to reuse a single GpuInsertContext across multiple build repetitions.
  • The main HNSW index methods now accept mutable references to GpuInsertContext instead of raw GPU vector storage references.
  • A placeholder GpuInsertContext module was added for non-GPU builds.

These changes improve encapsulation, separate resource allocation from initialization, and support repeated GPU graph builds with a single context.

Suggested reviewers

  • xzfc
  • timvisee

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7f33a4 and aac8a91.

📒 Files selected for processing (2)
  • lib/segment/src/index/hnsw_index/gpu/gpu_graph_builder.rs (8 hunks)
  • lib/segment/src/index/hnsw_index/gpu/gpu_visited_flags.rs (8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: rust-gpu-tests (ubuntu-latest)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consistency
  • GitHub Check: test-low-resources
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
🔇 Additional comments (20)
lib/segment/src/index/hnsw_index/gpu/gpu_graph_builder.rs (10)

15-15: Confirming the use of GPU_MAX_VISITED_FLAGS_FACTOR as a static variable.

The switch to using a static variable over a const for GPU_MAX_VISITED_FLAGS_FACTOR is appropriate here. The value of 32 for the maximum visited flags factor allows for flexible GPU memory usage optimization.


19-20: Good refactoring to reuse GPU context between calls.

The function signature change to accept a mutable reference to an existing GpuInsertContext instead of creating one internally aligns perfectly with the PR objective of avoiding GPU resource reallocation between payload blocks.


82-82: Proper initialization pattern with dedicated init method.

Calling init on the context with the batched points remap data nicely separates resource allocation from initialization, which is a cleaner design pattern and supports the reusability goal.


88-97: Consistent GPU context usage throughout the function.

The code now consistently uses the passed-in GPU context for uploading links, building levels, and downloading links, ensuring the same allocated GPU resources are reused throughout the entire operation.


99-99: Added performance measurement logging.

The addition of log_measurements() provides valuable performance insights, which is particularly important for this performance-focused PR.


120-144: Updated test helper to support multiple builds with the same context.

The build_gpu_graph test helper now accepts a repeats parameter and reuses the same GpuInsertContext across multiple builds, which is consistent with the core functionality changes. The range 1..=GPU_MAX_VISITED_FLAGS_FACTOR is now correctly inclusive.


147-170: Improved test structure for multiple repetitions.

The refactored test helper now returns a vector of GraphLayersBuilder instances by mapping over the repetition count and building the graph for each iteration, which provides a cleaner way to test the reusability of the GPU context.


188-192: Test updated to verify context reuse correctness.

The test_gpu_hnsw_equivalency test now builds the graph twice with the same context and compares both results with the reference builder, effectively verifying that reusing the context produces consistent and correct results.


213-217: Test updated to handle the new return type.

The quality tests correctly handle the vector return type by extracting the first result, maintaining compatibility with the existing test assertions.


238-242: Consistently updated all tests.

All tests have been updated to handle the new patterns consistently, ensuring proper test coverage of the refactored functionality.

lib/segment/src/index/hnsw_index/gpu/gpu_visited_flags.rs (10)

11-12: Good choice for chunked upload buffer size.

The constant UPLOAD_REMAP_BUFFER_COUNT with a value of 1M elements provides a good balance between memory usage and upload efficiency for large remap data.


19-31: Improved resource ownership with explicit device reference.

Adding the device: Arc<gpu::Device> field to the struct ensures clear ownership and lifecycle management of GPU resources, which is important for resource reuse. The change from a boolean remap flag to an optional remap_buffer and the addition of an optional remap_staging_buffer better express the intent and lifecycle of these resources.
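A hypothetical, simplified shape of the struct after this change (field names follow the review text; Device and Buffer here are stand-ins for the real gpu crate types):

```rust
use std::sync::Arc;

// Stand-in types; the real ones live in the `gpu` crate.
struct Device;
struct Buffer;

// Hypothetical simplified shape of GpuVisitedFlags after the refactor:
// explicit device ownership plus optional remap buffers whose presence
// (rather than a separate boolean flag) encodes whether remapping is used.
struct GpuVisitedFlags {
    device: Arc<Device>,                  // owns the device for its lifetime
    visited_flags_buffer: Buffer,         // always allocated at construction
    remap_buffer: Option<Buffer>,         // Some(..) only when remap is used
    remap_staging_buffer: Option<Buffer>, // small CPU-visible upload window
}

fn main() {
    let flags = GpuVisitedFlags {
        device: Arc::new(Device),
        visited_flags_buffer: Buffer,
        remap_buffer: None,
        remap_staging_buffer: None,
    };
    // An is_some() check on remap_buffer replaces the old boolean flag.
    assert!(flags.remap_buffer.is_none());
}
```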


47-49: Clean check for optional remap buffer.

The conditional check now uses is_some() on the remap_buffer field rather than a boolean flag, making the code more self-documenting.


55-60: Improved constructor signature for better resource management.

The constructor now takes points_capacity and an inclusive factor_range instead of a remap slice and exclusive range, which better separates resource allocation from data initialization and aligns with the PR's goal of resource reuse.


74-75: Updated buffer creation to use capacity-based parameters.

The buffer creation now uses points_capacity and inclusive factor_range, which is consistent with the constructor changes and better supports resource allocation separate from initialization.


108-118: Efficient staging buffer allocation pattern.

Creating the remap staging buffer conditionally only when needed and sizing it based on the chunk size rather than the full remap data is an efficient approach.


133-169: Well-designed initialization method with chunked uploads.

The new init method properly implements:

  1. Chunked uploading of remap data to avoid GPU memory pressure
  2. Buffer clearing and parameter resetting
  3. Proper error handling with meaningful messages

This separation of initialization from construction supports the PR's goal of resource reuse between calls.
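The chunked-upload idea can be illustrated with a CPU-only sketch (UPLOAD_CHUNK and upload_remap are hypothetical names, not the crate's actual identifiers; a Vec stands in for the GPU-side buffer):

```rust
// Hypothetical sketch of chunked remap upload. In the real code each chunk
// passes through a small staging buffer into the GPU remap buffer; here a
// Vec stands in for the GPU buffer so the chunking logic itself is visible.
const UPLOAD_CHUNK: usize = 1024 * 1024; // elements per upload, as in the PR

fn upload_remap(remap: &[u32], gpu_buffer: &mut Vec<u32>) {
    gpu_buffer.clear();
    for chunk in remap.chunks(UPLOAD_CHUNK) {
        // One staged copy per chunk keeps the staging buffer small
        // regardless of how large the remap slice is.
        gpu_buffer.extend_from_slice(chunk);
    }
}

fn main() {
    let remap: Vec<u32> = (0..3_000_000).collect();
    let mut gpu_buffer = Vec::new();
    upload_remap(&remap, &mut gpu_buffer); // three chunks for 3M elements
    assert_eq!(gpu_buffer, remap);
    println!("uploaded {} elements in chunks", gpu_buffer.len());
}
```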


205-206: Consistent parameter updates in helper methods.

The create_flags_buffer method has been updated to use the same parameter types as the constructor, maintaining consistency throughout the codebase.


209-212: Proper alignment and size calculations.

Using next_multiple_of for alignment and properly calculating buffer sizes based on capacity ensures correct memory allocation and alignment requirements are met.
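For reference, next_multiple_of rounds a count up to the nearest alignment boundary:

```rust
fn main() {
    // A buffer for 1000 points with 256-element alignment gets 1024 slots.
    let points_capacity: usize = 1000;
    let aligned = points_capacity.next_multiple_of(256);
    assert_eq!(aligned, 1024);
    // Values already on the boundary are unchanged.
    assert_eq!(1024usize.next_multiple_of(256), 1024);
}
```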


233-234: Fixed range iteration with inclusive end.

The change to while factor <= *factor_range.end() ensures the upper bound is included in the iteration, correctly reflecting the inclusive range parameter.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🔭 Outside diff range comments (1)
lib/segment/src/index/hnsw_index/gpu/gpu_insert_context.rs (1)

98-106: ⚠️ Potential issue

Potential buffer-overflow when more points than points_capacity are passed to init

insert_atomics_buffer and all visited-flags buffers are sized with:

let points_count = gpu_vector_storage.num_vectors();          // ≈ points_capacity

Later, init(remap) blindly uploads remap.len() elements.
If the caller adds extra virtual point IDs that are not yet present in the vector storage (e.g. during batched import) remap.len() can exceed points_capacity, causing out-of-bounds writes in:

GpuVisitedFlags::init → context.copy_gpu_buffer(...)

Recommendation:

 pub fn init(&mut self, remap: &[PointOffsetType]) -> OperationResult<()> {
-    self.gpu_visited_flags.init(remap)?;
+    if remap.len() > self.gpu_visited_flags.capacity() {
+        return Err(OperationError::service_error("Visited-flags capacity is too small for provided remap"));
+    }
+    self.gpu_visited_flags.init(remap)?;

…and add an assert/early-return inside GpuVisitedFlags::init as well.

Also applies to: 134-139

🧹 Nitpick comments (7)
lib/segment/src/index/hnsw_index/gpu/gpu_level_builder.rs (1)

150-161: Avoid panicking in test helpers – propagate the error instead

gpu_search_context.init(...) is unwrapped here. If the allocation of the remap staging buffer inside GpuVisitedFlags::init fails (OOM on CI runners with small GPUs) the whole test binary will abort, making diagnostics harder.
Prefer bubbling the OperationResult up with ?, or at least add a helpful expect() message.

-gpu_search_context.init(batched_points.remap()).unwrap();
+gpu_search_context
+    .init(batched_points.remap())
+    .expect("initialising GPU insert context failed");
lib/segment/src/index/hnsw_index/gpu/gpu_insert_context.rs (1)

264-270: init re-allocates a fresh gpu::Context for every call inside GpuVisitedFlags::init

GpuVisitedFlags::init creates a brand-new gpu::Context, which is expensive (command-pool, fence, etc.).
Since GpuInsertContext already owns an operational Context, consider passing it down to reuse existing resources:

pub fn init(&mut self,
            gpu_ctx: &mut gpu::Context,
            points_remap: &[PointOffsetType]) -> OperationResult<()>

This will cut redundant allocations when init is invoked many times across large imports.

lib/segment/src/index/hnsw_index/gpu/gpu_graph_builder.rs (3)

15-15: Prefer const over static for immutable scalar values

GPU_MAX_VISITED_FLAGS_FACTOR is used as a compile-time constant. Declaring it as

pub const GPU_MAX_VISITED_FLAGS_FACTOR: usize = 32;

avoids putting an unnecessary symbol into the data section and prevents accidental mutability through unsafe { … }.


19-21: Annotate lifetime 'b more explicitly (optional)

The extra lifetime on GpuInsertContext<'b> is only needed if the context borrows from something with the same lifetime.
If the context now owns all internal buffers (post-refactor) you can drop the explicit 'b to simplify the signature:

pub fn build_hnsw_on_gpu(
    gpu_insert_context: &mut GpuInsertContext,
    /* … */
)

This is optional but makes call-sites cleaner and eliminates an otherwise unused lifetime parameter.


149-171: Avoid cloning the full ids vector on every repeat

ids.clone() duplicates a potentially large buffer for each iteration.
Instead, pass a slice to build_hnsw_on_gpu and let the callee create its own copy only if needed:

-                build_hnsw_on_gpu(
+                build_hnsw_on_gpu(
                     &mut gpu_search_context,
                     &test.graph_layers_builder,
                     groups_count,
                     1,
                     cpu_linked_points_count,
-                    ids.clone(),
+                    &ids,

and change the parameter type in build_hnsw_on_gpu to &[PointOffsetType].
This cuts memory reallocations in tight benchmarks.

lib/segment/src/index/hnsw_index/hnsw.rs (2)

700-711: Potential duplication of expensive GPU context creation

GpuInsertContext::new is called inside build_main_graph_on_gpu, which is itself called from the builder.
Because the context can now be reused, consider propagating an existing GpuInsertContext instead of creating a fresh one here to avoid reallocating buffers (~hundreds of MB on large indices).

If life-time constraints prevent reuse, document why a fresh context is necessary.


764-775: Graceful fallback covers functional errors but not panics

build_hnsw_on_gpu errors are converted into a warning and the CPU path is taken, which is great.
However, if the GPU code panics (e.g., via unwrap() in lower levels) the whole process aborts.

Recommend wrapping the GPU build in std::panic::catch_unwind to keep the same graceful-degradation guarantee in release builds.
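The suggested fallback could look like this minimal sketch (the GPU build is simulated here by a deliberate panic):

```rust
use std::panic;

// Minimal sketch of the suggested graceful degradation: trap a panic from
// the (here simulated) GPU build and fall back to the CPU path.
fn main() {
    let gpu_result = panic::catch_unwind(|| {
        // Stand-in for the GPU build; a real panic could come from an
        // unwrap() deep inside the GPU code path.
        panic!("simulated GPU failure");
    });

    if gpu_result.is_err() {
        // Same degradation path as for functional errors: log and use CPU.
        eprintln!("GPU build panicked, falling back to CPU build");
    }
}
```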

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d74d5e3 and e097d7c.

📒 Files selected for processing (6)
  • lib/segment/src/index/hnsw_index/gpu/gpu_graph_builder.rs (9 hunks)
  • lib/segment/src/index/hnsw_index/gpu/gpu_insert_context.rs (5 hunks)
  • lib/segment/src/index/hnsw_index/gpu/gpu_level_builder.rs (1 hunks)
  • lib/segment/src/index/hnsw_index/gpu/gpu_visited_flags.rs (7 hunks)
  • lib/segment/src/index/hnsw_index/hnsw.rs (8 hunks)
  • lib/segment/src/index/hnsw_index/mod.rs (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
lib/segment/src/index/hnsw_index/gpu/gpu_visited_flags.rs (4)
lib/segment/src/index/hnsw_index/gpu/gpu_vector_storage/mod.rs (11)
  • Self (266-266)
  • Self (290-290)
  • Self (314-314)
  • Self (433-433)
  • Self (461-461)
  • Self (484-484)
  • Self (505-505)
  • Self (537-537)
  • Self (564-564)
  • device (794-796)
  • new (132-150)
lib/segment/src/index/hnsw_index/gpu/gpu_insert_context.rs (14)
  • init (264-270)
  • new (99-178)
  • new (183-262)
  • std (112-112)
  • std (125-125)
  • std (138-138)
  • std (278-278)
  • std (321-321)
  • std (388-388)
  • std (615-615)
  • std (731-731)
  • std (852-852)
  • std (859-859)
  • std (883-883)
lib/gpu/src/context.rs (1)
  • new (39-87)
lib/gpu/src/buffer.rs (1)
  • new (53-158)
lib/segment/src/index/hnsw_index/hnsw.rs (4)
lib/segment/src/index/hnsw_index/gpu/gpu_insert_context.rs (2)
  • new (99-178)
  • new (183-262)
lib/segment/src/index/visited_pool.rs (3)
  • new (34-39)
  • new (50-55)
  • new (102-106)
lib/segment/src/index/hnsw_index/gpu/mod.rs (1)
  • get_gpu_groups_count (55-57)
lib/segment/src/index/hnsw_index/gpu/gpu_graph_builder.rs (1)
  • build_hnsw_on_gpu (19-102)
🔇 Additional comments (3)
lib/segment/src/index/hnsw_index/mod.rs (1)

27-32: Placeholder fits the pattern – looks good

The stub for GpuInsertContext mirrors the existing GPU-conditional placeholders and unblocks no-gpu builds. No further action required.

lib/segment/src/index/hnsw_index/gpu/gpu_visited_flags.rs (1)

189-218: Danger of integer overflow for large collections

points_capacity.next_multiple_of(alignment) is used twice.
For very large collections (≥ 2³¹ points) the intermediate multiplication * std::mem::size_of::<u32>() below can overflow usize on 32-bit targets and silently wrap, leading to undersized buffers.

Add a checked multiplication:

let byte_len = capacity
    .checked_mul(std::mem::size_of::<u32>())
    .ok_or_else(|| OperationError::service_error("visited-flags size overflow"))?;

and propagate the error.

lib/segment/src/index/hnsw_index/gpu/gpu_graph_builder.rs (1)

82-99: init is executed unconditionally – verify it is idempotent

gpu_insert_context.init( … )? is called on every invocation even if the same remap is passed repeatedly (see the test that calls the function twice in a loop).
Please double-check that init:

  1. Clears or rewrites all previously uploaded data.
  2. Does not leak GPU resources on re-initialisation.
  3. Resets internal counters that affect upload_links / download_links.

If any of these invariants are not true the second build may produce corrupted links.


@timvisee timvisee left a comment


I've only been able to skim this just yet, sorry if my review remarks seem a bit hasty 😃


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/segment/src/index/hnsw_index/hnsw.rs (1)

701-714: Avoid re-allocating a second GpuInsertContext for the main graph

A brand-new GpuInsertContext is constructed for the main graph here, while another one is created (and reused) for payload blocks later. That’s an extra set of large GPU buffers (~gigabytes on big indexes) that immediately gets dropped after the main build.

Reusing the same context throughout the whole build() run keeps allocations O(1). One possible approach:

// (1) create context once, before main-graph build)
-let gpu_vectors = Self::create_gpu_vectors(...)?;
-let mut gpu_insert_context = gpu_vectors.as_ref().map(|v| {
-    GpuInsertContext::new(
-        v,
-        get_gpu_groups_count(),
-        config.m,                     // main-graph params
-        config.m0,
-        config.ef_construct,
-        false,
-        1..=GPU_MAX_VISITED_FLAGS_FACTOR,
-    )
-}).transpose()?;
+let mut gpu_insert_context = gpu_vectors.as_ref().map(|v| {
+    GpuInsertContext::new(
+        v,
+        get_gpu_groups_count(),
+        config.m.max(payload_m),      // max of main/payload
+        config.m0.max(config.payload_m0.unwrap_or(config.m0)),
+        config.ef_construct,
+        false,
+        1..=GPU_MAX_VISITED_FLAGS_FACTOR,
+    )
+}).transpose()?;

// (2) pass `&mut gpu_insert_context` to BOTH main and payload builders
 Self::build_main_graph_on_gpu(..., gpu_insert_context.as_mut(), ...)?;
 ...
 Self::build_filtered_graph(..., &mut gpu_insert_context, ...)?;

This reduces memory pressure and honours the PR goal of not reallocating GPU resources between phases.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e097d7c and cac1278.

📒 Files selected for processing (4)
  • lib/segment/src/index/hnsw_index/gpu/gpu_graph_builder.rs (9 hunks)
  • lib/segment/src/index/hnsw_index/gpu/gpu_insert_context.rs (5 hunks)
  • lib/segment/src/index/hnsw_index/gpu/gpu_visited_flags.rs (8 hunks)
  • lib/segment/src/index/hnsw_index/hnsw.rs (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • lib/segment/src/index/hnsw_index/gpu/gpu_insert_context.rs
  • lib/segment/src/index/hnsw_index/gpu/gpu_visited_flags.rs
  • lib/segment/src/index/hnsw_index/gpu/gpu_graph_builder.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: test-consensus-compose
🔇 Additional comments (2)
lib/segment/src/index/hnsw_index/hnsw.rs (2)

449-451: VisitedPool is now reused – great performance win

Creating VisitedPool only once outside the payload-block loop removes thousands of tiny allocations on large payload sets. 🎉


452-467: Double-check that GpuInsertContext gets init()-ialised per block

GpuInsertContext::new(..) now returns an un-initialised context (per the refactor that added an explicit init() call with the remap slice).
If build_graph_on_gpu relies on the context being initialised, we must ensure .init(remap) is invoked before every call; otherwise the builder may read uninitialised GPU buffers and produce corrupted links.

Please verify that one of the following holds:
  • build_graph_on_gpu internally calls gpu_insert_context.init(..) each time, or
  • the caller performs that call right before handing the context over.

If neither is true, add the init call similar to:

// before first use inside build_filtered_graph(...)
-if let Some(ctx) = gpu_insert_context.as_mut() {
-    // ctx is un-initialised here
-}
+if let Some(ctx) = gpu_insert_context.as_mut() {
+    ctx.init(&points_to_index);        // map old→new indices
+}

@IvanPleshkov IvanPleshkov requested a review from timvisee May 2, 2025 11:49
@generall

generall commented May 3, 2025

Please fix linter

@IvanPleshkov
Contributor Author

Please fix linter

done, GPU CI is also green: https://github.com/qdrant/qdrant/actions/runs/14833216062/job/41638808735

@IvanPleshkov IvanPleshkov merged commit 737c15c into dev May 5, 2025
20 checks passed
@IvanPleshkov IvanPleshkov deleted the dont-reallocate-gpu-resources-between-payload-blocks branch May 5, 2025 12:42
generall pushed a commit that referenced this pull request May 22, 2025
* dont reallocate gpu resources between payload blocks

* are you happy codespell

* fix review remarks

* fix gpu tests

* are you happy clippy