-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Reuse space more in gridstore #6445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
lib/gridstore/src/tracker.rs
Outdated
#[derive(Debug)] | ||
enum PointerUpdate { | ||
Set(ValuePointer), | ||
Unset(ValuePointer), | ||
#[derive(Debug, Default)] | ||
struct PointerUpdates { | ||
/// Whether the latest pointer is set (true) or unset (false) | ||
latest_is_set: bool, | ||
/// List of pointers where the value was written | ||
history: SmallVec<[ValuePointer; 1]>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main change of the PR
📝 WalkthroughWalkthroughThe changes introduce and standardize the use of the 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (13)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
lib/gridstore/src/tracker.rs (3)
39-45
: Consider documenting the invariants ofPointerUpdates
Because correctness of the clean-up logic relies onhistory
never being empty oncelatest_is_set
becomestrue/false
, a short doc-comment describing that assumption will help future maintainers.
172-190
: Optionally deduplicateold_pointers
before returning
With the fix above duplicates should be rare, but if external callers treat the returned vector as a set, it may be safer (and cheaper long-term) to weed out duplicates here:old_pointers.sort_unstable(); old_pointers.dedup();Not essential, but helps keep the freeing path idempotent.
291-293
: Minor ergonomics—avoid doubleOption
withand_then
Readability can be improved by replacing themap(..).or_else(..).flatten()
dance with:self.pending_updates .get(&point_offset) .and_then(PointerUpdates::latest) .or_else(|| self.get_raw(point_offset).copied())Same behaviour, one fewer
Option
layer.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (6)
Cargo.toml
(1 hunks)lib/collection/Cargo.toml
(1 hunks)lib/gridstore/Cargo.toml
(1 hunks)lib/gridstore/src/gridstore.rs
(1 hunks)lib/gridstore/src/tracker.rs
(8 hunks)lib/segment/Cargo.toml
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: lint
- GitHub Check: integration-tests
- GitHub Check: storage-compat-test
🔇 Additional comments (8)
Cargo.toml (1)
208-208
: Properly adding smallvec dependency.Adding the smallvec crate as a workspace dependency aligns with the PR objective to improve space reuse in gridstore by tracking update history.
lib/gridstore/Cargo.toml (1)
20-20
: LGTM - Adding smallvec as a workspace dependency.This addition is consistent with the root Cargo.toml change and supports the implementation of pointer update history tracking in the tracker module.
lib/collection/Cargo.toml (1)
46-46
: Good standardization of smallvec dependency.Converting from a fixed version to a workspace dependency enhances consistency across the project.
lib/segment/Cargo.toml (1)
94-94
: Good standardization of smallvec dependency.Converting from a fixed version to a workspace dependency enhances consistency across the project.
lib/gridstore/src/gridstore.rs (1)
736-767
: Good test refactoring with explicit block offset reuse test.The refactored test with a closure nicely validates the PR objective of improved space reuse. The test now explicitly verifies that after flushing updates, the first block offset becomes available again for reuse.
The closure eliminates code duplication and makes the test more maintainable while clearly demonstrating the improved behavior.
lib/gridstore/src/tracker.rs (3)
9-9
: Good call introducingSmallVec
to avoid heap allocations in the common case
The import is correct and sits with the other extern-crate imports, keeping the module header tidy.
309-313
: Looks good – offset tracking logic preserved
Ensuresnext_pointer_offset
only grows.
321-324
: Behaviourally sound, but double-check race conditions in the caller
unset
relies onget
returning the latest flushed/pending value. IfTracker
is ever shared across threads, the caller must synchronise calls toset
/unset
/flush
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR updates the gridstore implementation to more effectively manage and free outdated pointers during flushes by tracking the complete history of pointer updates. Key changes include:
- Replacing the previous pointer update enum with a new PointerUpdates struct to record update history.
- Adjusting the persistence logic to free all outdated pointer locations.
- Updating tests and Cargo.toml dependency references for consistency.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
lib/segment/Cargo.toml | Changed smallvec to be workspace-managed. |
lib/gridstore/src/tracker.rs | Introduced PointerUpdates struct and updated flush logic to correctly free earlier pointer updates. |
lib/gridstore/src/gridstore.rs | Modified tests to verify block offset reuse after flush. |
lib/gridstore/Cargo.toml | Updated smallvec dependency to workspace. |
lib/collection/Cargo.toml | Updated smallvec dependency to workspace. |
Cargo.toml | Updated smallvec dependency version reference. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
lib/gridstore/src/tracker.rs (2)
78-88
: Consider using a reference instead of consuming self.The
outdated_pointers
method currently consumesself
by taking ownership, which means thePointerUpdates
instance can't be used after calling this method. Consider changing the signature to use a reference if you need to preserve the instance.- fn outdated_pointers(self) -> impl Iterator<Item = ValuePointer> { + fn outdated_pointers(&self) -> impl Iterator<Item = ValuePointer> + '_ { - let take = if self.latest_is_set { + let take = if self.latest_is_set { // all but the latest one - self.history.len().saturating_sub(1) + self.history.len().saturating_sub(1) } else { // all of them - self.history.len() + self.history.len() }; - self.history.into_iter().take(take) + self.history.iter().take(take).copied() }
171-201
: Consider handling errors more robustly.If
mmap.flush()
fails, the method will return an error, but the pending updates have already been taken and will be lost. Consider handling this more robustly to preserve updates in case of errors.pub fn write_pending_and_flush(&mut self) -> std::io::Result<Vec<ValuePointer>> { // Write pending updates from memory - let mut pending_updates = std::mem::take(&mut self.pending_updates); + // Take a clone instead of ownership to preserve updates in case of errors + let mut pending_updates = self.pending_updates.clone(); + self.pending_updates.clear(); let mut old_pointers = Vec::new(); for (point_offset, updates) in pending_updates.drain() { match updates.latest() { // ... existing code ... } old_pointers.extend(updates.outdated_pointers()); } // increment header count if necessary self.persist_pointer_count(); // Flush the mmap - self.mmap.flush()?; + if let Err(e) = self.mmap.flush() { + // Restore the pending updates on error + self.pending_updates = pending_updates; + return Err(e); + } Ok(old_pointers) }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/gridstore/src/tracker.rs
(8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: lint
🔇 Additional comments (10)
lib/gridstore/src/tracker.rs (10)
9-9
: Good addition of the SmallVec dependency.The import of
smallvec::SmallVec
is appropriate for this use case, as it provides a vector-like container that's optimized for the common case where there are few elements, avoiding heap allocations.
39-45
: Well-structured replacement for PointerUpdate.This new
PointerUpdates
struct effectively implements the core of the PR's objective, tracking all locations where values have been written between flushes, not just the last one.
47-62
: Implementation looks good with proper safeguards.The implementation for
set
andunset
is clear and includes the defensive check to prevent duplicate pointers in the history as suggested in previous review comments.
69-75
: Clean implementation of latest pointer resolution.The
latest
method correctly handles both set and unset states, returning the appropriate value based on thelatest_is_set
flag.
78-88
: Good implementation with safeguard against underflow.The
outdated_pointers
method correctly identifies which pointers should be freed based on the latest state, and includes thesaturating_sub
safeguard as suggested in a previous review.
107-107
: Appropriate field type update.The
pending_updates
field has been correctly updated to use the newPointerUpdates
struct.
175-193
: Core improvement for space reuse implemented correctly.This is the main change of the PR, as noted in previous comments. The implementation now properly:
- Handles the latest pointer state (set/unset)
- Collects all outdated pointers from the update history
- Ensures they're properly freed
This directly addresses the PR's objective of improving space reuse by tracking all locations where values were written between flushes.
294-294
: Appropriate method update.The
get
method has been properly updated to use the newPointerUpdates.latest()
method.
312-315
: Clean implementation of set method.The method now correctly creates or retrieves a
PointerUpdates
instance and delegates to itsset
method.
324-327
: Well-implemented unset method.The method correctly creates or retrieves a
PointerUpdates
instance and delegates to itsunset
method.
/// If this is `true`, then history must have at least one element. | ||
latest_is_set: bool, | ||
/// List of pointers where the value has been written | ||
history: SmallVec<[ValuePointer; 1]>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! I like the use of SmallVec here.
This PR fixes one thing we overlooked in gridstore.
Pending updates only stored the last place where a value is. So, when flushing, we would only free the previous stored value. If multiple updates happened in between flushes, we'd end up with storage that is never marked as free.
With these changes, now we keep track of all the places that values have been written in between flushes, so that we only keep the latest one, and are able to mark all the previous places as free.
Testing
For testing, I modified one test, so that it failed without the changes, and also did manual validation by making sure that
bfb -n 5000000 --max-id 10 -d 128 --float-payloads true --segments 1 --on-disk-payload --text-payloads --text-payload-length 100 --skip-field-indices
doesn't grow the size of payload storage forever. It now reduces in between flushes, and it is minimal after updates have finished.