Mutable ID tracker: implement byte storage format for mappings #6166

Conversation
internal_to_external: &mut Vec<PointIdType>,
external_to_internal_num: &mut BTreeMap<u64, PointOffsetType>,
external_to_internal_uuid: &mut BTreeMap<Uuid, PointOffsetType>,

/// Store new mapping changes, appending them to the given file
The meat of all store/load logic begins here. It's separated into functions like:
- store (append) mappings to file
- write mappings to writer
- write single mapping to writer
- load mappings from file
- read mappings from reader
- read single mapping from reader
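A minimal sketch of that layered decomposition, with a batch writer delegating to a single-entry writer. The `Mapping` struct, field names, and widths here are illustrative assumptions, not the PR's actual types:

```rust
use std::io::{self, Write};

// Hypothetical mapping entry; field widths are illustrative only.
struct Mapping {
    external: u64,
    internal: u32,
}

// Write a single mapping to any writer (little-endian).
fn write_mapping<W: Write>(w: &mut W, m: &Mapping) -> io::Result<()> {
    w.write_all(&m.external.to_le_bytes())?;
    w.write_all(&m.internal.to_le_bytes())
}

// Write a batch of mappings by delegating to the single-entry writer.
fn write_mappings<W: Write>(w: &mut W, mappings: &[Mapping]) -> io::Result<()> {
    for m in mappings {
        write_mapping(w, m)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut buf = Vec::new();
    write_mappings(&mut buf, &[Mapping { external: 7, internal: 1 }])?;
    assert_eq!(buf.len(), 12); // 8 bytes external + 4 bytes internal
    Ok(())
}
```

Keeping the single-entry writer separate from the batch writer means the same code path serves both file storage and in-memory serialization tests.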
writer.write_u128::<FileEndianess>(external_id.to_u128_le())?;
    }
}

Ok(())
}

fn load_versions(
Storage for versions starts here and is still using the same format. This will be taken care of in the next PR.
Force-pushed from a5f067f to 92bf3e7
This comment was marked as resolved.
LGTM, though, why not use serde_cbor for serialization instead of a hand-rolled format? It would be a bit less efficient, but easier to maintain, and we already use CBOR in a number of places. 🤔
I didn't look at CBOR specifically, but did look at bincode. Looking at CBOR now, I can list a few arguments for why I'd still prefer the current approach:
std::iter::from_fn(move || match read_entry(&mut reader) {
    Ok(entry) => Some(Ok(entry)),
    // Done reading if end of file is reached
    Err(err) if err.kind() == io::ErrorKind::UnexpectedEof => None,
What would happen if we have a corrupted file in which we ignored the last entry's error, and then continue to write new records into it?
Thanks! That would indeed be a problem. I had it on my task list, but forgot it in this PR.
I'll try to add some logic to truncate it.
I'll handle this in a separate PR, as it isn't trivial to do and I don't want to make this PR more complicated. I've added it in the tracking issue.
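For reference, the deferred truncation logic could look roughly like the sketch below. It assumes a fixed entry size (`ENTRY_LEN` and `truncate_partial_tail` are hypothetical names, not from the PR); the real variable-length format would need to walk entries to find the last valid offset:

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};

// Hypothetical fixed entry size; the real format is variable-length.
const ENTRY_LEN: u64 = 12;

// Truncate any partially written trailing entry before appending new records,
// so a crash mid-flush cannot corrupt subsequent writes.
fn truncate_partial_tail(file: &File) -> io::Result<()> {
    let len = file.metadata()?.len();
    let valid = len - (len % ENTRY_LEN); // last complete entry boundary
    if valid != len {
        file.set_len(valid)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("truncate_sketch.bin");
    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(&path)?;
    file.write_all(&[0u8; 13])?; // one full entry plus one stray byte
    truncate_partial_tail(&file)?;
    assert_eq!(file.metadata()?.len(), 12);
    std::fs::remove_file(&path)?;
    Ok(())
}
```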
Not exactly related to this PR, but those file names are already used in the immutable ID tracker. Let's change them to other unique names to avoid potential misreads.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
lib/segment/src/id_tracker/mutable_id_tracker.rs (1)
21-22: 🛠️ Refactor suggestion: Consider renaming files to avoid confusion with the immutable ID tracker

As pointed out by generall in the PR comments, the file names `id_tracker.mappings` and `id_tracker.versions` are already in use by the immutable ID tracker. This could lead to confusion and potential issues. Consider using more specific file names like `mutable_id_tracker.mappings` and `mutable_id_tracker.versions` to clearly distinguish between the two implementations.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/segment/src/id_tracker/mutable_id_tracker.rs
(11 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: test (macos-latest)
- GitHub Check: test-consensus
- GitHub Check: test (windows-latest)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
🔇 Additional comments (9)
lib/segment/src/id_tracker/mutable_id_tracker.rs (9)
32-41: Well-designed implementation of change type classification

The addition of the `change_type` method to the `MappingChange` enum provides a clean way to classify different mapping changes, improving code organization.
43-62: Good enum design with byte representation

The `MappingChangeType` enum with `#[repr(u8)]` is well-implemented. The `from_byte` method provides a robust way to convert from raw bytes to the enum type, with proper error handling for invalid values.
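A sketch of that pattern follows; the discriminant values and error type here are assumptions, not necessarily those used in the PR:

```rust
use std::io;

// Hypothetical tagged change type mirroring the reviewed design.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MappingChangeType {
    Insert = 0,
    Delete = 1,
}

impl MappingChangeType {
    // Convert a raw byte back to the enum, rejecting unknown tags
    // (which indicate corruption or an unsupported format version).
    fn from_byte(byte: u8) -> io::Result<Self> {
        match byte {
            0 => Ok(Self::Insert),
            1 => Ok(Self::Delete),
            other => Err(io::Error::new(
                io::ErrorKind::InvalidData,
                format!("invalid mapping change type: {other}"),
            )),
        }
    }
}

fn main() {
    assert_eq!(
        MappingChangeType::from_byte(1).unwrap(),
        MappingChangeType::Delete
    );
    assert!(MappingChangeType::from_byte(9).is_err());
}
```

Rejecting unknown tags rather than mapping them to a default is what makes corrupted entries detectable at read time.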
113-127: Improved error handling and code organization

The refactored `open` method now uses dedicated loading functions, making the code more readable and maintainable while preserving proper error handling.
330-353: Well-structured persistence implementation with proper fsync

The implementation correctly:
- Opens the file in append mode
- Writes changes using a buffered writer
- Explicitly calls `sync_all()` to ensure durability

This ensures data integrity even in case of system crashes.
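The three steps above can be sketched with std-only code; `append_changes` is a hypothetical name, and changes are modeled as raw byte vectors rather than the PR's entry types:

```rust
use std::fs::OpenOptions;
use std::io::{self, BufWriter, Write};
use std::path::Path;

// Append serialized changes with buffered writes, then fsync for durability.
fn append_changes(path: &Path, changes: &[Vec<u8>]) -> io::Result<()> {
    let file = OpenOptions::new().create(true).append(true).open(path)?;
    let mut writer = BufWriter::new(file);
    for change in changes {
        writer.write_all(change)?;
    }
    writer.flush()?; // push buffered bytes to the OS
    writer.get_ref().sync_all()?; // fsync: force bytes onto disk
    Ok(())
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("append_sketch.bin");
    let _ = std::fs::remove_file(&path);
    append_changes(&path, &[vec![1, 2], vec![3]])?;
    append_changes(&path, &[vec![4]])?; // append mode preserves earlier bytes
    assert_eq!(std::fs::read(&path)?, vec![1, 2, 3, 4]);
    std::fs::remove_file(&path)?;
    Ok(())
}
```

Note that `flush()` alone only hands bytes to the OS page cache; the explicit `sync_all()` is what guarantees the data survives a power loss.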
355-375: Clear documentation and implementation of write operations

The function is well-documented, explaining the file format, and the implementation properly handles the serialization of mapping changes.
384-405: Elegant error handling in the iterator

The `read_mappings_iter` function uses `take_while_inclusive` to stop iteration after the first error, which is a clean approach to handling errors in an iterator-based API.
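For illustration, a std-only analogue of itertools' `take_while_inclusive` over a fallible entry iterator might look like this; `take_through_first_err` is a hypothetical helper, not code from the PR:

```rust
// Yield items up to and including the first Err, then end the iteration,
// so a single corrupted entry terminates the stream without being dropped.
fn take_through_first_err<I, T, E>(mut iter: I) -> impl Iterator<Item = Result<T, E>>
where
    I: Iterator<Item = Result<T, E>>,
{
    let mut done = false;
    std::iter::from_fn(move || {
        if done {
            return None;
        }
        let item = iter.next()?;
        done = item.is_err();
        Some(item)
    })
}

fn main() {
    let entries: Vec<Result<u32, String>> =
        vec![Ok(1), Ok(2), Err("corrupted".into()), Ok(3)];
    let read: Vec<_> = take_through_first_err(entries.into_iter()).collect();
    // Ok(1), Ok(2), and the first Err are yielded; Ok(3) is never reached.
    assert_eq!(read.len(), 3);
}
```

Including the terminating `Err` in the output (rather than swallowing it) lets the caller surface the corruption to the user.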
497-532: Comprehensive entry deserialization with robust error handling

The implementation correctly handles different types of mapping changes and has proper error handling for corrupted data.
534-575: Well-documented serialization format

The documentation clearly explains the serialization format with a detailed comment showing the byte layout, which is excellent for future maintainers.
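As an illustration of such a byte-layout roundtrip, here is a hedged sketch of a fixed entry format; the field order and widths are assumptions for demonstration and differ from the PR's documented format:

```rust
use std::io::{self, Read, Write};

// Hypothetical layout, little-endian throughout:
//   [change type: 1 byte][internal id: 4 bytes][external id: 16 bytes]
fn write_entry<W: Write>(w: &mut W, tag: u8, internal: u32, external: u128) -> io::Result<()> {
    w.write_all(&[tag])?;
    w.write_all(&internal.to_le_bytes())?;
    w.write_all(&external.to_le_bytes())
}

fn read_entry<R: Read>(r: &mut R) -> io::Result<(u8, u32, u128)> {
    let mut tag = [0u8; 1];
    r.read_exact(&mut tag)?;
    let mut internal = [0u8; 4];
    r.read_exact(&mut internal)?;
    let mut external = [0u8; 16];
    r.read_exact(&mut external)?;
    Ok((tag[0], u32::from_le_bytes(internal), u128::from_le_bytes(external)))
}

fn main() -> io::Result<()> {
    let mut buf = Vec::new();
    write_entry(&mut buf, 0, 42, 0xDEAD_BEEF)?;
    assert_eq!(buf.len(), 21); // 1 + 4 + 16 bytes
    let mut reader = &buf[..];
    assert_eq!(read_entry(&mut reader)?, (0, 42, 0xDEAD_BEEF));
    Ok(())
}
```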
906-978: Excellent test coverage for serialization edge cases

The tests comprehensively cover various scenarios, including:
- Normal serialization/deserialization
- Empty input
- Invalid type bytes
- Malformed data
- Incomplete entries
- Multiple entries

This ensures the robustness of the storage format implementation.
std::iter::from_fn(move || match read_entry(&mut reader) {
    Ok(entry) => Some(Ok(entry)),
    // Done reading if end of file is reached
    Err(err) if err.kind() == io::ErrorKind::UnexpectedEof => None,
💡 Verification agent
❓ Verification inconclusive
Consider implementing file truncation for corrupted files
Per the comment from generall and your response, there's a need to handle corrupted files when writing new records. If a file is corrupted and the last entry was ignored, writing new records could lead to further corruption.
I don't see the truncation logic implemented yet - this should be addressed as you mentioned in your response.
Could you implement a mechanism to truncate corrupted files before appending new records? This would involve checking for file integrity before writing and truncating to the last valid entry if corruption is detected.
Below is a shell script to verify whether any file truncation logic is already present in the target file:
If nothing relevant is found in the results of these commands, that confirms the truncation mechanism for handling corrupted files has not yet been implemented. In that case, please consider updating the code before appending new records.
Implement file truncation for corrupted files in mutable_id_tracker
- The current logic only ignores errors (`io::ErrorKind::UnexpectedEof`) by returning `None`, without attempting to repair or truncate the file.
- If a file is corrupted and the last entry is ignored, appending new records risks further corrupting file data.
- Action Required: Introduce file integrity checks prior to writing new records and, upon detection of corruption, truncate the file at the last valid record.
🏁 Script executed:
#!/bin/bash
# Check for any file truncation markers, e.g. usage of "truncate" or "set_len",
# in the mutable_id_tracker.rs file.
rg -nH "truncate" lib/segment/src/id_tracker/mutable_id_tracker.rs
rg -nH "set_len" lib/segment/src/id_tracker/mutable_id_tracker.rs
Action Required: Implement File Truncation for Corrupted Files
After verifying that there is no file truncation logic (e.g., no `truncate` or `set_len` in `lib/segment/src/id_tracker/mutable_id_tracker.rs`), it appears the mechanism to repair corrupted files has not yet been implemented. When encountering corruption (such as an `UnexpectedEof` error), the code should check the file's integrity and truncate it to the last valid entry before appending new records. This is necessary to prevent further data corruption.
- File: `lib/segment/src/id_tracker/mutable_id_tracker.rs`
- Location: Around line 399, where the error is handled.
- Suggested improvement: Incorporate file integrity checking to detect corruption and truncate the file accordingly.
Please update the code to address this issue.
Merging now. The review comments are handled in a follow-up PR.
* Extract common points mappings check into function
* Fully rework load/store for point mapping changes, use byte format
* Add extra serialize/deserialize test for new format
* Fuse iterator
* Add another test case
* Take mutable reader when reading point mapping entry
* Flush versions in the same way we flush mappings now
* Don't explicitly fuse reader, it's not necessary
Tracked in: #6157
Depends on: #6158
Implements a custom byte-based storage format for persisting mutable ID tracker mappings. It also improves how we handle reading/writing/loading/storing these mappings in general.
Note that this does not really touch point version storage yet. That'll be done in the following PR.
The following comment in code describes the storage format:
The implementation is based on what we have in the immutable ID tracker, though it has been extended a bit to also support deletions.
It is resilient against partially written files due to a crash during flush. The reader ignores the very last entry if it's incomplete, so that we can still recover with the WAL. If any other entries are corrupted an error is returned.
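That recovery behavior can be sketched with a simplified fixed-size record format (the real format is richer and variable-length); an incomplete trailing record ends the read instead of failing it, while corruption elsewhere still surfaces as an error:

```rust
use std::io::{self, Read};

// Read fixed-size records until EOF; a truncated trailing record (e.g. from
// a crash mid-flush) ends iteration cleanly instead of failing the load.
fn read_records<R: Read>(mut reader: R) -> io::Result<Vec<u32>> {
    let mut records = Vec::new();
    loop {
        let mut buf = [0u8; 4];
        match reader.read_exact(&mut buf) {
            Ok(()) => records.push(u32::from_le_bytes(buf)),
            // Clean EOF or incomplete final record: stop reading.
            Err(err) if err.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(err) => return Err(err),
        }
    }
    Ok(records)
}

fn main() {
    // Two complete records followed by a partially written third one.
    let bytes = [1, 0, 0, 0, 2, 0, 0, 0, 9];
    assert_eq!(read_records(&bytes[..]).unwrap(), vec![1, 2]);
}
```

Dropping only the incomplete tail is safe precisely because the WAL can replay the operation that produced it.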
This includes tests to ensure that serialization and deserialization are sound, even in the case of corrupted data.
Tasks

All Submissions:
- Did you create your branch from `dev`?

New Feature Submissions:
- Did you run the `cargo +nightly fmt --all` command prior to submission?
- Did you lint your code with the `cargo clippy --all --all-features` command?

Changes to Core Features: