Mutable ID tracker: implement byte storage format for mappings #6166

timvisee · 2025-03-13T16:20:51Z

Tracked in: #6157
Depends on: #6158

Implements a custom byte-based storage format for persisting mutable ID tracker mappings. It also improves how we handle reading/writing/loading/storing these mappings in general.

Note that this does not really touch point version storage yet. That'll be done in the following PR.

The following comment in code describes the storage format:

/// Each change entry has a variable size. We first write a 1-byte header to define the change
/// type. The change type implies how long the entry is.
///
/// Insertion changes are serialized as follows:
///
/// +-----------------------+-----------------------+------------------+
/// | MappingChangeType: u8 | Number/UUID: u64/u128 | Internal ID: u32 |
/// +-----------------------+-----------------------+------------------+
///
/// Deletion changes are serialized as follows:
///
/// +-----------------------+-----------------------+
/// | MappingChangeType: u8 | Number/UUID: u64/u128 |
/// +-----------------------+-----------------------+

The implementation is based on what we have in the immutable ID tracker, though it has been extended a bit to also support deletions.

It is resilient against partially written files due to a crash during flush. The reader ignores the very last entry if it's incomplete, so that we can still recover with the WAL. If any other entries are corrupted an error is returned.

This includes tests to ensure serializing and deserialization is sound, even in case of corrupted data.

Tasks

Merge Add mutable ID tracker tests #6158
Rebase on dev
Undraft

All Submissions:

Contributions should target the dev branch. Did you create your branch from dev?
Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

Does your submission pass tests?
Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
Have you checked your code using cargo clippy --all --all-features command?

Changes to Core Features:

Have you added an explanation of what your changes do and why you'd like us to include them?
Have you written new tests for your core changes, as applicable?
Have you successfully ran tests with your changes locally?

timvisee · 2025-03-13T16:27:59Z

lib/segment/src/id_tracker/mutable_id_tracker.rs

-    internal_to_external: &mut Vec<PointIdType>,
-    external_to_internal_num: &mut BTreeMap<u64, PointOffsetType>,
-    external_to_internal_uuid: &mut BTreeMap<Uuid, PointOffsetType>,
+/// Store new mapping changes, appending them to the given file


The meat for all store/load logic begins here.

It's all separated in functions like:

store (append) mappings to file

write mappings to writer

write single mapping to writer

load mappings from file

read mappings from reader

read single mapping from reader

timvisee · 2025-03-13T16:28:36Z

lib/segment/src/id_tracker/mutable_id_tracker.rs

+            writer.write_u128::<FileEndianess>(external_id.to_u128_le())?;
+        }
+    }
+
    Ok(())
 }

 fn load_versions(


Storage for versions starts here and is still using the same format. This will be taken care of in the next PR.

ffuugoo

LGTM, though, why not use serde_cbor for serialization instead of hand-rolled format? Would be a bit less efficient, but easier to maintain, and we already use CBOR in number of places. 🤔

timvisee · 2025-03-14T11:44:06Z

LGTM, though, why not use serde_cbor for serialization instead of hand-rolled format? Would be a bit less efficient, but easier to maintain, and we already use CBOR in number of places. 🤔

I didn't look at CBOR specifically, but did look at bincode.

But when I look at CBOR now, I can list a few arguments on why I'd still prefer the current approach:

CBOR does not support a native u128 to store a UUID
with CBOR, we'd need to know the length of each item to read which would require a length header prefix
my previous naive implementation with JSON did split items by newline, but we cannot do the same with CBOR
this custom implementation is still relatively simple (imo)
this custom implementation is smaller

generall · 2025-03-17T09:23:39Z

lib/segment/src/id_tracker/mutable_id_tracker.rs

+    std::iter::from_fn(move || match read_entry(&mut reader) {
+        Ok(entry) => Some(Ok(entry)),
+        // Done reading if end of file is reached
+        Err(err) if err.kind() == io::ErrorKind::UnexpectedEof => None,


What would happen if we have a corrupted file, in which we ignored last entry error, in case we continue to write new records into it?

Thanks! That would indeed be a problem. I had it on my task list, but forgot it in this PR.

I'll try to add some logic to truncate it.

I'll handle this in a separate PR, as it isn't trivial to do and I don't want to make this PR more complicated. I've added it in the tracking issue.

generall · 2025-03-17T09:36:52Z

Not exactly related to this PR, but

const FILE_MAPPINGS: &str = "id_tracker.mappings";
const FILE_VERSIONS: &str = "id_tracker.versions";

those file names already used in immutable ID tracker. Let's change them to some other unique names to avoid potential misread.

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (1)

lib/segment/src/id_tracker/mutable_id_tracker.rs (1)

21-22: 🛠️ Refactor suggestion

Consider renaming files to avoid confusion with immutable ID tracker

As pointed out by generall in the PR comments, the file names id_tracker.mappings and id_tracker.versions are already in use by the immutable ID tracker. This could lead to confusion and potential issues.

Consider using more specific file names like mutable_id_tracker.mappings and mutable_id_tracker.versions to clearly distinguish between the two implementations.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 92bf3e7 and 12663ac.

📒 Files selected for processing (1)

lib/segment/src/id_tracker/mutable_id_tracker.rs (11 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (13)

GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: test-low-resources
GitHub Check: test-consistency
GitHub Check: test-consensus-compose
GitHub Check: test (macos-latest)
GitHub Check: test-consensus
GitHub Check: test (windows-latest)
GitHub Check: Basic TLS/HTTPS tests
GitHub Check: test (ubuntu-latest)
GitHub Check: test
GitHub Check: test
GitHub Check: test

🔇 Additional comments (9)

lib/segment/src/id_tracker/mutable_id_tracker.rs (9)

32-41: Well-designed implementation of change type classification

The addition of the change_type method to the MappingChange enum provides a clean way to classify different mapping changes, improving code organization.

43-62: Good enum design with byte representation

The MappingChangeType enum with #[repr(u8)] is well-implemented. The from_byte method provides a robust way to convert from raw bytes to the enum type with proper error handling for invalid values.

113-127: Improved error handling and code organization

The refactored open method now uses dedicated loading functions, making the code more readable and maintainable while preserving proper error handling.

330-353: Well-structured persistence implementation with proper fsync

The implementation correctly:

Opens the file in append mode

Writes changes using a buffered writer

Explicitly calls sync_all() to ensure durability

This ensures data integrity even in case of system crashes.

355-375: Clear documentation and implementation of write operations

The function is well-documented explaining the file format, and the implementation properly handles the serialization of mapping changes.

384-405: Elegant implementation of error handling in iterator

The read_mappings_iter function uses take_while_inclusive to stop iteration after the first error, which is a clean approach to handling errors in an iterator-based API.

497-532: Comprehensive entry deserialization with robust error handling

The implementation correctly handles different types of mapping changes and has proper error handling for corrupted data.

534-575: Well-documented serialization format

The documentation clearly explains the serialization format with a detailed comment showing the byte layout, which is excellent for future maintainers.

906-978: Excellent test coverage for serialization edge cases

The tests comprehensively cover various scenarios including:

Normal serialization/deserialization

Empty input

Invalid type bytes

Malformed data

Incomplete entries

Multiple entries

This ensures the robustness of the storage format implementation.

coderabbitai · 2025-03-17T10:11:22Z

lib/segment/src/id_tracker/mutable_id_tracker.rs

+    std::iter::from_fn(move || match read_entry(&mut reader) {
+        Ok(entry) => Some(Ok(entry)),
+        // Done reading if end of file is reached
+        Err(err) if err.kind() == io::ErrorKind::UnexpectedEof => None,


💡 Verification agent

❓ Verification inconclusive

Consider implementing file truncation for corrupted files

Per the comment from generall and your response, there's a need to handle corrupted files when writing new records. If a file is corrupted and the last entry was ignored, writing new records could lead to further corruption.

I don't see the truncation logic implemented yet - this should be addressed as you mentioned in your response.

Could you implement a mechanism to truncate corrupted files before appending new records? This would involve checking for file integrity before writing and truncating to the last valid entry if corruption is detected.

Below is a shell script to verify whether any file truncation logic is already present in the target file:

Once you’ve reviewed the results from these commands, please note that—if nothing relevant is found—it confirms that the truncation mechanism for handling corrupted files has not yet been implemented. Based on that outcome, please consider updating your code before proceeding to append new records.

Implement file truncation for corrupted files in mutable_id_tracker

The current logic only ignores errors (io::ErrorKind::UnexpectedEof) by returning None without attempting to repair or truncate the file.

If a file is corrupted and the last entry is ignored, appending new records risks further corrupting file data.

Action Required: Introduce file integrity checks prior to writing new records and, upon detection of corruption, truncate the file at the last valid record.

🏁 Script executed:

#!/bin/bash # Check for any file truncation markers, e.g. usage of "truncate" or "set_len", # in the mutable_id_tracker.rs file. rg -nH "truncate" lib/segment/src/id_tracker/mutable_id_tracker.rs rg -nH "set_len" lib/segment/src/id_tracker/mutable_id_tracker.rs

Length of output: 132

Action Required: Implement File Truncation for Corrupted Files

After verifying that there is no evidence of file truncation logic (e.g., the absence of keywords like truncate or set_len in lib/segment/src/id_tracker/mutable_id_tracker.rs), it appears that the mechanism to repair corrupted files has not yet been implemented. When encountering a corruption (such as an UnexpectedEof error), the code should check the file's integrity and truncate it to the last valid entry before appending new records. This is necessary to prevent further data corruption.

File: lib/segment/src/id_tracker/mutable_id_tracker.rs

Location Concerned: Around line 399, where the error is handled.

Suggested Improvement: Incorporate file integrity checking logic to detect corruption and truncate the file accordingly.

Please update the code to address this issue.

timvisee · 2025-03-17T13:25:46Z

Merging now. The review comments are handled in a follow up PR.

* Extract common points mappings check into function * Fully rework load/store for point mapping changes, use byte format * Add extra serialize/deserialize test for new format * Fuse iterator * Add another test case * Take mutable reader when reading point mapping entry * Flush versions in the same way we flush mappings now * Don't explicitly fuse reader, it's not necessary

timvisee mentioned this pull request Mar 13, 2025

Tracking issue: mutable ID tracker without RocksDB #6157

Closed

18 tasks

timvisee requested review from agourlay, generall, ffuugoo and JojiiOfficial March 13, 2025 16:21

timvisee commented Mar 13, 2025

View reviewed changes

JojiiOfficial approved these changes Mar 13, 2025

View reviewed changes

Base automatically changed from mutable-id-tracker-tests to dev March 14, 2025 08:52

timvisee added 7 commits March 14, 2025 09:53

Extract common points mappings check into function

c9228a7

Fully rework load/store for point mapping changes, use byte format

172a671

Add extra serialize/deserialize test for new format

9199282

Fuse iterator

786e88a

Add another test case

65e1f59

Take mutable reader when reading point mapping entry

0c997a9

Flush versions in the same way we flush mappings now

92bf3e7

timvisee force-pushed the mutable-id-tracker-byte-storage branch from a5f067f to 92bf3e7 Compare March 14, 2025 08:53

This comment was marked as resolved.

Sign in to view

ffuugoo approved these changes Mar 14, 2025

View reviewed changes

timvisee mentioned this pull request Mar 14, 2025

Mutable ID tracker: implement storage for point versions #6169

Merged

9 tasks

timvisee changed the title ~~Implement byte storage format for mutable ID tracker mappings~~ Mutable ID tracker: implement byte storage format for mappings Mar 14, 2025

generall reviewed Mar 17, 2025

View reviewed changes

Don't explicitly fuse reader, it's not necessary

12663ac

coderabbitai bot reviewed Mar 17, 2025

View reviewed changes

github-actions bot mentioned this pull request Mar 17, 2025

Flaky test multivector_filtrable_hnsw_test::test_multi_filterable_hnsw::case_5_recommend_eq #5902

Closed

timvisee merged commit e86f364 into dev Mar 17, 2025
17 checks passed

timvisee deleted the mutable-id-tracker-byte-storage branch March 17, 2025 13:26

This was referenced Mar 18, 2025

Mutable ID tracker: truncate mappings file if last entry is partial #6173

Merged

Mutable ID tracker integration #6174

Merged

Mutable ID tracker: show warning if mapping and versions count mismatch #6207

Merged

coderabbitai bot mentioned this pull request Mar 28, 2025

Mutable ID tracker: improve version flushing, grow file in one shot #6267

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Mutable ID tracker: implement byte storage format for mappings #6166

Mutable ID tracker: implement byte storage format for mappings #6166

Uh oh!

timvisee commented Mar 13, 2025 •

edited

Loading

Uh oh!

timvisee Mar 13, 2025

Uh oh!

timvisee Mar 13, 2025

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

Uh oh!

ffuugoo left a comment

Uh oh!

timvisee commented Mar 14, 2025 •

edited

Loading

Uh oh!

generall Mar 17, 2025

Uh oh!

timvisee Mar 17, 2025

Uh oh!

timvisee Mar 17, 2025

Uh oh!

generall commented Mar 17, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 17, 2025

Uh oh!

timvisee commented Mar 17, 2025

Uh oh!

Uh oh!

Uh oh!

Mutable ID tracker: implement byte storage format for mappings #6166

Mutable ID tracker: implement byte storage format for mappings #6166

Uh oh!

Conversation

timvisee commented Mar 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Tasks

All Submissions:

New Feature Submissions:

Changes to Core Features:

Uh oh!

timvisee Mar 13, 2025

Choose a reason for hiding this comment

Uh oh!

timvisee Mar 13, 2025

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

Uh oh!

ffuugoo left a comment

Choose a reason for hiding this comment

Uh oh!

timvisee commented Mar 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

generall Mar 17, 2025

Choose a reason for hiding this comment

Uh oh!

timvisee Mar 17, 2025

Choose a reason for hiding this comment

Uh oh!

timvisee Mar 17, 2025

Choose a reason for hiding this comment

Uh oh!

generall commented Mar 17, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2025

Choose a reason for hiding this comment

Uh oh!

timvisee commented Mar 17, 2025

Uh oh!

Uh oh!

Uh oh!

timvisee commented Mar 13, 2025 •

edited

Loading

timvisee commented Mar 14, 2025 •

edited

Loading