@timvisee timvisee commented Mar 11, 2025

Tracked in: #6157

Add an initial implementation of the mutable ID tracker. It keeps its state in memory and persists it to disk. The key selling point is that it does not rely on RocksDB, but on simple files.

The idea is simple: the ID tracker holds a list of mappings and a list of point versions. All changes are simply appended to a file on disk. When loading from disk, we scan through the whole file and deduplicate in memory so that only the latest mapping for each point is kept.
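
As a rough illustration of the append-and-replay idea, here is a minimal sketch using only the standard library. The `MappingChange` type, the function names, and the fixed 13-byte record layout are assumptions made for this example; they are not the on-disk format or API used in this PR.

```rust
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io::{self, BufReader, BufWriter, Read, Write};
use std::path::Path;

/// Hypothetical change record: `Some(internal)` sets a mapping, `None` deletes it.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MappingChange {
    pub external_id: u64,
    pub internal_id: Option<u32>,
}

/// Append changes to the end of the file; existing entries are never rewritten.
pub fn append_changes(path: &Path, changes: &[MappingChange]) -> io::Result<()> {
    let file = OpenOptions::new().create(true).append(true).open(path)?;
    let mut writer = BufWriter::new(file);
    for change in changes {
        // Assumed record layout: 1-byte tag, 8-byte external ID, 4-byte internal ID
        let tag = u8::from(change.internal_id.is_some());
        writer.write_all(&[tag])?;
        writer.write_all(&change.external_id.to_le_bytes())?;
        writer.write_all(&change.internal_id.unwrap_or(0).to_le_bytes())?;
    }
    writer.flush()
}

/// Replay the whole file in order; later entries overwrite earlier ones.
pub fn load_mappings(path: &Path) -> io::Result<HashMap<u64, u32>> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut mappings = HashMap::new();
    let mut record = [0u8; 13];
    loop {
        match reader.read_exact(&mut record) {
            Ok(()) => {}
            // End of file; this sketch also silently stops at a truncated record
            Err(err) if err.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(err) => return Err(err),
        }
        let mut external = [0u8; 8];
        external.copy_from_slice(&record[1..9]);
        let mut internal = [0u8; 4];
        internal.copy_from_slice(&record[9..13]);
        let external_id = u64::from_le_bytes(external);
        if record[0] == 1 {
            mappings.insert(external_id, u32::from_le_bytes(internal));
        } else {
            mappings.remove(&external_id);
        }
    }
    Ok(mappings)
}
```

Appending `1 -> 10`, then `1 -> 11`, then a delete for another point leaves only `1 -> 11` after loading. Note that the sketch tolerates a truncated trailing record, which is a simplification and not necessarily how the tracker treats corrupt files.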

Obviously, this structure can grow forever if we're not careful. That's why it relies on Qdrant's optimizers. Once the ID tracker collects too many changes, the optimizer will pick it up and create a new ID tracker. The new ID tracker will start from scratch, dropping all the garbage we had collected along the way.

The new ID tracker is ported from our simple ID tracker. What changed in this type is the backing storage, which now uses simple files.

I'm implementing this one step at a time. Tests are in the next PR, and there's more to come. Please see the tracking issue for more information.

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

Comment on lines +236 to +242
// Take out pending mappings to flush and replace it with a preallocated vector to avoid
// frequent reallocation on a busy segment
let pending_mappings = {
    let mut pending_mappings = self.pending_mappings.lock();
    let count = pending_mappings.len();
    mem::replace(&mut *pending_mappings, Vec::with_capacity(count))
};

@timvisee timvisee Mar 12, 2025

This tries to be intelligent about memory allocation.

When we flush, we immediately replace the buffer with a new one that has capacity for at least as many entries as were pending at flush time.

This saves us a bunch of (expensive) reallocations on a hot ID tracker receiving a lot of upserts.

If there are no point ID mapping changes for some time, this will eventually preallocate nothing, which is identical to not having this optimization.
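
To make the behavior concrete, here is a small standalone sketch of the same swap. `PendingBuffer` and its methods are hypothetical names for this example, and it uses `std::sync::Mutex` (hence the `unwrap()` on `lock()`), which may differ from the lock type used in the PR.

```rust
use std::mem;
use std::sync::Mutex;

/// Hypothetical stand-in for the tracker's buffer of unflushed mapping changes.
pub struct PendingBuffer {
    pending: Mutex<Vec<(u64, u32)>>,
}

impl PendingBuffer {
    pub fn new() -> Self {
        Self {
            pending: Mutex::new(Vec::new()),
        }
    }

    /// Queue one change for the next flush.
    pub fn push(&self, external_id: u64, internal_id: u32) {
        self.pending.lock().unwrap().push((external_id, internal_id));
    }

    /// Take all pending entries, leaving behind an empty vector that is
    /// preallocated for a batch of the same size.
    pub fn take_for_flush(&self) -> Vec<(u64, u32)> {
        let mut pending = self.pending.lock().unwrap();
        let count = pending.len();
        mem::replace(&mut *pending, Vec::with_capacity(count))
    }

    /// Capacity of the buffer currently held, for inspection.
    pub fn capacity(&self) -> usize {
        self.pending.lock().unwrap().capacity()
    }
}
```

After a flush of three entries the replacement buffer can hold at least three more without reallocating. A flush that finds nothing pending requests zero capacity, so an idle tracker stops reserving memory after one further flush.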

@timvisee timvisee changed the title from "WIP: mutable in-memory ID tracker without RocksDB" to "Mutable in-memory ID tracker without RocksDB" Mar 12, 2025
@timvisee timvisee marked this pull request as ready for review March 12, 2025 13:01

@timvisee timvisee merged commit 6a1b9de into dev Mar 13, 2025
17 checks passed
@timvisee timvisee deleted the mutable-id-tracker branch March 13, 2025 16:12
timvisee added a commit that referenced this pull request Mar 21, 2025
* Add initial mutable ID tracker

* Correctly handle duplicate point mappings and deleted flags

* Improve error handling in flush

* Preallocate capacity for pending mappings/versions more intelligently

* Warn or error about missing ID tracker files

* Reformat

* Don't crash if just the last mapping/version entry is corrupt

* Move mapping and point parsing into separate functions

* Extract loading logic into separate functions

* Do not allow partially corrupted ID tracker files for now

* Remove TODOs

* Minor improvements

* Fsync mappings and versions file after writing to it

* Return error when fsync fails
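
The last two commits concern syncing after a write and surfacing sync failures. A minimal sketch of that pattern with the standard library follows; `append_durably` is a hypothetical helper for illustration, not a function from this PR.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Append bytes to a file and ask the OS to persist them before returning.
/// Hypothetical helper; the PR's actual flush logic is not shown here.
pub fn append_durably(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(bytes)?;
    // `sync_all` is the standard library's fsync equivalent; returning its
    // result means a failed sync reaches the caller instead of being ignored.
    file.sync_all()
}
```

Returning the `io::Result` from `sync_all` lets the caller decide whether a failed sync should abort the flush.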