Mutable in-memory ID tracker without RocksDB #6150
// Take out pending mappings to flush and replace it with a preallocated vector to avoid
// frequent reallocation on a busy segment
let pending_mappings = {
    let mut pending_mappings = self.pending_mappings.lock();
    let count = pending_mappings.len();
    mem::replace(&mut *pending_mappings, Vec::with_capacity(count))
};
This tries to be intelligent about memory allocation.
When we flush, we immediately replace the buffer with a new one that has capacity for at least as many pending mappings as we just took out.
This saves us a bunch of (expensive) reallocations on a hot ID tracker receiving a lot of upserts.
If there are no point ID mapping changes for some time, this will eventually preallocate nothing, which is identical to not having this optimization.
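For anyone who wants to try the pattern in isolation, here is a minimal sketch. It assumes `std::sync::Mutex` (hence the `unwrap()` on `lock()`), and the `PendingBuffer` type, `MappingChange` alias, and method names are hypothetical stand-ins, not the types used in this PR.

```rust
use std::mem;
use std::sync::Mutex;

/// Hypothetical stand-in for one pending (external ID, internal offset) change.
type MappingChange = (u64, u32);

/// Hypothetical holder of the pending-change buffer only.
struct PendingBuffer {
    pending_mappings: Mutex<Vec<MappingChange>>,
}

impl PendingBuffer {
    fn new() -> Self {
        PendingBuffer {
            pending_mappings: Mutex::new(Vec::new()),
        }
    }

    /// Queue one change for the next flush.
    fn push(&self, change: MappingChange) {
        self.pending_mappings.lock().unwrap().push(change);
    }

    /// Take all pending changes and leave behind an empty buffer that is
    /// preallocated for as many entries as were just taken.
    fn take_for_flush(&self) -> Vec<MappingChange> {
        let mut pending = self.pending_mappings.lock().unwrap();
        let count = pending.len();
        mem::replace(&mut *pending, Vec::with_capacity(count))
    }

    /// Capacity of the buffer currently waiting for new changes.
    fn pending_capacity(&self) -> usize {
        self.pending_mappings.lock().unwrap().capacity()
    }
}
```

After a flush that took 100 changes, the next buffer starts with room for at least 100 entries. After a flush that took nothing, the next buffer is created with `Vec::with_capacity(0)`, which does not allocate.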
* Add initial mutable ID tracker
* Correctly handle duplicate point mappings and deleted flags
* Improve error handling in flush
* Preallocate capacity for pending mappings/versions more intelligently
* Warn or error about missing ID tracker files
* Reformat
* Don't crash if just the last mapping/version entry is corrupt
* Move mapping and point parsing into separate functions
* Extract loading logic into separate functions
* Do not allow partially corrupted ID tracker files for now
* Remove TODOs
* Minor improvements
* Fsync mappings and versions file after writing to it
* Return error when fsync fails
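The last two commits (fsync after writing, and returning an error when fsync fails) can be illustrated with a small standard-library sketch. The `append_and_sync` helper is hypothetical and is not the function added in this PR.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: append `bytes` to the file at `path`, then sync the
/// file to disk before returning.
fn append_and_sync(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(bytes)?;
    // `sync_all` asks the OS to persist data and metadata. Its error is
    // returned to the caller instead of being ignored.
    file.sync_all()
}
```

Returning the result of `sync_all` directly means a caller never treats a flush as durable when the sync failed.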
Tracked in: #6157
Add initial implementation of mutable ID tracker. It's in-memory and persisted on disk. The key selling point is that it does not rely on RocksDB, but on simple files.
The idea is simple: the ID tracker holds a list of mappings and a list of point versions. All changes are simply appended to a file on disk. When loading from disk we read through the whole file and deduplicate in memory, so that only the latest mapping for each point is kept.
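The append-and-deduplicate idea can be sketched roughly as follows. The 13-byte record layout, the `Change` enum, and both function names are assumptions made for illustration only. They do not describe the on-disk format used in this PR. The sketch rejects a truncated trailing record, in line with not allowing partially corrupted files for now.

```rust
use std::collections::HashMap;
use std::io::{self, Read, Write};

/// Hypothetical change record: set or delete the mapping for an external ID.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Change {
    Insert { external_id: u64, internal_id: u32 },
    Delete { external_id: u64 },
}

/// Assumed layout: 1 tag byte + 8 byte external ID + 4 byte internal ID.
const RECORD_SIZE: usize = 13;

/// Append a single change record to the end of the log.
fn append_change<W: Write>(writer: &mut W, change: Change) -> io::Result<()> {
    let (tag, external_id, internal_id) = match change {
        Change::Insert { external_id, internal_id } => (1u8, external_id, internal_id),
        Change::Delete { external_id } => (0u8, external_id, 0u32),
    };
    let mut record = [0u8; RECORD_SIZE];
    record[0] = tag;
    record[1..9].copy_from_slice(&external_id.to_le_bytes());
    record[9..13].copy_from_slice(&internal_id.to_le_bytes());
    writer.write_all(&record)
}

/// Replay the whole log in order. Later records overwrite earlier ones, so
/// only the latest state per external ID survives.
fn load_mappings<R: Read>(reader: &mut R) -> io::Result<HashMap<u64, u32>> {
    let mut bytes = Vec::new();
    reader.read_to_end(&mut bytes)?;
    let records = bytes.chunks_exact(RECORD_SIZE);
    if !records.remainder().is_empty() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "truncated record at end of log",
        ));
    }
    let mut mappings = HashMap::new();
    for record in records {
        let mut ext = [0u8; 8];
        ext.copy_from_slice(&record[1..9]);
        let mut int = [0u8; 4];
        int.copy_from_slice(&record[9..13]);
        let external_id = u64::from_le_bytes(ext);
        match record[0] {
            1 => {
                mappings.insert(external_id, u32::from_le_bytes(int));
            }
            0 => {
                mappings.remove(&external_id);
            }
            tag => {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!("unknown record tag {}", tag),
                ));
            }
        }
    }
    Ok(mappings)
}
```

Because the functions are generic over `Write` and `Read`, the same code can be exercised against a `Vec<u8>` in memory or a `File` on disk. Writing two inserts for the same external ID and then loading yields a single entry pointing at the second internal ID.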
Obviously, this structure can grow forever if we're not careful. That's why it relies on Qdrant's optimizers. Once the ID tracker collects too many changes, the optimizer will pick it up and create a new ID tracker. The new ID tracker will start from scratch, dropping all the garbage we had collected along the way.
The new ID tracker is ported from our simple ID tracker. What changed in this type is the backing storage - now using simple files.
I'm implementing this one step at a time. Tests are in the next PR, and there's more to come. Please see the tracking issue for more information.
All Submissions:
* … dev branch. Did you create your branch from dev?

New Feature Submissions:
* … cargo +nightly fmt --all command prior to submission?
* … cargo clippy --all --all-features command?

Changes to Core Features:
* Have you written new tests for your core changes, as applicable?