BM25 in the Core #6891
Conversation
📝 Walkthrough
This set of changes introduces a comprehensive refactor and extension of the text tokenization and inference logic in the codebase. The main updates include replacing the legacy …
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
    config: TokenizerConfig,
}

impl Tokenizer {
-    pub fn new(params: &TextIndexParams) -> Self {
+    pub fn new_from_text_index_params(params: &TextIndexParams) -> Self {
Might be worth considering this as a From<&TextIndexParams> for Tokenizer implementation.
I decided not to implement From<> here, because it adds indirection and makes it less clear how to instantiate new Tokenizers.
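For illustration only, a rough sketch of the conversion the reviewer suggested, under the assumption that it would simply delegate to the constructor introduced in this PR (this is not code from the PR itself):

// Hypothetical sketch of the suggested From conversion; it would just forward
// to the explicit constructor, so no extra logic would be involved.
impl From<&TextIndexParams> for Tokenizer {
    fn from(params: &TextIndexParams) -> Self {
        Tokenizer::new_from_text_index_params(params)
    }
}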
* Refactor token processing in TokenizerConfig
* fix max length checking
* Add Bm25
* Execute BM25 if config available
* cargo format
* Add mocked tests for inference and bm25
* Properly apply InferenceType
* Fix tests
* Review remarks
* Review remarks
* Add overwritten model name fix again
* use enum instead of multiple constants
* ensure handling all fields of InferenceInput in infer_local
---------
Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
c9f3626 to e55d493 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/common/inference/service.rs (4)
118-162: Well-structured hybrid inference implementation.
The partitioning logic cleanly separates local and remote inputs while maintaining position tracking for correct result ordering. The early return optimization for local-only scenarios is a nice touch.
Consider memory optimization for large input sets:
-        let (
-            (local_inference_inputs, local_inference_positions),
-            (remote_inference_inputs, remote_inference_positions),
-        ): ((Vec<_>, Vec<_>), (Vec<_>, Vec<_>)) = inference_inputs
-            .into_iter()
-            // Keep track of the input's positions so we can properly merge them together later.
-            .enumerate()
-            .partition_map(|(pos, input)| {
-                // Check if input is targeting a local model or the configured remote server.
-                if local_model::is_local_model(&input.model) {
-                    Either::Left((input, pos))
-                } else {
-                    Either::Right((input, pos))
-                }
-            });
+        let mut local_inference_inputs = Vec::new();
+        let mut local_inference_positions = Vec::new();
+        let mut remote_inference_inputs = Vec::new();
+        let mut remote_inference_positions = Vec::new();
+
+        for (pos, input) in inference_inputs.into_iter().enumerate() {
+            if local_model::is_local_model(&input.model) {
+                local_inference_inputs.push(input);
+                local_inference_positions.push(pos);
+            } else {
+                remote_inference_inputs.push(input);
+                remote_inference_positions.push(pos);
+            }
+        }

This reduces temporary allocations for large input sets.
216-242: Solid merging logic with good optimizations.
The early return for remote-only results and delegation to a helper function are well-designed. Only preserving remote usage makes sense since local inference is free.
Improve the error message for better debugging:
-            .expect(
-                "Expected local results and remote items being contiguous. This is an internal bug!",
-            );
+            .expect(
+                "Failed to merge local and remote results: position sequences must be contiguous and cover all input indices. This indicates a bug in the partitioning logic.",
+            );
306-328: Clever merging algorithm with good validation.
The position-based merging with contiguity checking is well-implemented. The generic design makes it reusable.
Consider adding validation to prevent position overlaps between left and right iterables:
 fn merge_position_items<I>(
     left: impl IntoIterator<Item = I>,
     left_pos: Vec<usize>,
     right: impl IntoIterator<Item = I>,
     right_pos: Vec<usize>,
 ) -> Option<Vec<I>> {
+    // Validate that left and right positions don't overlap
+    let left_set: std::collections::HashSet<_> = left_pos.iter().collect();
+    let right_set: std::collections::HashSet<_> = right_pos.iter().collect();
+    if !left_set.is_disjoint(&right_set) {
+        return None;
+    }
+
     let left_iter = left.into_iter().zip(left_pos);
     let right_iter = right.into_iter().zip(right_pos);

This would catch bugs where the same position appears in both sets.
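For reference, a self-contained sketch of what position-based merging with a contiguity check can look like; the signature matches the diff above, but the body is illustrative rather than the crate's actual implementation:

// Illustrative only: merge two item streams back into their original order,
// failing if the combined positions are not exactly 0..n.
fn merge_position_items<I>(
    left: impl IntoIterator<Item = I>,
    left_pos: Vec<usize>,
    right: impl IntoIterator<Item = I>,
    right_pos: Vec<usize>,
) -> Option<Vec<I>> {
    let mut merged: Vec<(usize, I)> = left
        .into_iter()
        .zip(left_pos)
        .chain(right.into_iter().zip(right_pos))
        .map(|(item, pos)| (pos, item))
        .collect();
    merged.sort_unstable_by_key(|(pos, _)| *pos);
    // Contiguity check: after sorting, positions must be exactly 0, 1, ..., n-1.
    if merged.iter().enumerate().any(|(expected, (pos, _))| *pos != expected) {
        return None;
    }
    Some(merged.into_iter().map(|(_, item)| item).collect())
}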
330-516: Excellent comprehensive test coverage.
The test suite covers all the key scenarios - merge logic, local-only, remote-only, and mixed inference. The use of randomization and mock HTTP servers provides robust validation.
Consider making the constant more descriptive:
-const BM25_LOCAL_MODEL_NAME: &str = "bm25";
+const BM25_LOCAL_MODEL_NAME: &str = "bm25"; // Local BM25 model identifier for testing

Or move it to a more central location if used across multiple test files.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (13)
Cargo.toml (2 hunks)
lib/collection/src/operations/point_ops.rs (2 hunks)
lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (1 hunks)
lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (2 hunks)
lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (2 hunks)
lib/segment/src/index/field_index/full_text_index/tokenizers/config.rs (1 hunks)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (22 hunks)
src/common/inference/bm25.rs (1 hunks)
src/common/inference/inference_input.rs (1 hunks)
src/common/inference/local_model.rs (1 hunks)
src/common/inference/mod.rs (1 hunks)
src/common/inference/service.rs (5 hunks)
✅ Files skipped from review due to trivial changes (1)
- Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (10)
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
- lib/segment/src/index/field_index/full_text_index/mod.rs
- lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs
- src/common/inference/mod.rs
- src/common/inference/local_model.rs
- lib/collection/src/operations/point_ops.rs
- lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs
- src/common/inference/inference_input.rs
- src/common/inference/bm25.rs
- lib/segment/src/index/field_index/full_text_index/tokenizers/config.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: in test code for qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified id ...
Learnt from: generall
PR: qdrant/qdrant#6854
File: lib/segment/src/index/query_estimator.rs:320-327
Timestamp: 2025-07-11T11:35:21.549Z
Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.
Applied to files:
src/common/inference/service.rs
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs
📚 Learning: in the full text inverted index implementations (mutableinvertedindex, immutableinvertedindex, mmapi...
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Applied to files:
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: lint
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: storage-compat-test
- GitHub Check: e2e-tests
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (8)
src/common/inference/service.rs (2)
9-16: LGTM! Clean import organization for hybrid inference.
The new imports support the hybrid inference architecture well - itertools for partitioning and merging operations, and the centralized InferenceInput re-export improves modularity.
164-214: Clean extraction of remote inference logic.
The method maintains all existing behavior while improving code organization. Error handling and response processing are correctly preserved.
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (6)
1-11: Good module organization improvements.
The addition of a dedicated config module and public re-exports creates a cleaner API. The Arc import supports efficient shared ownership of the stopwords filter.
15-131: Excellent centralization of token processing logic.
Moving token filtering, lowercasing, and stemming into config.process_token() creates cleaner, more maintainable code. The special handling of min_token_len in tokenize_query for prefix matching is correctly preserved.
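To make the described centralization concrete, here is a minimal sketch of what a process_token-style method can look like; the struct name, fields, and Cow-based signature are assumptions for illustration (stemming omitted for brevity) and are not taken from the Qdrant sources:

use std::borrow::Cow;
use std::collections::HashSet;

// Hypothetical config carrying the shared token-processing options.
struct ExampleTokenizerConfig {
    lowercase: bool,
    min_token_len: Option<usize>,
    max_token_len: Option<usize>,
    stopwords: HashSet<String>,
}

impl ExampleTokenizerConfig {
    // Lowercase, filter, and length-check a single token; None means "drop it".
    fn process_token<'a>(&self, token: &'a str) -> Option<Cow<'a, str>> {
        let token: Cow<'a, str> = if self.lowercase && token.chars().any(|c| c.is_uppercase()) {
            Cow::Owned(token.to_lowercase())
        } else {
            Cow::Borrowed(token)
        };
        if self.stopwords.contains(token.as_ref()) {
            return None;
        }
        let len = token.chars().count();
        if self.min_token_len.is_some_and(|min| len < min)
            || self.max_token_len.is_some_and(|max| len > max)
        {
            return None;
        }
        Some(token)
    }
}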
153-192: Well-designed constructor refactoring.
The two-step construction pattern (config building + tokenizer creation) improves separation of concerns. Using Arc for shared ownership of the stopwords filter is efficient. The descriptive method name new_from_text_index_params clearly indicates its purpose.
194-219: Perfect API preservation during refactoring.
The public methods maintain their exact signatures while internally using the new config-based approach. This ensures no breaking changes for existing users.
221-712: Comprehensive and correct test updates.
All tests have been properly updated to use the new TokenizerConfig-based API. The explicit configuration construction ensures deterministic test behavior, and all assertions validate the same functionality as before.
133-151: LGTM! Helper functions remain well-designed.
The truncate_cow functions continue to serve their purpose efficiently, preserving the Borrowed/Owned state. The documentation clearly explains when to use each variant.
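As an illustration of that Borrowed/Owned-preserving behavior, a minimal sketch of such a helper is shown here; the actual functions in the crate may have different signatures:

use std::borrow::Cow;

// Hypothetical helper: truncate to at most `max_chars` characters without
// allocating when the input is already short enough, and without turning a
// Borrowed value into an Owned one.
fn truncate_cow(input: Cow<'_, str>, max_chars: usize) -> Cow<'_, str> {
    // Byte index where the (max_chars + 1)-th character starts, if it exists.
    let cut = input.char_indices().nth(max_chars).map(|(idx, _)| idx);
    match cut {
        None => input, // already short enough; Borrowed stays Borrowed
        Some(byte_idx) => match input {
            Cow::Borrowed(s) => Cow::Borrowed(&s[..byte_idx]),
            Cow::Owned(mut s) => {
                s.truncate(byte_idx);
                Cow::Owned(s)
            }
        },
    }
}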
Renamed
Actionable comments posted: 2
♻️ Duplicate comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)
160-160: Constructor naming aligns with decoupling objective
The rename from new to new_from_text_index_params clearly indicates the construction path and supports the PR's goal of decoupling the tokenizer from TextIndexParams.
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/tokens_processor.rs (1)
7-8: Track TODO comments for future cleanup
The TODO comments indicate temporary design decisions due to rocksdb. Consider creating a tracking issue to ensure these are addressed when rocksdb is removed.
Would you like me to create a GitHub issue to track the removal of Clone and the Arc wrapper once rocksdb is removed?
Also applies to: 11-11
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs (6 hunks)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (26 hunks)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (5 hunks)
lib/segment/src/index/field_index/full_text_index/tokenizers/tokens_processor.rs (1 hunks)
src/common/inference/bm25.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/common/inference/bm25.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: in the full text inverted index implementations (mutableinvertedindex, immutableinvertedindex, mmapi...
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Applied to files:
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs
📚 Learning: in test code for qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified id ...
Learnt from: generall
PR: qdrant/qdrant#6854
File: lib/segment/src/index/query_estimator.rs:320-327
Timestamp: 2025-07-11T11:35:21.549Z
Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.
Applied to files:
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: lint
- GitHub Check: e2e-tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: integration-tests
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: storage-compat-test
🔇 Additional comments (5)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (4)
6-6: Inconsistency with PR objectives
The PR description states the intention is to "utilize the existing TokenizerConfig" to decouple the tokenizer from TextIndexParams. However, this change replaces TokenizerConfig with a new TokensProcessor abstraction, which contradicts the stated objective.
Likely an incorrect or invalid review comment.
22-26: Method signature update looks good
The parameter rename from config to tokens_processor is more descriptive and accurately reflects the parameter's purpose.
51-61: Excellent refactoring of token processing logic
The centralization of lowercasing, stemming, and stopword filtering into tokens_processor.process_token_cow improves maintainability and ensures consistent token processing across all tokenizers.
182-193: Test updates correctly reflect the new API
The test properly demonstrates the new TokensProcessor initialization with all required parameters for stemming functionality.
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)
59-77: Correct handling of max length check in prefix tokenization
The use of check_max_len=false in line 63 is appropriate for prefix tokenization, as the n-gram generation logic needs to create tokens of various lengths up to the configured maximum.
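To illustrate the reasoning (not the actual Qdrant implementation): a prefix tokenizer already bounds every emitted n-gram by max_token_len itself, so an additional max-length check on the source word would be redundant. A rough sketch, with a hypothetical helper name:

// Emit every prefix of `word` whose character length lies in [min_len, max_len].
fn prefix_ngrams(word: &str, min_len: usize, max_len: usize, mut emit: impl FnMut(&str)) {
    for (count, (byte_idx, ch)) in word.char_indices().enumerate() {
        let prefix_chars = count + 1;
        if prefix_chars < min_len {
            continue;
        }
        if prefix_chars > max_len {
            break;
        }
        // The current prefix ends at the end of this character.
        emit(&word[..byte_idx + ch.len_utf8()]);
    }
}

// For example, "integration" with min_len = 3 and max_len = 4 yields "int" and
// "inte", matching the pattern asserted in the test quoted below.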
tokens_processor.min_token_len = Some(3);
tokens_processor.max_token_len = Some(4);
PrefixTokenizer::tokenize(input, &tokens_processor, |i| out.push(i.to_string()));
assert_eq!(out, vec!["int", "inte", "pro", "proc", "liv", "live"]);
🛠️ Refactor suggestion
Avoid direct field modification in tests
Instead of modifying TokensProcessor fields directly, create a new instance with the desired configuration:
out.clear();
-tokens_processor.min_token_len = Some(3);
-tokens_processor.max_token_len = Some(4);
-PrefixTokenizer::tokenize(input, &tokens_processor, |i| out.push(i.to_string()));
+let tokens_processor_prefix = TokensProcessor::new(
+ true,
+ Default::default(),
+ Some(make_stemmer(SnowballLanguage::English)),
+ Some(3),
+ Some(4),
+);
+PrefixTokenizer::tokenize(input, &tokens_processor_prefix, |i| out.push(i.to_string()));
This maintains immutability and makes the test more explicit about its configuration.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-tokens_processor.min_token_len = Some(3);
-tokens_processor.max_token_len = Some(4);
-PrefixTokenizer::tokenize(input, &tokens_processor, |i| out.push(i.to_string()));
-assert_eq!(out, vec!["int", "inte", "pro", "proc", "liv", "live"]);
+out.clear();
+let tokens_processor_prefix = TokensProcessor::new(
+    true,
+    Default::default(),
+    Some(make_stemmer(SnowballLanguage::English)),
+    Some(3),
+    Some(4),
+);
+PrefixTokenizer::tokenize(input, &tokens_processor_prefix, |i| out.push(i.to_string()));
+assert_eq!(out, vec!["int", "inte", "pro", "proc", "liv", "live"]);
🤖 Prompt for AI Agents
In lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs around
lines 711 to 714, avoid directly modifying the fields of the TokensProcessor
instance in the test. Instead, create a new TokensProcessor instance initialized
with the desired min_token_len and max_token_len values to maintain immutability
and make the test configuration explicit.
lib/segment/src/index/field_index/full_text_index/tokenizers/tokens_processor.rs
Outdated
* Uncouple tokenizer and TextIndexParams
* Refactor token processing in `TokenizerConfig` (#6912)
* Refactor token processing in TokenizerConfig
* fix max length checking
* [Bm25] Execution implementation (#6939)
* Add Bm25
* Execute BM25 if config available
* cargo format
* Add mocked tests for inference and bm25
* Properly apply InferenceType
* Fix tests
* Review remarks
* Review remarks
* Add overwritten model name fix again
* use enum instead of multiple constants
* ensure handling all fields of InferenceInput in infer_local
---------
Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
* review fixes
* fmt
* spell-check
* deduplicate code
---------
Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>
Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
Co-authored-by: Andrey Vasnetsov <andrey@vasnetsov.com>
This PR uncouples our Tokenizer from TextIndexParams using the already existing TokenizerConfig. This is required to create tokenizers from a different context than text indices.
No logic changes introduced.
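As a rough sketch of what this decoupling enables (only new_from_text_index_params is confirmed by the review threads above; the second constructor name and the config helper are hypothetical placeholders):

// Path 1: construction from full-text index parameters, as before.
let tokenizer = Tokenizer::new_from_text_index_params(&text_index_params);

// Path 2: construction from a standalone TokenizerConfig, e.g. for local BM25
// inference that runs outside of any text index.
let config = tokenizer_config_for_bm25(); // hypothetical helper returning a TokenizerConfig
let bm25_tokenizer = Tokenizer::new_from_config(config); // hypothetical constructor name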