BM25 in the Core #6891
Conversation
📝 Walkthrough
This set of changes introduces a comprehensive refactor and extension of the text tokenization and inference logic in the codebase. The main updates include replacing the legacy …
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
    config: TokenizerConfig,
}

impl Tokenizer {
-    pub fn new(params: &TextIndexParams) -> Self {
+    pub fn new_from_text_index_params(params: &TextIndexParams) -> Self {
Might be worth considering this as a From<&TextIndexParams> for Tokenizer implementation.
I decided not to implement From<> here, because it adds indirection and makes it less clear how to instantiate new Tokenizers.
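For illustration only, a rough sketch of the conversion the reviewer suggested, under the assumption that it would simply delegate to the constructor introduced in this PR (this is not code from the PR itself):

// Hypothetical sketch of the suggested From conversion; it would just forward
// to the explicit constructor, so no extra logic would be involved.
impl From<&TextIndexParams> for Tokenizer {
    fn from(params: &TextIndexParams) -> Self {
        Tokenizer::new_from_text_index_params(params)
    }
}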
* Refactor token processing in TokenizerConfig
* fix max length checking
* Add Bm25
* Execute BM25 if config available
* cargo format
* Add mocked tests for inference and bm25
* Properly apply InferenceType
* Fix tests
* Review remarks
* Review remarks
* Add overwritten model name fix again
* use enum instead of multiple constants
* ensure handling all fields of InferenceInput in infer_local
---------
Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
c9f3626 to e55d493 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/common/inference/service.rs (4)
118-162: Well-structured hybrid inference implementation.
The partitioning logic cleanly separates local and remote inputs while maintaining position tracking for correct result ordering. The early return optimization for local-only scenarios is a nice touch.
Consider memory optimization for large input sets:
-        let (
-            (local_inference_inputs, local_inference_positions),
-            (remote_inference_inputs, remote_inference_positions),
-        ): ((Vec<_>, Vec<_>), (Vec<_>, Vec<_>)) = inference_inputs
-            .into_iter()
-            // Keep track of the input's positions so we can properly merge them together later.
-            .enumerate()
-            .partition_map(|(pos, input)| {
-                // Check if input is targeting a local model or the configured remote server.
-                if local_model::is_local_model(&input.model) {
-                    Either::Left((input, pos))
-                } else {
-                    Either::Right((input, pos))
-                }
-            });
+        let mut local_inference_inputs = Vec::new();
+        let mut local_inference_positions = Vec::new();
+        let mut remote_inference_inputs = Vec::new();
+        let mut remote_inference_positions = Vec::new();
+
+        for (pos, input) in inference_inputs.into_iter().enumerate() {
+            if local_model::is_local_model(&input.model) {
+                local_inference_inputs.push(input);
+                local_inference_positions.push(pos);
+            } else {
+                remote_inference_inputs.push(input);
+                remote_inference_positions.push(pos);
+            }
+        }

This reduces temporary allocations for large input sets.
216-242: Solid merging logic with good optimizations.
The early return for remote-only results and delegation to a helper function are well-designed. Only preserving remote usage makes sense since local inference is free.
Improve the error message for better debugging:
-            .expect(
-                "Expected local results and remote items being contiguous. This is an internal bug!",
-            );
+            .expect(
+                "Failed to merge local and remote results: position sequences must be contiguous and cover all input indices. This indicates a bug in the partitioning logic.",
+            );
306-328: Clever merging algorithm with good validation.
The position-based merging with contiguity checking is well-implemented. The generic design makes it reusable.
Consider adding validation to prevent position overlaps between left and right iterables:
 fn merge_position_items<I>(
     left: impl IntoIterator<Item = I>,
     left_pos: Vec<usize>,
     right: impl IntoIterator<Item = I>,
     right_pos: Vec<usize>,
 ) -> Option<Vec<I>> {
+    // Validate that left and right positions don't overlap
+    let left_set: std::collections::HashSet<_> = left_pos.iter().collect();
+    let right_set: std::collections::HashSet<_> = right_pos.iter().collect();
+    if !left_set.is_disjoint(&right_set) {
+        return None;
+    }
+
     let left_iter = left.into_iter().zip(left_pos);
     let right_iter = right.into_iter().zip(right_pos);

This would catch bugs where the same position appears in both sets.
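For reference, a self-contained sketch of what position-based merging with a contiguity check can look like; the signature matches the diff above, but the body is illustrative rather than the crate's actual implementation:

// Illustrative only: merge two item streams back into their original order,
// failing if the combined positions are not exactly 0..n.
fn merge_position_items<I>(
    left: impl IntoIterator<Item = I>,
    left_pos: Vec<usize>,
    right: impl IntoIterator<Item = I>,
    right_pos: Vec<usize>,
) -> Option<Vec<I>> {
    let mut merged: Vec<(usize, I)> = left
        .into_iter()
        .zip(left_pos)
        .chain(right.into_iter().zip(right_pos))
        .map(|(item, pos)| (pos, item))
        .collect();
    merged.sort_unstable_by_key(|(pos, _)| *pos);
    // Contiguity check: after sorting, positions must be exactly 0, 1, ..., n-1.
    if merged.iter().enumerate().any(|(expected, (pos, _))| *pos != expected) {
        return None;
    }
    Some(merged.into_iter().map(|(_, item)| item).collect())
}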
330-516: Excellent comprehensive test coverage.
The test suite covers all the key scenarios - merge logic, local-only, remote-only, and mixed inference. The use of randomization and mock HTTP servers provides robust validation.
Consider making the constant more descriptive:
-const BM25_LOCAL_MODEL_NAME: &str = "bm25";
+const BM25_LOCAL_MODEL_NAME: &str = "bm25"; // Local BM25 model identifier for testing

Or move it to a more central location if used across multiple test files.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (13)
Cargo.toml (2 hunks)
lib/collection/src/operations/point_ops.rs (2 hunks)
lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (1 hunks)
lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (2 hunks)
lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (2 hunks)
lib/segment/src/index/field_index/full_text_index/tokenizers/config.rs (1 hunks)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (22 hunks)
src/common/inference/bm25.rs (1 hunks)
src/common/inference/inference_input.rs (1 hunks)
src/common/inference/local_model.rs (1 hunks)
src/common/inference/mod.rs (1 hunks)
src/common/inference/service.rs (5 hunks)
✅ Files skipped from review due to trivial changes (1)
- Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (10)
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
- lib/segment/src/index/field_index/full_text_index/mod.rs
- lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs
- src/common/inference/mod.rs
- src/common/inference/local_model.rs
- lib/collection/src/operations/point_ops.rs
- lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs
- src/common/inference/inference_input.rs
- src/common/inference/bm25.rs
- lib/segment/src/index/field_index/full_text_index/tokenizers/config.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: in test code for qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified id ...
Learnt from: generall
PR: qdrant/qdrant#6854
File: lib/segment/src/index/query_estimator.rs:320-327
Timestamp: 2025-07-11T11:35:21.549Z
Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.
Applied to files:
src/common/inference/service.rs
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs
📚 Learning: in the full text inverted index implementations (mutableinvertedindex, immutableinvertedindex, mmapi...
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Applied to files:
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: lint
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: storage-compat-test
- GitHub Check: e2e-tests
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (8)
src/common/inference/service.rs (2)
9-16: LGTM! Clean import organization for hybrid inference.
The new imports support the hybrid inference architecture well - itertools for partitioning and merging operations, and the centralized InferenceInput re-export improves modularity.
164-214: Clean extraction of remote inference logic.
The method maintains all existing behavior while improving code organization. Error handling and response processing are correctly preserved.
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (6)
1-11: Good module organization improvements.
The addition of a dedicated config module and public re-exports creates a cleaner API. The Arc import supports efficient shared ownership of the stopwords filter.
15-131: Excellent centralization of token processing logic.
Moving token filtering, lowercasing, and stemming into config.process_token() creates cleaner, more maintainable code. The special handling of min_token_len in tokenize_query for prefix matching is correctly preserved.
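To make the described centralization concrete, here is a minimal sketch of what a process_token-style method can look like; the struct name, fields, and Cow-based signature are assumptions for illustration (stemming omitted for brevity) and are not taken from the Qdrant sources:

use std::borrow::Cow;
use std::collections::HashSet;

// Hypothetical config carrying the shared token-processing options.
struct ExampleTokenizerConfig {
    lowercase: bool,
    min_token_len: Option<usize>,
    max_token_len: Option<usize>,
    stopwords: HashSet<String>,
}

impl ExampleTokenizerConfig {
    // Lowercase, filter, and length-check a single token; None means "drop it".
    fn process_token<'a>(&self, token: &'a str) -> Option<Cow<'a, str>> {
        let token: Cow<'a, str> = if self.lowercase && token.chars().any(|c| c.is_uppercase()) {
            Cow::Owned(token.to_lowercase())
        } else {
            Cow::Borrowed(token)
        };
        if self.stopwords.contains(token.as_ref()) {
            return None;
        }
        let len = token.chars().count();
        if self.min_token_len.is_some_and(|min| len < min)
            || self.max_token_len.is_some_and(|max| len > max)
        {
            return None;
        }
        Some(token)
    }
}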
153-192: Well-designed constructor refactoring.
The two-step construction pattern (config building + tokenizer creation) improves separation of concerns. Using Arc for shared ownership of the stopwords filter is efficient. The descriptive method name new_from_text_index_params clearly indicates its purpose.
194-219: Perfect API preservation during refactoring.
The public methods maintain their exact signatures while internally using the new config-based approach. This ensures no breaking changes for existing users.
221-712: Comprehensive and correct test updates.
All tests have been properly updated to use the new TokenizerConfig-based API. The explicit configuration construction ensures deterministic test behavior, and all assertions validate the same functionality as before.
133-151: LGTM! Helper functions remain well-designed.
The truncate_cow functions continue to serve their purpose efficiently, preserving the Borrowed/Owned state. The documentation clearly explains when to use each variant.
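As an illustration of that Borrowed/Owned-preserving behavior, a minimal sketch of such a helper is shown here; the actual functions in the crate may have different signatures:

use std::borrow::Cow;

// Hypothetical helper: truncate to at most `max_chars` characters without
// allocating when the input is already short enough, and without turning a
// Borrowed value into an Owned one.
fn truncate_cow(input: Cow<'_, str>, max_chars: usize) -> Cow<'_, str> {
    // Byte index where the (max_chars + 1)-th character starts, if it exists.
    let cut = input.char_indices().nth(max_chars).map(|(idx, _)| idx);
    match cut {
        None => input, // already short enough; Borrowed stays Borrowed
        Some(byte_idx) => match input {
            Cow::Borrowed(s) => Cow::Borrowed(&s[..byte_idx]),
            Cow::Owned(mut s) => {
                s.truncate(byte_idx);
                Cow::Owned(s)
            }
        },
    }
}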
Renamed
Actionable comments posted: 2
♻️ Duplicate comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)
160-160: Constructor naming aligns with decoupling objective
The rename from new to new_from_text_index_params clearly indicates the construction path and supports the PR's goal of decoupling the tokenizer from TextIndexParams.
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/tokens_processor.rs (1)
7-8: Track TODO comments for future cleanup
The TODO comments indicate temporary design decisions due to rocksdb. Consider creating a tracking issue to ensure these are addressed when rocksdb is removed.
Would you like me to create a GitHub issue to track the removal of Clone and the Arc wrapper once rocksdb is removed?
Also applies to: 11-11
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs (6 hunks)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (26 hunks)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (5 hunks)
lib/segment/src/index/field_index/full_text_index/tokenizers/tokens_processor.rs (1 hunks)
src/common/inference/bm25.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/common/inference/bm25.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: in the full text inverted index implementations (mutableinvertedindex, immutableinvertedindex, mmapi...
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Applied to files:
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs
📚 Learning: in test code for qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified id ...
Learnt from: generall
PR: qdrant/qdrant#6854
File: lib/segment/src/index/query_estimator.rs:320-327
Timestamp: 2025-07-11T11:35:21.549Z
Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.
Applied to files:
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: lint
- GitHub Check: e2e-tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: integration-tests
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: storage-compat-test
🔇 Additional comments (5)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (4)
6-6: Inconsistency with PR objectives
The PR description states the intention is to "utilize the existing TokenizerConfig" to decouple the tokenizer from TextIndexParams. However, this change replaces TokenizerConfig with a new TokensProcessor abstraction, which contradicts the stated objective.
Likely an incorrect or invalid review comment.
22-26: Method signature update looks good
The parameter rename from config to tokens_processor is more descriptive and accurately reflects the parameter's purpose.
51-61: Excellent refactoring of token processing logic
The centralization of lowercasing, stemming, and stopword filtering into tokens_processor.process_token_cow improves maintainability and ensures consistent token processing across all tokenizers.
182-193: Test updates correctly reflect the new API
The test properly demonstrates the new TokensProcessor initialization with all required parameters for stemming functionality.
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)
59-77: Correct handling of max length check in prefix tokenization
The use of check_max_len=false in line 63 is appropriate for prefix tokenization, as the n-gram generation logic needs to create tokens of various lengths up to the configured maximum.
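To illustrate the reasoning (not the actual Qdrant implementation): a prefix tokenizer already bounds every emitted n-gram by max_token_len itself, so an additional max-length check on the source word would be redundant. A rough sketch, with a hypothetical helper name:

// Emit every prefix of `word` whose character length lies in [min_len, max_len].
fn prefix_ngrams(word: &str, min_len: usize, max_len: usize, mut emit: impl FnMut(&str)) {
    for (count, (byte_idx, ch)) in word.char_indices().enumerate() {
        let prefix_chars = count + 1;
        if prefix_chars < min_len {
            continue;
        }
        if prefix_chars > max_len {
            break;
        }
        // The current prefix ends at the end of this character.
        emit(&word[..byte_idx + ch.len_utf8()]);
    }
}

// For example, "integration" with min_len = 3 and max_len = 4 yields "int" and
// "inte", matching the pattern asserted in the test quoted below.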
tokens_processor.min_token_len = Some(3);
tokens_processor.max_token_len = Some(4);
PrefixTokenizer::tokenize(input, &tokens_processor, |i| out.push(i.to_string()));
assert_eq!(out, vec!["int", "inte", "pro", "proc", "liv", "live"]);
🛠️ Refactor suggestion
Avoid direct field modification in tests
Instead of modifying TokensProcessor fields directly, create a new instance with the desired configuration:
out.clear();
-tokens_processor.min_token_len = Some(3);
-tokens_processor.max_token_len = Some(4);
-PrefixTokenizer::tokenize(input, &tokens_processor, |i| out.push(i.to_string()));
+let tokens_processor_prefix = TokensProcessor::new(
+ true,
+ Default::default(),
+ Some(make_stemmer(SnowballLanguage::English)),
+ Some(3),
+ Some(4),
+);
+PrefixTokenizer::tokenize(input, &tokens_processor_prefix, |i| out.push(i.to_string()));
This maintains immutability and makes the test more explicit about its configuration.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-tokens_processor.min_token_len = Some(3);
-tokens_processor.max_token_len = Some(4);
-PrefixTokenizer::tokenize(input, &tokens_processor, |i| out.push(i.to_string()));
-assert_eq!(out, vec!["int", "inte", "pro", "proc", "liv", "live"]);
+out.clear();
+let tokens_processor_prefix = TokensProcessor::new(
+    true,
+    Default::default(),
+    Some(make_stemmer(SnowballLanguage::English)),
+    Some(3),
+    Some(4),
+);
+PrefixTokenizer::tokenize(input, &tokens_processor_prefix, |i| out.push(i.to_string()));
+assert_eq!(out, vec!["int", "inte", "pro", "proc", "liv", "live"]);
🤖 Prompt for AI Agents
In lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs around
lines 711 to 714, avoid directly modifying the fields of the TokensProcessor
instance in the test. Instead, create a new TokensProcessor instance initialized
with the desired min_token_len and max_token_len values to maintain immutability
and make the test configuration explicit.
lib/segment/src/index/field_index/full_text_index/tokenizers/tokens_processor.rs
Outdated
* Uncouple tokenizer and TextIndexParams
* Refactor token processing in `TokenizerConfig` (#6912)
* Refactor token processing in TokenizerConfig
* fix max length checking
* [Bm25] Execution implementation (#6939)
* Add Bm25
* Execute BM25 if config available
* cargo format
* Add mocked tests for inference and bm25
* Properly apply InferenceType
* Fix tests
* Review remarks
* Review remarks
* Add overwritten model name fix again
* use enum instead of multiple constants
* ensure handling all fields of InferenceInput in infer_local
---------
Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
* review fixes
* fmt
* spell-check
* deduplicate code
---------
Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>
Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
Co-authored-by: Andrey Vasnetsov <andrey@vasnetsov.com>
This PR uncouples our Tokenizer from TextIndexParams using the already existing TokenizerConfig. This is required to create tokenizers from a different context than text indices.
No logic changes introduced.
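As a rough sketch of what this decoupling enables (only new_from_text_index_params is confirmed by the review threads above; the second constructor name and the config helper are hypothetical placeholders):

// Path 1: construction from full-text index parameters, as before.
let tokenizer = Tokenizer::new_from_text_index_params(&text_index_params);

// Path 2: construction from a standalone TokenizerConfig, e.g. for local BM25
// inference that runs outside of any text index.
let config = tokenizer_config_for_bm25(); // hypothetical helper returning a TokenizerConfig
let bm25_tokenizer = Tokenizer::new_from_config(config); // hypothetical constructor name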