Implement new multilingual tokenizer #6762
Conversation
📝 Walkthrough

This change refactors the text tokenization pipeline to consistently use `Cow<str>`.
Actionable comments posted: 1
♻️ Duplicate comments (1)

lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)

116-116: Fix typo: "returing" → "returning"
🧹 Nitpick comments (2)

lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)

383-384: Track the Chinese stopwords bug

The TODO comment indicates that Chinese stopwords must be in hanzi, or the stopword list must be in pinyin. This is a significant limitation that affects Chinese text processing.

Would you like me to create a GitHub issue to track this bug and propose a solution for properly handling Chinese stopwords?
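For illustration, the mismatch looks roughly like this (the hanzi/pinyin values are assumed for the example, not taken from the codebase):

```rust
fn main() {
    // Assumed example: Chinese normalization can emit pinyin lemmas,
    // while a user-supplied stopword list may contain hanzi.
    let stopwords = ["的"]; // stopword list in hanzi
    let normalized_token = "de"; // the same word as a normalized pinyin token

    // The lookup misses, so the stopword silently slips through the filter.
    assert!(!stopwords.contains(&normalized_token));
}
```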
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)

157-164: Consider optimizing `truncate_cow_ref` for owned strings

The current implementation creates a new string when truncating an owned `Cow`. Since we're already dealing with an owned string, we could potentially optimize this. However, since we only have a reference to the `Cow`, we cannot mutate the owned string in place. The current implementation is correct given the constraints. Consider documenting why `truncate_cow` should be preferred when possible.

```diff
 /// Truncates a string inside a `&Cow<str>` to the given `len` preserving the `Borrowed` and `Owned` state.
 /// `truncate_cow` should be preferred over this function if Cow doesn't need to be passed as reference.
+/// Note: When the Cow is Owned, this creates a new allocation as we cannot mutate through a reference.
 fn truncate_cow_ref<'a>(inp: &Cow<'a, str>, len: usize) -> Cow<'a, str> {
```
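For context, a minimal sketch of what such a helper plausibly looks like, reconstructed from the signature quoted above (not the code under review; it assumes `len` falls on a UTF-8 character boundary, since slicing panics otherwise):

```rust
use std::borrow::Cow;

// Hypothetical sketch of a `truncate_cow_ref`-style helper.
fn truncate_cow_ref<'a>(inp: &Cow<'a, str>, len: usize) -> Cow<'a, str> {
    match inp {
        // Borrowed data keeps the original lifetime 'a, so truncation is
        // a free sub-slice with no allocation.
        Cow::Borrowed(s) => {
            let s: &'a str = s; // copy the inner &'a str out of the Cow
            Cow::Borrowed(&s[..len.min(s.len())])
        }
        // Owned data behind a shared reference cannot be truncated in
        // place, so a new allocation is unavoidable here.
        Cow::Owned(s) => Cow::Owned(s[..len.min(s.len())].to_string()),
    }
}
```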
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)

- `Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (8)

- `Cargo.toml` (9 hunks)
- `lib/segment/Cargo.toml` (1 hunks)
- `lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs` (1 hunks)
- `lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs` (2 hunks)
- `lib/segment/src/index/field_index/full_text_index/text_index.rs` (5 hunks)
- `lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs` (4 hunks)
- `lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs` (18 hunks)
- `lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs` (4 hunks)
🧰 Additional context used
🪛 GitHub Actions: Codespell
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs
[error] 116-116: codespell detected a typo: 'returing' should be 'returning'. Command failed with exit code 65.
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: storage-compat-test
- GitHub Check: test-low-resources
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: lint
- GitHub Check: integration-tests
- GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (15)
lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (1)

137-139: LGTM: Efficient iteration and token handling.

The change from consuming `values` to iterating by reference (`&values`) is a good optimization that avoids unnecessary cloning while allowing the tokenizer to provide `Cow<str>` tokens directly. This aligns well with the broader refactoring to use copy-on-write semantics throughout the tokenization pipeline.

lib/segment/Cargo.toml (1)

106-108: LGTM: Simplified feature management.

Moving the Chinese-related features directly to the `charabia` dependency simplifies the feature configuration by removing the indirection through feature groups. This makes the dependency requirements more explicit and easier to maintain.

lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (2)
1-1: LGTM: Proper import for Cow usage.

The addition of the `std::borrow::Cow` import is necessary for the new token representation using copy-on-write semantics.
239-243: LGTM: Efficient token handling with Cow.

The changes properly implement the Cow refactoring:

- The type declaration correctly specifies `Vec<Cow<str>>` for flexible ownership
- Iteration by reference (`&values`) avoids unnecessary cloning
- The tokenizer callback can now push `Cow<str>` tokens directly

This should improve performance by reducing string allocations while maintaining the same functional behavior.
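As a rough sketch of the pattern this enables (names hypothetical; the real tokenizer callback is more involved), borrowing the input lets tokens point back into the original strings:

```rust
use std::borrow::Cow;

// Iterate-by-reference pattern: borrowing `values` lets the tokenizer
// emit Cow::Borrowed tokens pointing into the original strings.
fn collect_tokens<'a>(values: &'a [String]) -> Vec<Cow<'a, str>> {
    let mut tokens: Vec<Cow<'a, str>> = Vec::new();
    for value in values {
        // Stand-in for the real tokenizer callback, which pushes
        // Cow<str> tokens directly without cloning.
        for word in value.split_whitespace() {
            tokens.push(Cow::Borrowed(word));
        }
    }
    tokens
}
```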
Cargo.toml (2)

84-89: LGTM: Improved readability with multi-line formatting.

The formatting change to use multi-line arrays for the `features` field improves readability and makes it easier to manage individual features.

337-337: LGTM: Helpful comment for build configuration.

The comment explaining the restoration of the default codegen-units value provides useful context for the performance-testing profile configuration.
lib/segment/src/index/field_index/full_text_index/text_index.rs (5)

1-1: LGTM: Proper import for Cow usage.

The addition of the `std::borrow::Cow` import is necessary for the updated serialization methods using copy-on-write semantics.
245-249: LGTM: Well-implemented Cow serialization.

The updates to `serialize_document` properly integrate `Cow<str>` tokens:

- The function signature correctly accepts `Vec<Cow<str>>` for flexible ownership
- The `StoredDocument` struct properly uses the lifetime-parameterized `Cow<'a, str>`
- This allows efficient serialization of both borrowed and owned strings

The serialization maintains backward compatibility since deserialization still uses `Vec<String>`.
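A hedged sketch of the shape being described (field names assumed; JSON stands in for whatever codec the index actually uses):

```rust
use std::borrow::Cow;
use serde::Serialize;

// Assumed shape of the lifetime-parameterized document: serialization can
// borrow tokens straight from the tokenizer, while deserialization may
// still target an owned Vec<String> for backward compatibility.
#[derive(Serialize)]
struct StoredDocument<'a> {
    tokens: Vec<Cow<'a, str>>,
}

fn serialize_document(tokens: Vec<Cow<'_, str>>) -> Result<Vec<u8>, serde_json::Error> {
    serde_json::to_vec(&StoredDocument { tokens })
}
```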
304-304: LGTM: Proper token conversion for API compatibility.

The `.as_ref()` call correctly converts `Cow<str>` to `&str` to match the expected parameter type for `get_token()`. This maintains API compatibility while leveraging the new token representation.

313-313: LGTM: Consistent token conversion pattern.

The `.as_ref()` call follows the same pattern as other locations, ensuring consistent handling of `Cow<str>` tokens throughout the codebase.

323-323: LGTM: Consistent token handling in document parsing.

The `.as_ref()` call maintains consistency with the token conversion pattern used throughout the file, properly converting `Cow<str>` to `&str` for the `get_token()` API.

lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs (1)

40-50: LGTM! Clean integration of stopword filtering.

The stopword filtering and lowercase parameters are well integrated into the Japanese tokenizer, maintaining consistency with other tokenizers in the codebase.

Also applies to: 58-62

lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (2)

125-129: LGTM! Good workaround for lifetime issues.

The unrolled tokenization approach effectively addresses the lifetime constraints while maintaining functionality. The comment clearly explains why this approach was necessary.

116-123: Optimize `apply_casing` to avoid unnecessary allocations

When `lowercase` is false and the input is already borrowed, you can return it directly without allocation.

```diff
 /// Applies `lowercase` to the given input, returning a cow.
 fn apply_casing<'a>(input: Cow<'a, str>, lowercase: bool) -> Cow<'a, str> {
     if lowercase {
         Cow::Owned(input.to_lowercase())
     } else {
-        input
+        // Preserve the borrowed/owned state of the input
+        input
     }
 }
```

Actually, the current implementation is already optimal for this case. The function correctly preserves the `Cow` state when not lowercasing.

lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)

10-277: Excellent refactoring to use `Cow<str>` throughout!

The consistent use of `Cow<str>` across all tokenizers provides flexibility in ownership and reduces unnecessary allocations. The implementation correctly preserves borrowed strings when possible and only allocates when necessary (e.g., for lowercase conversion).
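The borrow-preserving behavior described here can be illustrated in isolation (a standalone example, not the crate's actual helper):

```rust
use std::borrow::Cow;

fn maybe_lowercase(token: &str, lowercase: bool) -> Cow<'_, str> {
    if lowercase {
        Cow::Owned(token.to_lowercase()) // allocates only on this path
    } else {
        Cow::Borrowed(token) // zero-copy pass-through
    }
}

fn main() {
    // Already-borrowed input stays borrowed when no transform is needed.
    assert!(matches!(maybe_lowercase("Qdrant", false), Cow::Borrowed(_)));
    // Lowercasing is the only case that pays for an allocation.
    assert!(matches!(maybe_lowercase("Qdrant", true), Cow::Owned(_)));
}
```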
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
117-123: Consider optimizing unnecessary allocations in lowercase transformation.

The current implementation always creates an owned string for lowercase transformation. This could be optimized by checking if the string is already lowercase:

```diff
 fn apply_casing<'a>(input: Cow<'a, str>, lowercase: bool) -> Cow<'a, str> {
     if lowercase {
-        Cow::Owned(input.to_lowercase())
+        let lowercased = input.to_lowercase();
+        if lowercased == *input {
+            input
+        } else {
+            Cow::Owned(lowercased)
+        }
     } else {
         input
     }
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- `lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs` (4 hunks)
- `lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs` (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-consensus-compose
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: integration-tests
- GitHub Check: lint
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (7)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (7)

1-21: LGTM! Clean imports and well-documented normalizer configuration.

The addition of the `Cow` import aligns with the PR objective of efficient string handling, and the `DEFAULT_NORMALIZER` constant properly centralizes the charabia configuration with clear documentation linking to the source.

64-114: Excellent refactoring with proper separation of concerns.

The consolidation of tokenization logic into the `charabia` method reduces code duplication while maintaining flexibility through the `map_token` parameter. The consistent use of `Cow<str>` throughout aligns with the performance objectives.

125-129: Good solution for the lifetime constraint.

The unrolling of `Tokenize::tokenize()` effectively addresses the lifetime issues mentioned in the PR description. The clear comment explains the technical necessity behind this approach.

131-190: Language detection and mapping logic is solid.

The language detection, script detection, and algorithm mapping functions are well-implemented and appropriately handle the supported multilingual languages mentioned in the PR objectives.

233-260: Test updates properly reflect the API changes.

The tests are correctly updated to use the new API with explicit `lowercase` and `stopwords_filter` parameters. The test coverage remains comprehensive for multilingual tokenization scenarios.

47-47: Verify Japanese tokenizer interface compatibility.

The call to `japanese::tokenize` uses the new parameter signature. Ensure the Japanese tokenizer has been updated to accept the `lowercase` and `stopwords_filter` parameters.

```shell
#!/bin/bash
# Description: Verify the Japanese tokenizer interface matches the expected signature
# Expected: japanese::tokenize should accept lowercase and stopwords_filter parameters

echo "Checking Japanese tokenizer interface..."
rg -A 5 "pub fn tokenize" lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs
```

23-61: Verify API compatibility across the codebase.

The method signature has changed significantly by adding explicit `lowercase` and `stopwords_filter` parameters and using `Cow<'a, str>` for the callback. This improves control and performance but represents a breaking change.

Run the following script to ensure all callers have been updated:

```shell
#!/bin/bash
# Description: Find all calls to MultilingualTokenizer::tokenize to verify they use the new signature
# Expected: All calls should pass lowercase and stopwords_filter parameters

echo "Searching for MultilingualTokenizer::tokenize calls..."
rg -A 3 "MultilingualTokenizer::tokenize"

echo "Searching for any remaining MultilingualV2 references..."
rg "MultilingualV2"
```
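As an aside on the `charabia_token_iter` workaround discussed in the 125-129 comment above, the segment + normalize unrolling looks roughly like the following sketch. The charabia trait names, import paths, and `NormalizerOption::default()` are assumptions based on charabia's public API, not the PR's exact code:

```rust
use charabia::normalizer::{Normalize, NormalizerOption};
use charabia::segmenter::Segment;

// Rough sketch of unrolling Tokenize::tokenize() into separate segment
// and normalize steps, which sidesteps the lifetime issue.
fn print_tokens(input: &str) {
    // Stand-in for the PR's DEFAULT_NORMALIZER constant.
    let options = NormalizerOption::default();
    for token in input.segment() {
        let normalized = token.normalize(&options);
        println!("{}", normalized.lemma());
    }
}
```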
Force-pushed from bf28f53 to 3eb829d
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
32-33: Address the TODO comment for stemming configuration.

The hardcoded `stem = false` with a TODO comment suggests this should be configurable. Consider making this a parameter or reading it from configuration.

Would you like me to help implement a configurable stemming option or create an issue to track this enhancement?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- `lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs` (4 hunks)
- `lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs` (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consensus-compose
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-consistency
- GitHub Check: integration-tests
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: lint
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
🔇 Additional comments (7)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (7)

3-4: Good addition of specific imports for the new functionality.

The imports are well-organized and include the necessary components for the new normalization and tokenization approach.

10-21: Excellent refactoring with clear constant definition.

The struct rename from `MultilingualV2` to `MultilingualTokenizer` provides better clarity, and the `DEFAULT_NORMALIZER` constant centralizes configuration options for consistency across tokenization operations.

24-29: Improved API design with explicit parameters.

The new method signature accepting explicit `lowercase` and `stopwords_filter` parameters is more modular and testable compared to the previous `TextIndexParams` dependency.

94-113: Well-structured core tokenization logic.

The `charabia` method effectively centralizes token processing with proper filtering for stopwords and non-alphabetic characters. The use of `Cow<str>` throughout maintains efficiency while providing flexibility.

117-123: Effective casing implementation with Cow optimization.

The `apply_casing` function properly handles the `Cow<str>` type, avoiding unnecessary allocations when lowercase is not required.

125-129: Smart workaround for lifetime issues.

The custom `charabia_token_iter` function is a clever solution to the lifetime problems with direct use of the `Tokenize` trait. The comment clearly explains the rationale.

233-243: Verification script:

```shell
#!/bin/bash
# Show the full implementation of `tokenize` to see how `lowercase` is used
rg -n -A30 "pub fn tokenize" lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs

# Search for any explicit lowercase normalization calls within the tokenizer
rg -n "to_lowercase" -A5 lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs

# Search for references to the `lowercase` parameter within the method body
rg -n "lowercase" -A5 lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs
```
* Fix Chinese stopwords
* remove todo
Actionable comments posted: 0
🔭 Outside diff range comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
233-243: Fix inconsistency in test helper function.

The `assert_tokenization` helper function passes `false` for the `lowercase` parameter, but the expected test results in `test_multilingual_tokenization` contain lowercase tokens. This suggests either the parameter should be `true` or the expected results should use original casing.

```diff
-    MultilingualTokenizer::tokenize(inp, false, &empty_filter, |i| out.push(i.to_string()));
+    MultilingualTokenizer::tokenize(inp, true, &empty_filter, |i| out.push(i.to_string()));
```

Or update the test expectations to match the actual casing behavior.
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
101-112: Consider optimizing the token filtering logic.

The current implementation processes each token through multiple steps. Consider combining the alphabetic check with the stopword check for better performance:

```diff
-for token in charabia_token_iter(input) {
-    let cased_token = apply_casing(token.lemma, lowercase);
-
-    if stopwords_filter.is_stopword(&cased_token)
-        || cased_token.chars().all(|char| !char.is_alphabetic())
-    {
-        continue;
-    }
-
-    let processed_token = process_token(cased_token);
-    cb(processed_token)
-}
+for token in charabia_token_iter(input) {
+    let cased_token = apply_casing(token.lemma, lowercase);
+
+    // Check alphabetic first as it's likely faster than stopword lookup
+    if cased_token.chars().all(|char| !char.is_alphabetic())
+        || stopwords_filter.is_stopword(&cased_token)
+    {
+        continue;
+    }
+
+    let processed_token = process_token(cased_token);
+    cb(processed_token)
+}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- `lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs` (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: test-low-resources
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: integration-tests
- GitHub Check: test-consensus-compose
- GitHub Check: lint
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: storage-compat-test
🔇 Additional comments (7)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (7)

10-10: LGTM: Clear struct renaming improves naming consistency.

The rename from `MultilingualV2` to `MultilingualTokenizer` follows a more descriptive naming convention that clearly indicates the purpose of the struct.

14-21: LGTM: Well-documented default normalizer configuration.

The `DEFAULT_NORMALIZER` constant properly replicates the default configuration from charabia, with clear documentation explaining its purpose and source.

24-29: LGTM: Improved method signature with explicit parameters.

The refactored signature with explicit `lowercase` and `stopwords_filter` parameters provides better control and removes the dependency on `TextIndexParams`, making the interface more flexible and testable.

74-80: LGTM: Effective use of closure for stemming integration.

The stemming logic is cleanly integrated using a closure that calls `stem_cow`, which is appropriate for the `Cow<str>` return type and avoids unnecessary allocations.

117-123: LGTM: Efficient casing implementation using Cow.

The `apply_casing` function correctly leverages `Cow<str>` to avoid unnecessary allocations when lowercasing is not needed, only creating an owned string when transformation is required.

125-129: LGTM: Effective workaround for lifetime issues.

The custom `charabia_token_iter` function properly addresses the lifetime issues mentioned in the PR description by unrolling the `tokenize()` call and using the segment/normalize API directly.

38-38: Verification script:

```shell
#!/bin/bash
# Retrieve the full signature of the `tokenize_charabia_and_stem` function
rg -n -C5 "fn tokenize_charabia_and_stem" lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs
```
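The `stem_cow` helper referenced in the 74-80 comment is not shown in this thread; a rough equivalent using the rust-stemmers crate (an assumption about the underlying stemmer library, so the real implementation may differ) could look like:

```rust
use std::borrow::Cow;
use rust_stemmers::{Algorithm, Stemmer};

// Sketch of a `stem_cow`-style helper: rust-stemmers returns Cow::Borrowed
// when the word is unchanged, which lets us keep the original token
// (possibly still Borrowed) without allocating.
fn stem_cow<'a>(input: Cow<'a, str>, stemmer: &Stemmer) -> Cow<'a, str> {
    // Run the stemmer; if it had to allocate a new string, take it.
    if let Cow::Owned(stemmed) = stemmer.stem(&input) {
        return Cow::Owned(stemmed);
    }
    // Otherwise the word was unchanged: preserve the original Cow.
    input
}

fn main() {
    let stemmer = Stemmer::create(Algorithm::English);
    assert_eq!(stem_cow(Cow::Borrowed("running"), &stemmer), "run");
}
```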
Thank you! This PR is extremely important to me. I really need the multilingual tokenizer feature enabled by default, as it has solved my business problems.
* Implement new multilingual tokenizer
* Remove unnecessary clones
* Codespell
* filter stopwords before stemmer
* Fix Chinese stopwords (#6765)
  * Fix Chinese stopwords
  * remove todo

---------

Co-authored-by: Andrey Vasnetsov <andrey@vasnetsov.com>
Implements the new multilingual tokenizer. In detail, this PR changes the following:

- Remove the `multilingual-*` cargo features and enable multilingual by default
- Use `Cow<str>` for less cloning and easier string manipulation
- Improve the `MultilingualV2` tokenizer (renamed to `MultilingualTokenizer`)

MultilingualTokenizer

Due to lifetime issues caused by the underlying tokenization libraries, not all tokenization functions can benefit from clone-less `Cow` usage. This is kept as an open TODO and could be investigated in a future PR.
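Based on the test helper quoted in the reviews (`MultilingualTokenizer::tokenize(inp, false, &empty_filter, |i| out.push(i.to_string()))`), the callback-driven call shape looks roughly like the runnable stand-in below. The tokenizer body and the stopwords parameter are simplified stand-ins (whitespace splitting instead of charabia, a plain slice instead of the real filter type); only the `(input, lowercase, stopwords, callback)` shape and the `Cow<str>` callback contract mirror the PR:

```rust
use std::borrow::Cow;

// Stand-in for the crate-internal MultilingualTokenizer.
struct MultilingualTokenizer;

impl MultilingualTokenizer {
    // Simplified: real tokenization uses charabia; this demonstrates only
    // the Cow-based callback contract of the new API.
    fn tokenize<'a>(
        input: &'a str,
        lowercase: bool,
        stopwords: &[&str],
        mut cb: impl FnMut(Cow<'a, str>),
    ) {
        for word in input.split_whitespace() {
            if stopwords.contains(&word) {
                continue; // stopword filtering happens inside the tokenizer
            }
            if lowercase {
                cb(Cow::Owned(word.to_lowercase())); // allocates only here
            } else {
                cb(Cow::Borrowed(word)); // zero-copy token
            }
        }
    }
}

fn main() {
    let mut out: Vec<String> = Vec::new();
    MultilingualTokenizer::tokenize("Je mange une pomme", true, &["une"], |t| {
        out.push(t.to_string())
    });
    assert_eq!(out, ["je", "mange", "pomme"]);
}
```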