Implement new multilingual tokenizer #6762
Conversation
📝 Walkthrough

This change refactors the text tokenization pipeline to consistently use `Cow<str>`.
Actionable comments posted: 1
♻️ Duplicate comments (1)

lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)

116-116: Fix typo: "returing" → "returning"
🧹 Nitpick comments (2)

lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)

383-384: Track the Chinese stopwords bug

The TODO comment indicates that Chinese stopwords must be in hanzi, or the stopword list must be in pinyin. This is a significant limitation that affects Chinese text processing.

Would you like me to create a GitHub issue to track this bug and propose a solution for properly handling Chinese stopwords?
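For illustration, the mismatch looks roughly like this (the hanzi/pinyin values are assumed for the example, not taken from the codebase):

```rust
fn main() {
    // Assumed example: Chinese normalization can emit pinyin lemmas,
    // while a user-supplied stopword list may contain hanzi.
    let stopwords = ["的"]; // stopword list in hanzi
    let normalized_token = "de"; // the same word as a normalized pinyin token

    // The lookup misses, so the stopword silently slips through the filter.
    assert!(!stopwords.contains(&normalized_token));
}
```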
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)

157-164: Consider optimizing `truncate_cow_ref` for owned strings

The current implementation creates a new string when truncating an owned `Cow`. Since we're already dealing with an owned string, we could potentially optimize this. However, since we only have a reference to the `Cow`, we cannot mutate the owned string in place. The current implementation is correct given the constraints. Consider documenting why `truncate_cow` should be preferred when possible.

```diff
 /// Truncates a string inside a `&Cow<str>` to the given `len` preserving the `Borrowed` and `Owned` state.
 /// `truncate_cow` should be preferred over this function if Cow doesn't need to be passed as reference.
+/// Note: When the Cow is Owned, this creates a new allocation as we cannot mutate through a reference.
 fn truncate_cow_ref<'a>(inp: &Cow<'a, str>, len: usize) -> Cow<'a, str> {
```
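For context, a minimal sketch of what such a helper plausibly looks like, reconstructed from the signature quoted above (not the code under review; it assumes `len` falls on a UTF-8 character boundary, since slicing panics otherwise):

```rust
use std::borrow::Cow;

// Hypothetical sketch of a `truncate_cow_ref`-style helper.
fn truncate_cow_ref<'a>(inp: &Cow<'a, str>, len: usize) -> Cow<'a, str> {
    match inp {
        // Borrowed data keeps the original lifetime 'a, so truncation is
        // a free sub-slice with no allocation.
        Cow::Borrowed(s) => {
            let s: &'a str = s; // copy the inner &'a str out of the Cow
            Cow::Borrowed(&s[..len.min(s.len())])
        }
        // Owned data behind a shared reference cannot be truncated in
        // place, so a new allocation is unavoidable here.
        Cow::Owned(s) => Cow::Owned(s[..len.min(s.len())].to_string()),
    }
}
```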
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)

- `Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (8)

- `Cargo.toml` (9 hunks)
- `lib/segment/Cargo.toml` (1 hunks)
- `lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs` (1 hunks)
- `lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs` (2 hunks)
- `lib/segment/src/index/field_index/full_text_index/text_index.rs` (5 hunks)
- `lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs` (4 hunks)
- `lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs` (18 hunks)
- `lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs` (4 hunks)
🧰 Additional context used
🪛 GitHub Actions: Codespell
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs
[error] 116-116: codespell detected a typo: 'returing' should be 'returning'. Command failed with exit code 65.
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: storage-compat-test
- GitHub Check: test-low-resources
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: lint
- GitHub Check: integration-tests
- GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (15)
lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (1)

137-139: LGTM: Efficient iteration and token handling.

The change from consuming `values` to iterating by reference (`&values`) is a good optimization that avoids unnecessary cloning while allowing the tokenizer to provide `Cow<str>` tokens directly. This aligns well with the broader refactoring to use copy-on-write semantics throughout the tokenization pipeline.

lib/segment/Cargo.toml (1)

106-108: LGTM: Simplified feature management.

Moving the Chinese-related features directly to the `charabia` dependency simplifies the feature configuration by removing the indirection through feature groups. This makes the dependency requirements more explicit and easier to maintain.

lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (2)
1-1: LGTM: Proper import for Cow usage.

The addition of the `std::borrow::Cow` import is necessary for the new token representation using copy-on-write semantics.
239-243: LGTM: Efficient token handling with Cow.

The changes properly implement the Cow refactoring:

- The type declaration correctly specifies `Vec<Cow<str>>` for flexible ownership
- Iteration by reference (`&values`) avoids unnecessary cloning
- The tokenizer callback can now push `Cow<str>` tokens directly

This should improve performance by reducing string allocations while maintaining the same functional behavior.
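As a rough sketch of the pattern this enables (names hypothetical; the real tokenizer callback is more involved), borrowing the input lets tokens point back into the original strings:

```rust
use std::borrow::Cow;

// Iterate-by-reference pattern: borrowing `values` lets the tokenizer
// emit Cow::Borrowed tokens pointing into the original strings.
fn collect_tokens<'a>(values: &'a [String]) -> Vec<Cow<'a, str>> {
    let mut tokens: Vec<Cow<'a, str>> = Vec::new();
    for value in values {
        // Stand-in for the real tokenizer callback, which pushes
        // Cow<str> tokens directly without cloning.
        for word in value.split_whitespace() {
            tokens.push(Cow::Borrowed(word));
        }
    }
    tokens
}
```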
Cargo.toml (2)

84-89: LGTM: Improved readability with multi-line formatting.

The formatting change to use multi-line arrays for the `features` field improves readability and makes it easier to manage individual features.

337-337: LGTM: Helpful comment for build configuration.

The comment explaining the restoration of the default codegen-units value provides useful context for the performance-testing profile configuration.
lib/segment/src/index/field_index/full_text_index/text_index.rs (5)

1-1: LGTM: Proper import for Cow usage.

The addition of the `std::borrow::Cow` import is necessary for the updated serialization methods using copy-on-write semantics.
245-249: LGTM: Well-implemented Cow serialization.

The updates to `serialize_document` properly integrate `Cow<str>` tokens:

- The function signature correctly accepts `Vec<Cow<str>>` for flexible ownership
- The `StoredDocument` struct properly uses the lifetime-parameterized `Cow<'a, str>`
- This allows efficient serialization of both borrowed and owned strings

The serialization maintains backward compatibility since deserialization still uses `Vec<String>`.
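A hedged sketch of the shape being described (field names assumed; JSON stands in for whatever codec the index actually uses):

```rust
use std::borrow::Cow;
use serde::Serialize;

// Assumed shape of the lifetime-parameterized document: serialization can
// borrow tokens straight from the tokenizer, while deserialization may
// still target an owned Vec<String> for backward compatibility.
#[derive(Serialize)]
struct StoredDocument<'a> {
    tokens: Vec<Cow<'a, str>>,
}

fn serialize_document(tokens: Vec<Cow<'_, str>>) -> Result<Vec<u8>, serde_json::Error> {
    serde_json::to_vec(&StoredDocument { tokens })
}
```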
304-304: LGTM: Proper token conversion for API compatibility.

The `.as_ref()` call correctly converts `Cow<str>` to `&str` to match the expected parameter type for `get_token()`. This maintains API compatibility while leveraging the new token representation.

313-313: LGTM: Consistent token conversion pattern.

The `.as_ref()` call follows the same pattern as other locations, ensuring consistent handling of `Cow<str>` tokens throughout the codebase.

323-323: LGTM: Consistent token handling in document parsing.

The `.as_ref()` call maintains consistency with the token conversion pattern used throughout the file, properly converting `Cow<str>` to `&str` for the `get_token()` API.

lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs (1)

40-50: LGTM! Clean integration of stopword filtering.

The stopword filtering and lowercase parameters are well integrated into the Japanese tokenizer, maintaining consistency with other tokenizers in the codebase.

Also applies to: 58-62

lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (2)

125-129: LGTM! Good workaround for lifetime issues.

The unrolled tokenization approach effectively addresses the lifetime constraints while maintaining functionality. The comment clearly explains why this approach was necessary.

116-123: Optimize `apply_casing` to avoid unnecessary allocations

When `lowercase` is false and the input is already borrowed, you can return it directly without allocation.

```diff
 /// Applies `lowercase` to the given input, returning a cow.
 fn apply_casing<'a>(input: Cow<'a, str>, lowercase: bool) -> Cow<'a, str> {
     if lowercase {
         Cow::Owned(input.to_lowercase())
     } else {
-        input
+        // Preserve the borrowed/owned state of the input
+        input
     }
 }
```

Actually, the current implementation is already optimal for this case. The function correctly preserves the `Cow` state when not lowercasing.

lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)

10-277: Excellent refactoring to use `Cow<str>` throughout!

The consistent use of `Cow<str>` across all tokenizers provides flexibility in ownership and reduces unnecessary allocations. The implementation correctly preserves borrowed strings when possible and only allocates when necessary (e.g., for lowercase conversion).
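The borrow-preserving behavior described here can be illustrated in isolation (a standalone example, not the crate's actual helper):

```rust
use std::borrow::Cow;

fn maybe_lowercase(token: &str, lowercase: bool) -> Cow<'_, str> {
    if lowercase {
        Cow::Owned(token.to_lowercase()) // allocates only on this path
    } else {
        Cow::Borrowed(token) // zero-copy pass-through
    }
}

fn main() {
    // Already-borrowed input stays borrowed when no transform is needed.
    assert!(matches!(maybe_lowercase("Qdrant", false), Cow::Borrowed(_)));
    // Lowercasing is the only case that pays for an allocation.
    assert!(matches!(maybe_lowercase("Qdrant", true), Cow::Owned(_)));
}
```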
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
117-123: Consider optimizing unnecessary allocations in lowercase transformation.

The current implementation always creates an owned string for lowercase transformation. This could be optimized by checking if the string is already lowercase:

```diff
 fn apply_casing<'a>(input: Cow<'a, str>, lowercase: bool) -> Cow<'a, str> {
     if lowercase {
-        Cow::Owned(input.to_lowercase())
+        let lowercased = input.to_lowercase();
+        if lowercased == *input {
+            input
+        } else {
+            Cow::Owned(lowercased)
+        }
     } else {
         input
     }
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- `lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs` (4 hunks)
- `lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs` (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-consensus-compose
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: integration-tests
- GitHub Check: lint
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (7)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (7)

1-21: LGTM! Clean imports and well-documented normalizer configuration.

The addition of the `Cow` import aligns with the PR objective of efficient string handling, and the `DEFAULT_NORMALIZER` constant properly centralizes the charabia configuration with clear documentation linking to the source.

64-114: Excellent refactoring with proper separation of concerns.

The consolidation of tokenization logic into the `charabia` method reduces code duplication while maintaining flexibility through the `map_token` parameter. The consistent use of `Cow<str>` throughout aligns with the performance objectives.

125-129: Good solution for the lifetime constraint.

The unrolling of `Tokenize::tokenize()` effectively addresses the lifetime issues mentioned in the PR description. The clear comment explains the technical necessity behind this approach.

131-190: Language detection and mapping logic is solid.

The language detection, script detection, and algorithm mapping functions are well-implemented and appropriately handle the supported multilingual languages mentioned in the PR objectives.

233-260: Test updates properly reflect the API changes.

The tests are correctly updated to use the new API with explicit `lowercase` and `stopwords_filter` parameters. The test coverage remains comprehensive for multilingual tokenization scenarios.

47-47: Verify Japanese tokenizer interface compatibility.

The call to `japanese::tokenize` uses the new parameter signature. Ensure the Japanese tokenizer has been updated to accept the `lowercase` and `stopwords_filter` parameters.

```shell
#!/bin/bash
# Description: Verify the Japanese tokenizer interface matches the expected signature
# Expected: japanese::tokenize should accept lowercase and stopwords_filter parameters

echo "Checking Japanese tokenizer interface..."
rg -A 5 "pub fn tokenize" lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs
```

23-61: Verify API compatibility across the codebase.

The method signature has changed significantly by adding explicit `lowercase` and `stopwords_filter` parameters and using `Cow<'a, str>` for the callback. This improves control and performance but represents a breaking change.

Run the following script to ensure all callers have been updated:

```shell
#!/bin/bash
# Description: Find all calls to MultilingualTokenizer::tokenize to verify they use the new signature
# Expected: All calls should pass lowercase and stopwords_filter parameters

echo "Searching for MultilingualTokenizer::tokenize calls..."
rg -A 3 "MultilingualTokenizer::tokenize"

echo "Searching for any remaining MultilingualV2 references..."
rg "MultilingualV2"
```
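As an aside on the `charabia_token_iter` workaround discussed in the 125-129 comment above, the segment + normalize unrolling looks roughly like the following sketch. The charabia trait names, import paths, and `NormalizerOption::default()` are assumptions based on charabia's public API, not the PR's exact code:

```rust
use charabia::normalizer::{Normalize, NormalizerOption};
use charabia::segmenter::Segment;

// Rough sketch of unrolling Tokenize::tokenize() into separate segment
// and normalize steps, which sidesteps the lifetime issue.
fn print_tokens(input: &str) {
    // Stand-in for the PR's DEFAULT_NORMALIZER constant.
    let options = NormalizerOption::default();
    for token in input.segment() {
        let normalized = token.normalize(&options);
        println!("{}", normalized.lemma());
    }
}
```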
Force-pushed from bf28f53 to 3eb829d
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
32-33: Address the TODO comment for stemming configuration.

The hardcoded `stem = false` with a TODO comment suggests this should be configurable. Consider making this a parameter or reading it from configuration.

Would you like me to help implement a configurable stemming option or create an issue to track this enhancement?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- `lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs` (4 hunks)
- `lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs` (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consensus-compose
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-consistency
- GitHub Check: integration-tests
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: lint
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
🔇 Additional comments (7)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (7)

3-4: Good addition of specific imports for the new functionality.

The imports are well-organized and include the necessary components for the new normalization and tokenization approach.

10-21: Excellent refactoring with clear constant definition.

The struct rename from `MultilingualV2` to `MultilingualTokenizer` provides better clarity, and the `DEFAULT_NORMALIZER` constant centralizes configuration options for consistency across tokenization operations.

24-29: Improved API design with explicit parameters.

The new method signature accepting explicit `lowercase` and `stopwords_filter` parameters is more modular and testable compared to the previous `TextIndexParams` dependency.

94-113: Well-structured core tokenization logic.

The `charabia` method effectively centralizes token processing with proper filtering for stopwords and non-alphabetic characters. The use of `Cow<str>` throughout maintains efficiency while providing flexibility.

117-123: Effective casing implementation with Cow optimization.

The `apply_casing` function properly handles the `Cow<str>` type, avoiding unnecessary allocations when lowercase is not required.

125-129: Smart workaround for lifetime issues.

The custom `charabia_token_iter` function is a clever solution to the lifetime problems with direct use of the `Tokenize` trait. The comment clearly explains the rationale.

233-243: Verification script:

```shell
#!/bin/bash
# Show the full implementation of `tokenize` to see how `lowercase` is used
rg -n -A30 "pub fn tokenize" lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs

# Search for any explicit lowercase normalization calls within the tokenizer
rg -n "to_lowercase" -A5 lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs

# Search for references to the `lowercase` parameter within the method body
rg -n "lowercase" -A5 lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs
```
* Fix Chinese stopwords
* remove todo
Actionable comments posted: 0
🔭 Outside diff range comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
233-243: Fix inconsistency in test helper function.

The `assert_tokenization` helper function passes `false` for the `lowercase` parameter, but the expected test results in `test_multilingual_tokenization` contain lowercase tokens. This suggests either the parameter should be `true` or the expected results should use original casing.

```diff
-    MultilingualTokenizer::tokenize(inp, false, &empty_filter, |i| out.push(i.to_string()));
+    MultilingualTokenizer::tokenize(inp, true, &empty_filter, |i| out.push(i.to_string()));
```

Or update the test expectations to match the actual casing behavior.
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
101-112: Consider optimizing the token filtering logic.

The current implementation processes each token through multiple steps. Consider combining the alphabetic check with the stopword check for better performance:

```diff
-for token in charabia_token_iter(input) {
-    let cased_token = apply_casing(token.lemma, lowercase);
-
-    if stopwords_filter.is_stopword(&cased_token)
-        || cased_token.chars().all(|char| !char.is_alphabetic())
-    {
-        continue;
-    }
-
-    let processed_token = process_token(cased_token);
-    cb(processed_token)
-}
+for token in charabia_token_iter(input) {
+    let cased_token = apply_casing(token.lemma, lowercase);
+
+    // Check alphabetic first as it's likely faster than stopword lookup
+    if cased_token.chars().all(|char| !char.is_alphabetic())
+        || stopwords_filter.is_stopword(&cased_token)
+    {
+        continue;
+    }
+
+    let processed_token = process_token(cased_token);
+    cb(processed_token)
+}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- `lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs` (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: test-low-resources
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: integration-tests
- GitHub Check: test-consensus-compose
- GitHub Check: lint
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: storage-compat-test
🔇 Additional comments (7)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (7)

10-10: LGTM: Clear struct renaming improves naming consistency.

The rename from `MultilingualV2` to `MultilingualTokenizer` follows a more descriptive naming convention that clearly indicates the purpose of the struct.

14-21: LGTM: Well-documented default normalizer configuration.

The `DEFAULT_NORMALIZER` constant properly replicates the default configuration from charabia, with clear documentation explaining its purpose and source.

24-29: LGTM: Improved method signature with explicit parameters.

The refactored signature with explicit `lowercase` and `stopwords_filter` parameters provides better control and removes the dependency on `TextIndexParams`, making the interface more flexible and testable.

74-80: LGTM: Effective use of closure for stemming integration.

The stemming logic is cleanly integrated using a closure that calls `stem_cow`, which is appropriate for the `Cow<str>` return type and avoids unnecessary allocations.

117-123: LGTM: Efficient casing implementation using Cow.

The `apply_casing` function correctly leverages `Cow<str>` to avoid unnecessary allocations when lowercasing is not needed, only creating an owned string when transformation is required.

125-129: LGTM: Effective workaround for lifetime issues.

The custom `charabia_token_iter` function properly addresses the lifetime issues mentioned in the PR description by unrolling the `tokenize()` call and using the segment/normalize API directly.

38-38: Verification script:

```shell
#!/bin/bash
# Retrieve the full signature of the `tokenize_charabia_and_stem` function
rg -n -C5 "fn tokenize_charabia_and_stem" lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs
```
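The `stem_cow` helper referenced in the 74-80 comment is not shown in this thread; a rough equivalent using the rust-stemmers crate (an assumption about the underlying stemmer library, so the real implementation may differ) could look like:

```rust
use std::borrow::Cow;
use rust_stemmers::{Algorithm, Stemmer};

// Sketch of a `stem_cow`-style helper: rust-stemmers returns Cow::Borrowed
// when the word is unchanged, which lets us keep the original token
// (possibly still Borrowed) without allocating.
fn stem_cow<'a>(input: Cow<'a, str>, stemmer: &Stemmer) -> Cow<'a, str> {
    // Run the stemmer; if it had to allocate a new string, take it.
    if let Cow::Owned(stemmed) = stemmer.stem(&input) {
        return Cow::Owned(stemmed);
    }
    // Otherwise the word was unchanged: preserve the original Cow.
    input
}

fn main() {
    let stemmer = Stemmer::create(Algorithm::English);
    assert_eq!(stem_cow(Cow::Borrowed("running"), &stemmer), "run");
}
```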
Thank you! This PR is extremely important to me. I really need the multilingual tokenizer feature enabled by default, as it has solved my business problems.
* Implement new multilingual tokenizer
* Remove unnecessary clones
* Codespell
* filter stopwords before stemmer
* Fix Chinese stopwords (#6765)
  * Fix Chinese stopwords
  * remove todo

---------

Co-authored-by: Andrey Vasnetsov <andrey@vasnetsov.com>
Implements the new multilingual tokenizer. In detail, this PR changes the following:

- Remove the `multilingual-*` cargo features and enable multilingual by default
- Use `Cow<str>` for less cloning and easier string manipulation
- Improve the `MultilingualV2` tokenizer (renamed to `MultilingualTokenizer`)

MultilingualTokenizer

Due to lifetime issues caused by the underlying tokenization libraries, not all tokenization functions can benefit from clone-less `Cow` usage. This is kept as an open TODO and could be investigated in a future PR.
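Based on the test helper quoted in the reviews (`MultilingualTokenizer::tokenize(inp, false, &empty_filter, |i| out.push(i.to_string()))`), the callback-driven call shape looks roughly like the runnable stand-in below. The tokenizer body and the stopwords parameter are simplified stand-ins (whitespace splitting instead of charabia, a plain slice instead of the real filter type); only the `(input, lowercase, stopwords, callback)` shape and the `Cow<str>` callback contract mirror the PR:

```rust
use std::borrow::Cow;

// Stand-in for the crate-internal MultilingualTokenizer.
struct MultilingualTokenizer;

impl MultilingualTokenizer {
    // Simplified: real tokenization uses charabia; this demonstrates only
    // the Cow-based callback contract of the new API.
    fn tokenize<'a>(
        input: &'a str,
        lowercase: bool,
        stopwords: &[&str],
        mut cb: impl FnMut(Cow<'a, str>),
    ) {
        for word in input.split_whitespace() {
            if stopwords.contains(&word) {
                continue; // stopword filtering happens inside the tokenizer
            }
            if lowercase {
                cb(Cow::Owned(word.to_lowercase())); // allocates only here
            } else {
                cb(Cow::Borrowed(word)); // zero-copy token
            }
        }
    }
}

fn main() {
    let mut out: Vec<String> = Vec::new();
    MultilingualTokenizer::tokenize("Je mange une pomme", true, &["une"], |t| {
        out.push(t.to_string())
    });
    assert_eq!(out, ["je", "mange", "pomme"]);
}
```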