Add stopwords support #6635
Conversation
Force-pushed from f2ede10 to e6bd5d1.
📝 Walkthrough: The change adds extensive support for stopwords in full-text indexing. It introduces an optional `stopwords` field on `TextIndexParams` that accepts either a single language or a set of languages plus custom words.
Actionable comments posted: 2
♻️ Duplicate comments (1)

lib/segment/src/index/field_index/full_text_index/stop_words.rs (1)

Lines 1-11: Consider using a wildcard import for better maintainability. While the explicit imports show all available languages clearly, a wildcard import would be easier to maintain as new languages are added.

```diff
-use super::stop_words_container::{
-    ARABIC_STOPWORDS, AZERBAIJANI_STOPWORDS, BASQUE_STOPWORDS, BENGALI_STOPWORDS,
-    CATALAN_STOPWORDS, CHINESE_STOPWORDS, DANISH_STOPWORDS, DUTCH_STOPWORDS, ENGLISH_STOPWORDS,
-    FINNISH_STOPWORDS, FRENCH_STOPWORDS, GERMAN_STOPWORDS, GREEK_STOPWORDS, HEBREW_STOPWORDS,
-    HINGLISH_STOPWORDS, HUNGARIAN_STOPWORDS, INDONESIAN_STOPWORDS, ITALIAN_STOPWORDS,
-    KAZAKH_STOPWORDS, NEPALI_STOPWORDS, NORWEGIAN_STOPWORDS, PORTUGUESE_STOPWORDS,
-    ROMANIAN_STOPWORDS, RUSSIAN_STOPWORDS, SLOVENE_STOPWORDS, SPANISH_STOPWORDS, SWEDISH_STOPWORDS,
-    TAJIK_STOPWORDS, TURKISH_STOPWORDS,
-};
+use super::stop_words_container::*;
```
🧹 Nitpick comments (2)
.github/workflows/codespell.yml (1)

Lines 21-25: Suggestion: extend the skip patterns for stopwords files. You've added `skip: '**/stop_words_container.rs'` to avoid Codespell noise. To prevent false positives on domain-specific terms in both modules, consider skipping `stop_words.rs` as well. Example diff:

```diff
 with:
-  skip: '**/stop_words_container.rs'
+  skip:
+    - '**/stop_words_container.rs'
+    - '**/stop_words.rs'
```

docs/redoc/master/openapi.json (1)
Lines 7670-7732: Verify enum consistency and enhance schema strictness.

- Ensure the `Language` enum values here match the Rust `Language` enum exactly.
- Consider adding `"additionalProperties": false` to `StopwordsSet` and `"uniqueItems": true` on its arrays for stricter validation.
- Add `description` fields to `StopwordsInterface`, `Language`, and `StopwordsSet` to improve API documentation clarity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- .github/workflows/codespell.yml (1 hunks)
- docs/redoc/master/openapi.json (2 hunks)
- lib/api/src/grpc/conversions.rs (2 hunks)
- lib/segment/src/data_types/index.rs (2 hunks)
- lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers.rs (6 hunks)
🔇 Additional comments (21)
lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1)

Line 168: Initialize stopwords in the test config. Adding `stopwords: None` ensures the test reflects the updated `TextIndexParams` API. Good catch.

lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1)

Line 155: Update the test config for stopwords support. Explicitly setting `stopwords: None` in the `TextIndexParams` aligns with the new optional field. The test update is correct.

lib/segment/src/index/field_index/full_text_index/mod.rs (2)

Line 10: Add the stop_words module. Declaring the `stop_words` module integrates the new stopword filtering implementation. Approved.

Line 18: Add the stop_words_container module. The `stop_words_container` module declaration is required for managing custom stopword sets. Approved.

lib/api/src/grpc/conversions.rs (1)
Line 217: ✅ Verification successful. Verify the intentional exclusion of stopwords from gRPC conversion. The `stopwords` field is destructured but ignored, which means stopwords configuration is lost when converting from the internal representation to gRPC. This could lead to data loss if stopwords are configured via other APIs and then processed through gRPC conversions. Please verify whether this is intentional. If the gRPC API doesn't support stopwords yet, this should be documented. If it should support stopwords, the field needs to be properly handled in the conversion.

🏁 Script executed:

```bash
#!/bin/bash
# Verify if gRPC TextIndexParams proto includes stopwords field
echo "Searching for TextIndexParams proto definition..."
fd -e proto -x grep -l "TextIndexParams" {} \;
echo ""
echo "Searching for stopwords in proto files..."
fd -e proto -x grep -n "stopwords" {} \;
```

Length of output: 349
gRPC `TextIndexParams` doesn't include `stopwords`; the conversion is intentional. The protobuf schema in `lib/api/src/grpc/proto/collections.proto` for `TextIndexParams` does not define a `stopwords` field, so the current conversions correctly ignore it. If you intend to expose stopwords via gRPC in the future:

- Add `stopwords` to the proto definition.
- Update both the `From` and `TryFrom` implementations in `lib/api/src/grpc/conversions.rs` to handle it.

Otherwise, please document in the gRPC API that stopwords are not supported over gRPC at this time.
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (6)
Line 4: LGTM: Clean import addition. The import for `StopwordsFilter` is appropriately placed and follows the existing import structure.

Line 85: LGTM: Proper stopwords filter initialization. Creating the `StopwordsFilter` from the configuration at the beginning of the closure is the right approach for performance and clarity.

Lines 103-105: LGTM: Optimal placement of stopword filtering. The stopword check is correctly placed after length validation but before case normalization, ensuring consistent behavior regardless of input case while maintaining the existing validation logic.
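A minimal sketch of that ordering; the names and signature are illustrative stand-ins, not the PR's actual closure in `tokenizers.rs`:

```rust
// Sketch of the filtering order described above: length check first,
// then the stopword check, then case normalization.
fn process_token(
    token: &str,
    min_token_len: Option<usize>,
    max_token_len: Option<usize>,
    is_stopword: impl Fn(&str) -> bool,
    lowercase: bool,
    mut emit: impl FnMut(String),
) {
    // 1. Length validation comes first.
    let char_len = token.chars().count();
    if min_token_len.is_some_and(|min| char_len < min)
        || max_token_len.is_some_and(|max| char_len > max)
    {
        return;
    }
    // 2. Stopword check before lowercasing; the filter itself matches
    //    case-insensitively, so "The" and "the" behave the same.
    if is_stopword(token) {
        return;
    }
    // 3. Case normalization happens last.
    let out = if lowercase { token.to_lowercase() } else { token.to_string() };
    emit(out);
}
```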
Line 148: LGTM: Necessary test imports added. The additional imports for stopwords types are required for the new test cases.

Line 267: LGTM: Backward compatibility maintained. Adding `stopwords: None` to existing test configurations ensures backward compatibility and prevents test failures.

Lines 282-395: Excellent test coverage for stopword functionality. The test suite comprehensively covers:

- language-based stopwords with English stopword verification
- custom stopwords with user-defined terms
- mixed configuration combining language and custom stopwords
- proper filtering behavior for both stopwords and non-stopwords

The tests validate that the integration works correctly across different stopword configurations.
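A self-contained miniature of the assertion pattern these tests follow; a plain whitespace split stands in for the PR's tokenizers, so everything below is illustrative:

```rust
use std::collections::HashSet;

// Drop English stopwords from a whitespace-tokenized input and check
// that only content words survive, mirroring the tests' structure.
fn english_stopword_filtering_demo() {
    let stopwords: HashSet<&str> = ["the", "a", "of"].into_iter().collect();
    let tokens: Vec<&str> = "the quick brown fox"
        .split_whitespace()
        .filter(|t| !stopwords.contains(t.to_lowercase().as_str()))
        .collect();
    assert_eq!(tokens, vec!["quick", "brown", "fox"]);
}
```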
lib/segment/src/data_types/index.rs (5)
Lines 202-203: LGTM: Well-designed stopwords field addition. The optional `stopwords` field with `skip_serializing_if` ensures backward compatibility while enabling the new feature. The field placement and naming follow existing conventions.

Lines 206-211: LGTM: Clean interface design following established patterns. The `StopwordsInterface` enum with `#[serde(untagged)]` provides flexible JSON handling, allowing both simple language strings and complex configuration objects. This design aligns with the other Interface enums in the codebase.

Lines 247-254: LGTM: Flexible stopwords configuration structure. The `StopwordsSet` struct allows combining multiple languages with custom stopwords, and the `#[serde(default)]` attributes ensure clean JSON handling when fields are omitted.
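A self-contained sketch of this shape, assuming serde and serde_json as dependencies; only two languages are shown, and the types mirror the review discussion rather than the exact source:

```rust
use std::collections::BTreeSet;

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
#[serde(rename_all = "lowercase")]
enum Language {
    English,
    Spanish,
}

#[derive(Serialize, Deserialize, Debug, PartialEq, Default)]
struct StopwordsSet {
    // #[serde(default)] lets either field be omitted in the JSON.
    #[serde(default, skip_serializing_if = "BTreeSet::is_empty")]
    languages: BTreeSet<Language>,
    #[serde(default, skip_serializing_if = "BTreeSet::is_empty")]
    custom: BTreeSet<String>,
}

// Untagged: serde tries the variants in order, so a bare string becomes
// Language and an object becomes Set.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
#[serde(untagged)]
enum StopwordsInterface {
    Language(Language),
    Set(StopwordsSet),
}

fn main() {
    let simple: StopwordsInterface = serde_json::from_str(r#""english""#).unwrap();
    let full: StopwordsInterface =
        serde_json::from_str(r#"{"languages": ["english"], "custom": ["flamingo"]}"#).unwrap();
    println!("{simple:?}\n{full:?}");
}
```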
Lines 300-356: Excellent serialization test coverage. The test suite thoroughly validates:

- Language variant serialization to simple strings
- Set variant serialization with both languages and custom words
- deserialization from various JSON formats
- proper handling of the untagged enum structure

These tests ensure robust API compatibility and data integrity.

Lines 213-245: ✅ Verification successful. Verify all languages have corresponding stopword data. The comprehensive `Language` enum covers a wide range of languages with proper lowercase serialization. Ensure that all listed languages have corresponding stopword arrays in the `stop_words_container` module.

🏁 Script executed:

```bash
#!/bin/bash
# Description: Verify that all Language enum variants have corresponding stopword constants

# Extract language names from the enum
echo "Checking if all Language enum variants have corresponding stopword constants..."

# Get all language variants from this file
ast-grep --pattern 'enum Language { $$$ }' | grep -E '^\s*[A-Z]' | sed 's/,$//' | sort > /tmp/enum_languages.txt

# Search for stopword constants in the stop_words_container module
rg -o '[A-Z_]+_STOPWORDS' --type rust | sed 's/_STOPWORDS$//' | sort | uniq > /tmp/constant_languages.txt

echo "Language enum variants:"
cat /tmp/enum_languages.txt
echo -e "\nStopword constants found:"
cat /tmp/constant_languages.txt
echo -e "\nLanguages without stopword constants:"
comm -23 /tmp/enum_languages.txt /tmp/constant_languages.txt
```

Length of output: 5245
Couldn’t extract enum variants correctly before – let’s re-run with a more reliable script:
🏁 Script executed:

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Pull out the enum variants
sed -n '/enum Language {/,/}/p' lib/segment/src/data_types/index.rs \
  | sed -e '1d' -e '$d' \
  | sed 's/,//g; s/^[[:space:]]*//; s/[[:space:]]*$//' \
  | tr '[:lower:]' '[:upper:]' \
  | sort > /tmp/enum_languages.txt

# 2. Pull out stopword constants
grep -R 'pub const [A-Z_]\+_STOPWORDS' lib/segment/src/index/field_index/full_text_index/stop_words_container.rs \
  | sed 's/.*pub const \([A-Z_]\+\)_STOPWORDS.*/\1/' \
  | sort > /tmp/constant_languages.txt

# 3. Report
echo "Enum variants:"
cat /tmp/enum_languages.txt
echo -e "\nStopword constants:"
cat /tmp/constant_languages.txt
echo -e "\nMissing stopword constants:"
grep -vxFf /tmp/constant_languages.txt /tmp/enum_languages.txt || echo "None"
```

Length of output: 371
The `grep` exits with a non-zero status when there are no matches, killing the script under `set -e`. Ignore that failure and re-run:

🏁 Script executed:

```bash
#!/usr/bin/env bash
set -u pipefail

# 1. Extract enum variants
sed -n '/enum Language {/,/}/p' lib/segment/src/data_types/index.rs \
  | sed '1d;$d;s/,//g;s/^[[:space:]]*//;s/[[:space:]]*$//' \
  | tr '[:lower:]' '[:upper:]' \
  | sort > /tmp/enum_languages.txt

# 2. Extract stopword constants (ignore "no match")
grep -R 'pub const [A-Z_]\+_STOPWORDS' lib/segment/src/index/field_index/full_text_index/stop_words_container.rs \
  || true \
  | sed 's/.*pub const \([A-Z_]\+\)_STOPWORDS.*/\1/' \
  | sort > /tmp/constant_languages.txt

# 3. Report
echo "Enum variants:"
cat /tmp/enum_languages.txt
echo -e "\nStopword constants:"
cat /tmp/constant_languages.txt
echo -e "\nMissing stopword constants:"
comm -23 /tmp/enum_languages.txt /tmp/constant_languages.txt || echo "None"
```

Length of output: 999
Extract both the enum variants and all `_STOPWORDS` identifiers (from both modules) and diff them using `grep` (avoiding `comm`):

🏁 Script executed:

```bash
#!/usr/bin/env bash
set -u pipefail

# 1. Extract all Language enum variants
sed -n '/enum Language {/,/}/p' lib/segment/src/data_types/index.rs \
  | sed '1d;$d;s/,//g;s/^[[:space:]]*//;s/[[:space:]]*$//' \
  | tr '[:lower:]' '[:upper:]' \
  | sort > /tmp/enums.txt

# 2. Extract all *_STOPWORDS constants from both modules
rg -h -o '[A-Z_]\+_STOPWORDS' lib/segment/src/index/field_index/full_text_index/stop_words*.rs \
  | sed 's/_STOPWORDS$//' \
  | sort -u > /tmp/constants.txt

# 3. Report
echo "Enum variants:"
cat /tmp/enums.txt
echo -e "\nStopword constants:"
cat /tmp/constants.txt
echo -e "\nLanguages missing stopword constants:"
grep -vxFf /tmp/constants.txt /tmp/enums.txt || echo "None"
```

Length of output: 10406
Pull constants from both `stop_words.rs` and `stop_words_container.rs` (dropping the mistaken `-h`) and diff against the enum:

🏁 Script executed:

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Extract enum variants
sed -n '/enum Language {/,/}/p' lib/segment/src/data_types/index.rs \
  | sed '1d;$d;s/,//g;s/^[[:space:]]*//;s/[[:space:]]*$//' \
  | tr '[:lower:]' '[:upper:]' \
  | sort > /tmp/enums.txt

# 2. Extract STOPWORDS constants from both files
grep -R 'pub const [A-Z_]\+_STOPWORDS' \
  lib/segment/src/index/field_index/full_text_index/stop_words.rs \
  lib/segment/src/index/field_index/full_text_index/stop_words_container.rs \
  | sed 's/.*pub const \([A-Z_]\+\)_STOPWORDS.*/\1/' \
  | sort -u > /tmp/constants.txt

# 3. Diff
echo "Languages missing stopword constants:"
grep -vxFf /tmp/constants.txt /tmp/enums.txt || echo "None"
```

Length of output: 422
Stopword coverage verified. All 30 `Language` enum variants have matching `_STOPWORDS` definitions in both `stop_words.rs` and `stop_words_container.rs`. No missing entries were found.

lib/segment/src/index/field_index/full_text_index/stop_words.rs (5)
Lines 14-16: LGTM: Efficient data structure choice. Using `HashSet<String>` provides O(1) lookup performance for stopword checking, which is optimal for this use case.

Lines 18-42: LGTM: Clean constructor with proper configuration handling. The constructor correctly handles both the `Language` and `Set` variants of `StopwordsInterface`, building the stopword set efficiently. The pattern matching is clear and comprehensive.

Lines 45-47: LGTM: Simple and efficient stopword checking. The `is_stopword` method provides case-insensitive matching by converting to lowercase, which is the expected behavior for stopword filtering.

Lines 50-86: LGTM: Comprehensive language mapping with proper error handling. The exhaustive match ensures all `Language` enum variants are handled, and the compiler will catch any missing variants if new languages are added. The implementation correctly converts the static arrays to owned strings.
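A condensed sketch of that filter shape, reusing the sketch types from the earlier example; the constructor signature and the stand-in word lists are illustrative:

```rust
use std::collections::HashSet;

// Stand-in data; the PR ships full per-language constant arrays.
const ENGLISH_STOPWORDS: &[&str] = &["a", "an", "the", "and"];
const SPANISH_STOPWORDS: &[&str] = &["el", "la", "de"];

struct StopwordsFilter {
    stopwords: HashSet<String>,
}

impl StopwordsFilter {
    // Merge language lists and custom words into one lookup set.
    fn new(config: &Option<StopwordsInterface>) -> Self {
        let mut stopwords = HashSet::new();
        match config {
            None => {}
            Some(StopwordsInterface::Language(lang)) => {
                stopwords.extend(stopwords_for(lang).iter().map(|w| w.to_string()));
            }
            Some(StopwordsInterface::Set(set)) => {
                for lang in &set.languages {
                    stopwords.extend(stopwords_for(lang).iter().map(|w| w.to_string()));
                }
                stopwords.extend(set.custom.iter().map(|w| w.to_lowercase()));
            }
        }
        Self { stopwords }
    }

    // Case-insensitive O(1) membership check.
    fn is_stopword(&self, token: &str) -> bool {
        self.stopwords.contains(&token.to_lowercase())
    }
}

// Exhaustive match: adding a Language variant without a word list is a
// compile error, which is the property praised above.
fn stopwords_for(lang: &Language) -> &'static [&'static str] {
    match lang {
        Language::English => ENGLISH_STOPWORDS,
        Language::Spanish => SPANISH_STOPWORDS,
    }
}
```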
Lines 89-204: Excellent test coverage for all stopword scenarios. The test suite comprehensively covers:

- empty stopword sets
- language-specific stopwords
- custom stopwords
- mixed language and custom configurations
- case insensitivity
- coverage of all supported languages

This thorough testing ensures the implementation is robust and handles all expected use cases correctly.
Actionable comments posted: 1
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1)
Line 243: Remove the unnecessary `clone()` call. The `Language` enum likely implements `Copy`, making the `clone()` call redundant.

```diff
-    let option = Some(StopwordsInterface::Language(language.clone()));
+    let option = Some(StopwordsInterface::Language(language));
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (33)
- .github/workflows/codespell.yml (1 hunks)
- lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/README.md (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/arabic.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/azerbaijani.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/basque.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/bengali.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/chinese.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/danish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/dutch.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/english.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/finnish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/french.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/german.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/greek.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/hebrew.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/hinglish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/hungarian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/indonesian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/italian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/kazakh.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/nepali.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/norwegian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/portuguese.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/romanian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/russian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/slovene.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/spanish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/swedish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/tajik.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/turkish.rs (1 hunks)
✅ Files skipped from review due to trivial changes (30)
- lib/segment/src/index/field_index/full_text_index/stop_words/finnish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/hebrew.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/tajik.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/README.md
- lib/segment/src/index/field_index/full_text_index/stop_words/turkish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/russian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/basque.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/swedish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/bengali.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/french.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/dutch.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/romanian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/italian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/german.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/indonesian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/norwegian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/english.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/slovene.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/portuguese.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/nepali.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/hungarian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/spanish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/greek.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/hinglish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/danish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/azerbaijani.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/chinese.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/arabic.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/kazakh.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- lib/segment/src/index/field_index/full_text_index/mod.rs
- .github/workflows/codespell.yml
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: storage-compat-test
- GitHub Check: lint
- GitHub Check: test-consistency
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (2)
lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (2)
Lines 1-34: Well-organized module structure. The module declarations and imports are clean and properly organized, with each language having its dedicated module.

Line 51: ⚠️ Potential issue. Fix the casing inconsistency in the Indonesian stopwords constant name. The constant name should be `INDONESIAN_STOPWORDS` to match the naming convention used for all other languages.

```diff
-pub use indonesian::INDONESIAN_STOPWORDS;
+pub use indonesian::INDONESIAN_STOPWORDS;
```

Likely an incorrect or invalid review comment.
A gRPC interface is also required.
Actionable comments posted: 1
🧹 Nitpick comments (6)
lib/api/src/grpc/proto/collections.proto (3)

Lines 498-502: Rename the `custom` field for clarity. The field name `custom` is ambiguous; renaming it to something like `custom_stopwords` makes its purpose explicit.

Lines 503-508: Consider renaming `StopwordsInterface` to `StopwordsConfig`. To align with other config messages (`VectorsConfig`, `SparseVectorConfig`, etc.), renaming `StopwordsInterface` to `StopwordsConfig` may improve consistency and discoverability.

Lines 516-517: Document the default behavior when unset. Add a comment on the `stopwords` field clarifying that omitting it disables stopword filtering (i.e. the default is no stopwords). This prevents any ambiguity for API consumers.

docs/grpc/docs.md (3)
Lines 73-74: Fix TOC indentation for the new stopwords entries. The `- [StopwordsInterface]` and `- [StopwordsSet]` items are indented with 4 spaces but should use 2 spaces like other second-level entries in the Table of Contents. Apply this diff:

```diff
-    - [StopwordsInterface](#qdrant-StopwordsInterface)
-    - [StopwordsSet](#qdrant-StopwordsSet)
+  - [StopwordsInterface](#qdrant-StopwordsInterface)
+  - [StopwordsSet](#qdrant-StopwordsSet)
```

🪛 markdownlint-cli2 (0.17.2): 73-73, 74-74: Unordered list indentation. Expected: 2; Actual: 4 (MD007, ul-indent)

Line 101: Align the `Language` TOC entry with its siblings. The `- [Language]` entry is over-indented compared to other first-level enum links and should be moved back to 2 spaces:

```diff
-    - [Language](#qdrant-Language)
+  - [Language](#qdrant-Language)
```

🪛 markdownlint-cli2 (0.17.2): 101-101: Unordered list indentation. Expected: 2; Actual: 4 (MD007, ul-indent)

Lines 1635-1636: Enhance the `stopwords` field description in `TextIndexParams`. Consider clarifying the default behavior and linking to `StopwordsInterface`:

```diff
-| stopwords | [StopwordsInterface](#qdrant-StopwordsInterface) | optional | Stopwords for the text index |
+| stopwords | [StopwordsInterface](#qdrant-StopwordsInterface) | optional | Configures language-based or custom stopword filtering; if omitted, no filtering is applied. |
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- docs/grpc/docs.md (5 hunks)
- lib/api/src/grpc/conversions.rs (5 hunks)
- lib/api/src/grpc/proto/collections.proto (1 hunks)
- lib/api/src/grpc/qdrant.rs (3 hunks)

🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2), docs/grpc/docs.md: 73-73, 74-74, 101-101: Unordered list indentation. Expected: 2; Actual: 4 (MD007, ul-indent)
🔇 Additional comments (7)
lib/api/src/grpc/qdrant.rs (3)
Lines 821-849: Well-designed protobuf structures for flexible stopwords configuration. The `StopwordsSet` and `StopwordsInterface` structures provide a clean API design:

- `StopwordsSet` allows combining multiple languages with custom stopwords.
- `StopwordsInterface` uses the oneof pattern for simple (single language) vs complex (multiple languages plus custom) use cases.
- The structure aligns well with protobuf best practices.

This addresses the gRPC interface requirement mentioned in the PR comments.

Lines 869-871: Proper integration with the existing TextIndexParams. The optional `stopwords` field is correctly added to `TextIndexParams` with:

- the appropriate protobuf tag "6" for the next available field number,
- an optional type maintaining backward compatibility,
- clear documentation.

Lines 1739-1843: Comprehensive language enum with proper protobuf patterns. The `Language` enum provides good coverage with 29 languages, including major world languages. The implementation follows standard protobuf patterns with:

- proper enum value assignments starting from 0,
- standard `as_str_name()` and `from_str_name()` helper methods,
- comprehensive language coverage (Arabic, Chinese, European languages, etc.).

The language selection appears well-balanced for international use cases.
lib/api/src/grpc/conversions.rs (4)
Lines 148-220: Excellent, comprehensive language conversion implementation. The bidirectional conversions between the internal and gRPC Language enums are thorough and well implemented. All supported languages are explicitly mapped, with proper error handling for unknown values.

Lines 283-333: Proper stopwords conversion from internal to gRPC format. The implementation correctly handles both the Language and Set variants of StopwordsInterface, converting internal representations to the appropriate gRPC message formats with proper language enum conversions.

Lines 508-568: Robust gRPC-to-internal stopwords conversion with proper error handling. The implementation correctly converts the gRPC StopwordsInterface to the internal format, with comprehensive error handling for invalid language values. Both the Language and Set variants are handled symmetrically to the From conversion.

Lines 47-56 (also applies to 148-220, 283-333, 508-568): Comprehensive stopwords support successfully added to the gRPC API. This implementation resolves the previous limitation where gRPC clients couldn't configure stopwords. The code now provides:

- complete bidirectional Language enum conversions
- proper StopwordsInterface conversion for both Language and Set variants
- comprehensive error handling for invalid language values
- full gRPC API parity with other interfaces for stopwords functionality

The gRPC API now fully supports stopwords configuration as intended.
Actionable comments posted: 0
🧹 Nitpick comments (4)
docs/grpc/docs.md (4)
Lines 73-74: Fix unordered list indentation for the new Table of Contents entries. Indentation should be 2 spaces for ToC list items; please adjust these entries to match the surrounding items. (🪛 markdownlint-cli2 MD007, ul-indent: expected 2, actual 4)

Line 101: Fix unordered list indentation for the `Language` entry. This list item also uses 4 spaces but should use 2 to align with the other entries. (🪛 markdownlint-cli2 MD007, ul-indent: expected 2, actual 4)

Lines 1466-1476: Ensure the `StopwordsInterface` section follows the existing Markdown patterns. The new `StopwordsInterface` heading and table should match the style of other sections (consistent heading depth, spacing before and after, and a brief description). Consider adding a short introductory sentence above the table.

Lines 1482-1491: Ensure the `StopwordsSet` section follows the existing Markdown patterns. As with `StopwordsInterface`, the `StopwordsSet` section would benefit from a brief intro and consistent table formatting (spacing, header styling).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (42)
- .github/workflows/codespell.yml (1 hunks)
- docs/grpc/docs.md (5 hunks)
- docs/redoc/master/openapi.json (2 hunks)
- lib/api/src/grpc/conversions.rs (5 hunks)
- lib/api/src/grpc/proto/collections.proto (1 hunks)
- lib/api/src/grpc/qdrant.rs (3 hunks)
- lib/segment/src/data_types/index.rs (2 hunks)
- lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/README.md (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/arabic.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/azerbaijani.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/basque.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/bengali.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/chinese.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/danish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/dutch.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/english.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/finnish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/french.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/german.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/greek.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/hebrew.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/hinglish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/hungarian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/indonesian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/italian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/kazakh.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/nepali.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/norwegian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/portuguese.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/romanian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/russian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/slovene.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/spanish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/swedish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/tajik.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/turkish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers.rs (6 hunks)
✅ Files skipped from review due to trivial changes (2)
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/arabic.rs
🚧 Files skipped from review as they are similar to previous changes (39)
- lib/segment/src/index/field_index/full_text_index/mod.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/README.md
- lib/segment/src/index/field_index/full_text_index/stop_words/swedish.rs
- lib/segment/src/index/field_index/full_text_index/tests/mod.rs
- .github/workflows/codespell.yml
- lib/segment/src/index/field_index/full_text_index/stop_words/basque.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/french.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/english.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/azerbaijani.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/dutch.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/danish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/turkish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/finnish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/hungarian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/indonesian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/greek.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/hebrew.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/nepali.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/tajik.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/russian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/norwegian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/romanian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/portuguese.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/german.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/slovene.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/italian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/kazakh.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/hinglish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/spanish.rs
- lib/segment/src/index/field_index/full_text_index/tokenizers.rs
- lib/api/src/grpc/qdrant.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/bengali.rs
- lib/api/src/grpc/proto/collections.proto
- lib/segment/src/data_types/index.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/chinese.rs
- docs/redoc/master/openapi.json
- lib/api/src/grpc/conversions.rs
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2), docs/grpc/docs.md: 73-73, 74-74, 101-101: Unordered list indentation. Expected: 2; Actual: 4 (MD007, ul-indent)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: test-consistency
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-low-resources
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: integration-tests
- GitHub Check: lint
- GitHub Check: storage-compat-test
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (2)
docs/grpc/docs.md (2)
Lines 1635-1636: Verify that the gRPC service definitions include the new `stopwords` field. Please confirm that the `TextIndexParams` message with its `stopwords` field is referenced in the RPC method docs (e.g., CreateCollection, UpdateCollection) so clients can actually use this parameter.

Lines 1928-1964: Language enumeration documented correctly. The `Language` enum entries align with the protobuf definitions, follow the same ordering, and include all expected values.
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/segment/src/data_types/index.rs (1)
Lines 208-213: Consider the potential ambiguity of the untagged enum. The `#[serde(untagged)]` approach works but could cause confusion during deserialization: if a user provides a string that matches a language name, the Language variant takes precedence over any custom-stopword interpretation. Given the comprehensive language list and aliases, this is likely the intended behavior, and the tests demonstrate it works correctly.
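A small illustration of that precedence, using the sketch types from earlier in this review; serde tries untagged variants in declaration order, so a bare string that names a language always becomes the Language variant:

```rust
// Demonstrates which variant wins for each JSON shape.
fn untagged_precedence_demo() {
    // A bare language string matches the Language variant first.
    let v: StopwordsInterface = serde_json::from_str(r#""english""#).unwrap();
    assert!(matches!(v, StopwordsInterface::Language(_)));

    // An object cannot be a Language, so it falls through to Set;
    // there is no ambiguity for object-shaped input.
    let v: StopwordsInterface = serde_json::from_str(r#"{"custom": ["foo"]}"#).unwrap();
    assert!(matches!(v, StopwordsInterface::Set(_)));
}
```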
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- docs/redoc/master/openapi.json (2 hunks)
- lib/api/src/grpc/conversions.rs (5 hunks)
- lib/segment/src/data_types/index.rs (3 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers.rs (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- lib/segment/src/index/field_index/full_text_index/tokenizers.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs
- lib/api/src/grpc/conversions.rs
- docs/redoc/master/openapi.json
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests-consensus
- GitHub Check: integration-tests
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: lint
- GitHub Check: storage-compat-test
🔇 Additional comments (6)
lib/segment/src/data_types/index.rs (6)
Line 1: Import addition looks good. The import of `BTreeSet` is appropriate for the stopwords collections and provides deterministic ordering.

Lines 204-206: Stopwords field integration is well-designed. The optional `stopwords` field with `skip_serializing_if = "Option::is_none"` ensures backward compatibility, as discussed in past reviews.

Lines 215-278: Excellent language coverage with ISO aliases. The Language enum addresses the previous review suggestion about ISO abbreviations as aliases. The comprehensive coverage of languages with their standard ISO codes will provide a good user experience. The `UnspecifiedLanguage` variant with the "unspecified" alias carries a clear meaning: it distinguishes stopwords for an unspecified language from `None` (no stopwords at all).

Lines 280-324: Custom deserialization provides a good user experience. The implementation handles both full language names and ISO aliases case-insensitively, which addresses the concern from past reviews about better error messages. The error message format is clear and actionable.
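A minimal sketch of such case-insensitive parsing with ISO aliases; the alias table is abbreviated and illustrative, and it would replace the derived `Deserialize` on the earlier sketch's `Language`:

```rust
use serde::de::{self, Deserializer};
use serde::Deserialize;

impl<'de> Deserialize<'de> for Language {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        let s = String::deserialize(deserializer)?;
        // Lowercase once, then accept either the full name or the ISO code.
        match s.to_lowercase().as_str() {
            "english" | "en" => Ok(Language::English),
            "spanish" | "es" => Ok(Language::Spanish),
            other => Err(de::Error::custom(format!(
                "unsupported stopwords language: {other}"
            ))),
        }
    }
}
```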
Lines 326-336: StopwordsSet design allows flexible configuration. The struct design supporting both the singular `language` and plural `languages` fields plus custom stopwords provides good flexibility. Using `BTreeSet` for both languages and custom stopwords ensures deterministic ordering and efficient lookups, addressing previous review suggestions. Skipping serialization of empty collections keeps the JSON output clean.
Lines 382-533: Comprehensive test coverage validates the implementation. The test suite thoroughly covers:

- Language and Set variant serialization/deserialization
- ISO alias handling for individual languages and in sets
- error handling for unsupported languages
- various JSON format combinations

This addresses potential edge cases and ensures the API works as expected.
Actionable comments posted: 0
♻️ Duplicate comments (1)

lib/api/src/grpc/conversions.rs (1)

Lines 47-53: Addresses the previous gRPC stopwords support concern. The addition of the `Language` and `StopwordsSet` imports and the comprehensive conversion implementations in this file successfully address the previous review comment about incomplete gRPC stopwords support. gRPC clients can now fully configure stopwords through the API.
🧹 Nitpick comments (2)
docs/grpc/docs.md (2)

Line 73: Fix TOC indentation for the StopwordsSet entry. The bullet for StopwordsSet is indented with 4 spaces but the surrounding list uses 2-space indentation. (🪛 markdownlint-cli2 MD007, ul-indent: expected 2, actual 4)

Line 100: Fix TOC indentation for the Language entry. The bullet for Language is indented with 4 spaces but the surrounding list uses 2-space indentation. (🪛 markdownlint-cli2 MD007, ul-indent: expected 2, actual 4)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- docs/grpc/docs.md (5 hunks)
- lib/api/src/grpc/conversions.rs (5 hunks)
- lib/api/src/grpc/proto/collections.proto (1 hunks)
- lib/api/src/grpc/qdrant.rs (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)
- lib/api/src/grpc/proto/collections.proto
- lib/api/src/grpc/qdrant.rs

🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2), docs/grpc/docs.md: 73-73, 100-100: Unordered list indentation. Expected: 2; Actual: 4 (MD007, ul-indent)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-low-resources
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: storage-compat-test
- GitHub Check: integration-tests-consensus
- GitHub Check: lint
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests
- GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (6)
lib/api/src/grpc/conversions.rs (3)
Lines 147-225: Excellent implementation of the Language enum conversions. The bidirectional conversion between the internal and gRPC Language enums is well implemented, with:

- complete mapping of all language variants
- proper error handling for the UnspecifiedLanguage variant
- clear and descriptive error messages
- consistency with Rust conversion patterns

Lines 298-320: Well-implemented stopwords conversion from internal to gRPC types. The conversion correctly handles both StopwordsInterface variants:

- the Language variant maps to a single-language StopwordsSet
- the Set variant properly extracts the primary language and custom words
- all data is preserved during conversion

Lines 521-567: Comprehensive stopwords conversion with good optimization for common cases. The conversion logic correctly handles all stopwords configuration scenarios:

- empty configurations are optimized to None
- single-language configurations are optimized to the Language variant
- multi-language/custom configurations use the Set variant with proper primary/secondary language handling

The comment noting that the primary language remains available for future lemmatizer use is helpful context.
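A sketch of that normalization expressed with the internal sketch types from earlier; the real code maps gRPC messages, so the function shape and names here are illustrative:

```rust
// Collapse a decoded gRPC stopwords config into the internal form:
// empty -> None, one language and no custom words -> Language,
// anything else -> Set.
fn normalize_stopwords(
    languages: Vec<Language>,
    custom: Vec<String>,
) -> Option<StopwordsInterface> {
    if custom.is_empty() {
        if languages.is_empty() {
            // Empty configuration collapses to None.
            return None;
        }
        if languages.len() == 1 {
            // Exactly one language and no custom words: the simple variant.
            return Some(StopwordsInterface::Language(
                languages.into_iter().next().unwrap(),
            ));
        }
    }
    // Anything else becomes a full set.
    Some(StopwordsInterface::Set(StopwordsSet {
        languages: languages.into_iter().collect(),
        custom: custom.into_iter().collect(),
    }))
}
```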
docs/grpc/docs.md (3)
Lines 1465-1475: Approve the StopwordsSet documentation. The new `StopwordsSet` section follows the established pattern for message definitions, with a clear table of fields for `languages` and `custom`.

Line 1618: Approve the `stopwords` field in TextIndexParams. The optional `stopwords` field in `TextIndexParams` is correctly introduced with the appropriate link and description.

Lines 1911-1918: Approve the Language enum addition. The `Language` enum and its anchor are correctly added, enumerating the supported languages for stopword filtering.
Actionable comments posted: 1
♻️ Duplicate comments (1)

lib/segment/src/data_types/index.rs (1)

Lines 217-280: Address the past review feedback about the UnspecifiedLanguage variant. Past review comments reached consensus to remove the `UnspecifiedLanguage` variant from the internal data structures and keep it only in the gRPC API layer; the internal `Language` enum still includes it. Consider removing `UnspecifiedLanguage` from this internal enum and handling the conversion at the gRPC boundary as previously discussed.
🧹 Nitpick comments (1)
docs/grpc/docs.md (1)

Line 73: Fix unordered list indentation. Adjust the indentation to match the rest of the Table of Contents (2 spaces). Apply this diff:

```diff
-    - [StopwordsSet](#qdrant-StopwordsSet)
+  - [StopwordsSet](#qdrant-StopwordsSet)
```

🪛 markdownlint-cli2 (0.17.2): 73-73: Unordered list indentation. Expected: 2; Actual: 4 (MD007, ul-indent)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs/grpc/docs.md (3 hunks)
- docs/redoc/master/openapi.json (2 hunks)
- lib/api/src/grpc/conversions.rs (5 hunks)
- lib/api/src/grpc/proto/collections.proto (1 hunks)
- lib/api/src/grpc/qdrant.rs (2 hunks)
- lib/segment/src/data_types/index.rs (3 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers.rs (6 hunks)

✅ Files skipped from review due to trivial changes (3)
- lib/api/src/grpc/proto/collections.proto
- lib/api/src/grpc/qdrant.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs

🚧 Files skipped from review as they are similar to previous changes (2)
- docs/redoc/master/openapi.json
- lib/api/src/grpc/conversions.rs

🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2), docs/grpc/docs.md: 73-73: Unordered list indentation. Expected: 2; Actual: 4 (MD007, ul-indent)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: storage-compat-test
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: test-low-resources
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: lint
- GitHub Check: integration-tests-consensus
- GitHub Check: integration-tests
🔇 Additional comments (13)
docs/grpc/docs.md (2)

Lines 1464-1478: Document the new `StopwordsSet` message. The `StopwordsSet` anchor, heading, and property table are correctly defined and align with the added protobuf schema.

Line 1617: Add `stopwords` to `TextIndexParams`. The optional `stopwords` field is now documented alongside the other text-index parameters.

lib/segment/src/index/field_index/full_text_index/tokenizers.rs (5)
Line 4: LGTM: Import addition for stopword filtering. The import of `StopwordsFilter` is correctly placed and necessary for the new stopword filtering functionality.

Lines 85-86 (also applies to 104-106): LGTM: Well-designed stopword filtering integration. The stopword filtering is correctly integrated into the token processing pipeline:

- filter creation uses the appropriate configuration parameters
- filtering occurs after length validation but before case normalization
- the early-return pattern efficiently skips stopwords

This ensures consistent behavior across different tokenizer configurations.

Lines 148-151: LGTM: Necessary test imports. The additional imports are required for the comprehensive stopword filtering test cases.

Line 270: LGTM: Backward compatibility maintained. Adding `stopwords: None` to the existing test ensures backward compatibility with the new optional field.

Lines 285-558: LGTM: Comprehensive test coverage for stopword filtering. The test suite validates stopword filtering across multiple scenarios:

- language-specific stopwords (English, Spanish combinations)
- custom stopword sets
- mixed language and custom configurations
- case-sensitivity handling
- edge cases with punctuation and contractions

Each test properly validates both stopword removal and preservation of non-stopwords, ensuring robust functionality.
lib/segment/src/data_types/index.rs (6)

Lines 1-4: LGTM: Necessary imports for stopword functionality. The additional imports support the new data structures and custom trait implementations.

Lines 206-208: LGTM: Clean addition of the stopwords field. The optional `stopwords` field is properly integrated into `TextIndexParams` with appropriate serde annotations for backward compatibility.

Lines 210-215: LGTM: Flexible interface design. The untagged `StopwordsInterface` enum provides an intuitive API allowing both simple language strings and complex configurations.

Lines 282-326: LGTM: Robust custom deserialization. The custom deserialization implementation provides an excellent user experience:

- case-insensitive language matching
- support for both full names and ISO codes
- clear error messages for unsupported languages
- comprehensive coverage of supported language variants

Lines 328-341: LGTM: Consistent trait implementations. The `Display` and `FromStr` implementations cleverly reuse the JSON serialization logic, ensuring consistency with the serde configuration.

Lines 343-376: Consider field naming consistency. The `StopwordsSet` structure only defines a `languages` field, but some tests (e.g., line 473) suggest that a `language` field should also work. Based on past review comments, there was discussion about supporting both the singular and plural forms. The current implementation with custom deserialization handles single and multiple values but doesn't address the field naming flexibility that was previously discussed.
Actionable comments posted: 0
🔭 Outside diff range comments (1)

lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)

Lines 301-312: 🛠️ Refactor suggestion / ⚠️ Potential issue. Unit test fails due to unexpected lower-casing. `test_word_tokenizer` invokes the tokenizer with `lowercase = true`, but the expected token list contains `"Привет"` (capital 'П'). The tokenizer lower-cases the token to `"привет"`, causing the assertion to fail. Fix either the call or the expectation; the minimal change is to run the tokenizer in case-sensitive mode:

```diff
-        WordTokenizer::tokenize(text, true, &StopwordsFilter::default(), |token| {
+        WordTokenizer::tokenize(text, false, &StopwordsFilter::default(), |token| {
```
♻️ Duplicate comments (1)

lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)

Lines 207-218: Multi-byte length checks still use logical AND; the bug persists. The `doc_token_filter` keeps tokens when both the byte length and the `char` count satisfy the limit. For UTF-8 strings this lets single-character multi-byte tokens slip through the `min_token_len` filter (and vice versa for `max_token_len`). This exact issue was raised in a previous review and hasn't been fixed.

```diff
-            .map(|min_len| token.len() < min_len && token.chars().count() < min_len)
+            .map(|min_len| token.len() < min_len || token.chars().count() < min_len)
 ...
-            .map(|max_len| token.len() > max_len && token.chars().count() > max_len)
+            .map(|max_len| token.len() > max_len || token.chars().count() > max_len)
```
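A short demonstration of why `&&` is wrong here: for multi-byte UTF-8, byte length and char count disagree, so requiring both checks to fail lets tokens slip through.

```rust
// "é" is 2 bytes but 1 char; with min_token_len = 2 it should be rejected.
fn multibyte_length_demo() {
    let token = "é";
    let min_len = 2;
    // Buggy AND: token.len() (2) < 2 is false, so the whole condition is
    // false and the one-character token is NOT rejected.
    assert!(!(token.len() < min_len && token.chars().count() < min_len));
    // Correct OR: the char-count check alone is enough to reject it.
    assert!(token.len() < min_len || token.chars().count() < min_len);
}
```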
🧹 Nitpick comments (3)
lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs (1)

Lines 107-112: Use `..` to future-proof the struct pattern. The test had to be updated solely because a new `tokenizer` field was added to `MutableFullTextIndex`. Capturing only the fields you actually need and ignoring the rest with `..` keeps the pattern resilient to further internal changes and avoids extra churn.

```diff
-        let MutableFullTextIndex {
-            storage,
-            inverted_index: _,
-            config,
-            tokenizer: _,
-        } = index;
+        let MutableFullTextIndex {
+            storage,
+            config,
+            ..
+        } = index;
```

lib/segment/src/index/field_index/full_text_index/tokenizers.rs (2)
Lines 248-250: `min_token_len`/`max_token_len` conflated with n-gram size. `PrefixTokenizer::tokenize` is called with `self.min_token_len` and `self.max_token_len` as the n-gram range. These fields are documented and used elsewhere as whole-token length limits, so reusing them here overloads their semantics and may break callers that set token-length limits unrelated to prefix size. Consider adding dedicated `min_prefix`/`max_prefix` parameters in `TextIndexParams` to avoid the semantic leakage.
Lines 89-97: Quadratic scanning in prefix generation. Inside every `n` loop you recreate the iterator and traverse `char_indices()` up to `n` (`nth(n)`), resulting in O(L²) work per word. Precomputing the offsets once avoids the repeated walk:

```diff
-        for n in min_ngram..=max_ngram {
-            let ngram = word_cow.char_indices().map(|(i, _)| i).nth(n);
+        let offsets: Vec<usize> = word_cow.char_indices().map(|(i, _)| i).collect();
+        for n in min_ngram..=max_ngram {
+            let ngram = offsets.get(n).copied();
```

This keeps complexity at O(L) and benefits long tokens.
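A standalone sketch of the precomputed-offsets approach: collect the char boundaries once, then each prefix is an O(1) slice. The function name and shape are illustrative.

```rust
// Return prefixes of min_ngram..=max_ngram chars, clamped to the word.
fn char_prefixes(word: &str, min_ngram: usize, max_ngram: usize) -> Vec<&str> {
    // One O(L) pass over the word instead of one per prefix length.
    let offsets: Vec<usize> = word.char_indices().map(|(i, _)| i).collect();
    let mut prefixes = Vec::new();
    for n in min_ngram..=max_ngram {
        match offsets.get(n) {
            // offsets[n] is the byte index where the (n+1)-th char starts,
            // i.e. the end of an n-char prefix.
            Some(&end) => prefixes.push(&word[..end]),
            None => {
                // Fewer than n chars left: the whole word is the last prefix.
                prefixes.push(word);
                break;
            }
        }
    }
    prefixes
}
```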
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (46)
- .github/workflows/codespell.yml (1 hunks)
- docs/grpc/docs.md (3 hunks)
- docs/redoc/master/openapi.json (2 hunks)
- lib/api/src/grpc/conversions.rs (6 hunks)
- lib/api/src/grpc/proto/collections.proto (1 hunks)
- lib/api/src/grpc/qdrant.rs (2 hunks)
- lib/segment/src/data_types/index.rs (3 hunks)
- lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (3 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (5 hunks)
- lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (5 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/README.md (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/arabic.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/azerbaijani.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/basque.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/bengali.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/chinese.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/danish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/dutch.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/english.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/finnish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/french.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/german.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/greek.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/hebrew.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/hinglish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/hungarian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/indonesian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/italian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/kazakh.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/nepali.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/norwegian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/portuguese.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/romanian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/russian.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/slovene.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/spanish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/swedish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/tajik.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/turkish.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/text_index.rs (7 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers.rs (13 hunks)
✅ Files skipped from review due to trivial changes (3)
- lib/segment/src/index/field_index/full_text_index/stop_words/nepali.rs
- lib/api/src/grpc/qdrant.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/portuguese.rs
🚧 Files skipped from review as they are similar to previous changes (40)
- lib/segment/src/index/field_index/full_text_index/tests/mod.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/README.md
- .github/workflows/codespell.yml
- lib/segment/src/index/field_index/full_text_index/stop_words/turkish.rs
- lib/segment/src/index/field_index/full_text_index/mod.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/dutch.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/basque.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/german.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/french.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/azerbaijani.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/danish.rs
- lib/api/src/grpc/proto/collections.proto
- lib/segment/src/index/field_index/full_text_index/stop_words/hebrew.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/bengali.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/english.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/hungarian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/indonesian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/tajik.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/romanian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/italian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/spanish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/hinglish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/arabic.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/finnish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/slovene.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/swedish.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/chinese.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/norwegian.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/kazakh.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/greek.rs
- lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
- lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs
- lib/api/src/grpc/conversions.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/russian.rs
- lib/segment/src/index/field_index/full_text_index/text_index.rs
- docs/redoc/master/openapi.json
- lib/segment/src/data_types/index.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (3)

- lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (1)
  - `new` (93-103)
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1)
  - `new` (71-103)
- lib/segment/src/data_types/index.rs (2)
  - `new_custom` (219-224)
  - `new_set` (232-237)
🪛 markdownlint-cli2 (0.17.2)
docs/grpc/docs.md
73-73: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consistency
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: integration-tests
- GitHub Check: lint
- GitHub Check: storage-compat-test
🔇 Additional comments (3)
docs/grpc/docs.md (3)
`73-73`: **Approve: TOC updated with StopwordsSet**

The new `StopwordsSet` entry is correctly inserted in the Table of Contents and follows the existing indentation and ordering.

🧰 Tools

🪛 markdownlint-cli2 (0.17.2)

73-73: Unordered list indentation. Expected: 2; Actual: 4 (MD007, ul-indent)

`1464-1474`: **Approve: StopwordsSet documentation added**

The `StopwordsSet` message definition and its field descriptions align with the style of other message docs and clearly explain both `languages` and `custom` stopword options.

`1617-1617`: **Approve: TextIndexParams extended with stopwords**

The optional `stopwords` field has been added to `TextIndexParams` with a clear description, enabling users to configure stopword filtering in the text index.
Actionable comments posted: 0
♻️ Duplicate comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)
`207-218`: ⚠️ **Potential issue: Fix logical error in `min_token_len`/`max_token_len` checks – use `||`, not `&&`**

Using a logical AND means the filter is triggered only when both the byte length and the char count violate the limit. For multi-byte characters this allows out-of-spec tokens to slip through (e.g. `"é"` has `len() == 2`, `chars().count() == 1`).

```diff
-.map(|min_len| token.len() < min_len && token.chars().count() < min_len)
+.map(|min_len| token.len() < min_len || token.chars().count() < min_len)
 …
-.map(|max_len| token.len() > max_len && token.chars().count() > max_len)
+.map(|max_len| token.len() > max_len || token.chars().count() > max_len)
```

This enforces the bounds consistently for both ASCII and UTF-8 multibyte tokens.
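To make the intended semantics concrete, here is a minimal standalone predicate with the OR fix applied; the function name and shape are illustrative, not the code under review:

```rust
// Reject a token as soon as either its byte length or its character count
// violates the configured bound.
fn within_limits(token: &str, min_len: Option<usize>, max_len: Option<usize>) -> bool {
    let bytes = token.len();
    let chars = token.chars().count();
    let too_short = min_len.is_some_and(|min| bytes < min || chars < min);
    let too_long = max_len.is_some_and(|max| bytes > max || chars > max);
    !too_short && !too_long
}

fn main() {
    // "é" is 2 bytes but 1 char: with `&&` it slipped through, with `||` it is rejected.
    assert!(!within_limits("é", Some(2), None));
    assert!(within_limits("ab", Some(2), Some(10)));
}
```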
🧹 Nitpick comments (2)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (2)
`89-98`: **Avoid O(n²) behaviour when generating n-grams**

`char_indices()` is recomputed for every `n` in the loop, walking the word repeatedly. Pre-compute the char boundaries once and reuse them – this cuts the complexity from O(L·N) to O(L + N) and allocates at most one small `Vec`.

Pseudocode sketch:

```rust
let boundaries: Vec<usize> = word_cow.char_indices().map(|(i, _)| i).collect();
for idx in min_ngram..=max_ngram {
    if let Some(&end) = boundaries.get(idx) {
        callback(&word_cow[..end]);
    } else {
        callback(&word_cow);
        break;
    }
}
```
`45-56`: **Keep contractions intact in `WordTokenizer`**

Splitting on every non-alphanumeric char breaks common contractions (`i'd`, `t'ha`) into multiple tokens, but your tests treat them as single units and stopwords can contain apostrophes.

```diff
-for token in text.split(|c| !char::is_alphanumeric(c)) {
+for token in text.split(|c| !(char::is_alphanumeric(c) || c == '\'')) {
```

This preserves English/Italian contractions without affecting other language tokens.
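A quick standalone check of the apostrophe-aware predicate (the helper name is made up for the example):

```rust
// Split on characters that are neither alphanumeric nor an ASCII apostrophe,
// keeping contractions as single tokens.
fn word_tokens(text: &str) -> Vec<&str> {
    text.split(|c: char| !(c.is_alphanumeric() || c == '\''))
        .filter(|token| !token.is_empty())
        .collect()
}

fn main() {
    assert_eq!(word_tokens("i'd say t'ha works"), ["i'd", "say", "t'ha", "works"]);
}
```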
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs
(1 hunks)lib/segment/src/index/field_index/full_text_index/tokenizers.rs
(12 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: test-low-resources
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: test-consistency
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: lint
🔇 Additional comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)
`22-31`: **Lower-casing before the stopword check is now consistent – good job**

The previous double-handling bug is gone; the filter and probe are both case-normalised once. Implementation looks correct.
Actionable comments posted: 1
🧹 Nitpick comments (2)
lib/segment/src/index/field_index/full_text_index/text_index.rs (2)
`161-167`: **Method is fine but could be simplified with `as_ref()`**

The `get_tokenizer` helper works, yet the triple match duplicates logic already present in many other helpers. A tiny ergonomic win is to rely on `as_ref()` on `self` and pattern-match only once, e.g.:

```diff
-    fn get_tokenizer(&self) -> &Tokenizer {
-        match self {
-            Self::Mutable(index) => &index.tokenizer,
-            Self::Immutable(index) => &index.tokenizer,
-            Self::Mmap(index) => &index.tokenizer,
-        }
-    }
+    fn get_tokenizer(&self) -> &Tokenizer {
+        match self {
+            FullTextIndex::Mutable(idx)
+            | FullTextIndex::Immutable(idx) => &idx.tokenizer,
+            FullTextIndex::Mmap(idx) => &idx.tokenizer,
+        }
+    }
```

Not critical, just trims three identical arms.
`245-276`: **Keep test-only helpers private to tests**

`serialize_token_set`/`deserialize_token_set` are `pub(super)` but only used inside the `#[cfg(test)]` module. To avoid the symbols leaking in non-test builds you can scope them to the `tests` module:

```diff
-#[cfg(test)]
-pub(super) fn serialize_token_set(...)
+#[cfg(test)]
+fn serialize_token_set(...)
```

Same for `deserialize_token_set`. Minor hygiene, no functional impact.
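If it helps, a minimal shape of that scoping (names abbreviated and bodies stubbed; not the real helpers):

```rust
// Sketch: the helper lives inside the test module, so it compiles only for
// `cargo test` builds and no symbol leaks into release builds.
pub struct TokenSet;

#[cfg(test)]
mod tests {
    use super::TokenSet;

    // Private to this module; no pub(super) needed when only tests call it.
    fn serialize_token_set(_set: &TokenSet) -> Vec<u8> {
        Vec::new() // placeholder body for the sketch
    }

    #[test]
    fn roundtrip() {
        let bytes = serialize_token_set(&TokenSet);
        assert!(bytes.is_empty());
    }
}
```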
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (3 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (5 hunks)
- lib/segment/src/index/field_index/full_text_index/text_index.rs (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs
- lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-consistency
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: integration-tests
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: storage-compat-test
- GitHub Check: lint
🔇 Additional comments (1)
lib/segment/src/index/field_index/full_text_index/text_index.rs (1)
`323-347`: **Good switch to instance tokenizer**

Replacing the static `Tokenizer::tokenize_*` calls with the instance obtained via `get_tokenizer()` correctly respects per-index stop-word configuration. ✔️
Actionable comments posted: 1
♻️ Duplicate comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)
`207-218`: ⚠️ **Potential issue: Incorrect logical operator lets too-short / too-long tokens slip through**

Both min- and max-length checks use logical AND on byte and char lengths, meaning the filter triggers only when both representations violate the bound. For UTF-8 multibyte tokens this is wrong: a 1-character, 4-byte emoji (`"🦀"`) with `min_token_len = 2` is shorter in chars but not in bytes, so it bypasses the guard.

```diff
-.map(|min_len| token.len() < min_len && token.chars().count() < min_len)
+.map(|min_len| token.len() < min_len || token.chars().count() < min_len)
 ...
-.map(|max_len| token.len() > max_len && token.chars().count() > max_len)
+.map(|max_len| token.len() > max_len || token.chars().count() > max_len)
```

The same change in the max-length branch guarantees the predicate blocks a token as soon as either representation violates the configured limits.

(This was flagged previously and is still unfixed.)
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)
`89-98`: **O(n²) byte-walk in prefix n-gram loop**

Inside each word you recompute `char_indices()` for every `n` in the range, resulting in quadratic work for long tokens. Capture the boundary offsets once and reuse them:

```rust
let boundaries: Vec<usize> = word_cow.char_indices().map(|(i, _)| i).collect();
for n in min_ngram..=max_ngram {
    match boundaries.get(n) {
        Some(&end) => callback(&word_cow[..end]),
        None => {
            callback(&word_cow);
            break;
        }
    }
}
```

Minor, but it keeps the tokenizer linear in input size.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers.rs (12 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs
- lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: integration-tests-consensus
- GitHub Check: integration-tests
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: lint
- GitHub Check: storage-compat-test
lib/segment/src/index/field_index/full_text_index/tokenizers.rs
Actionable comments posted: 1
♻️ Duplicate comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)
`207-218`: ⚠️ **Potential issue: Length-filter logic still wrong for multi-byte tokens**

The predicate still uses logical AND for both the byte and char length checks, letting UTF-8 tokens slip through the limits (e.g. `"é"` passes `min_len = 2`). Switch to OR so a violation on either representation rejects the token.

```diff
-        if self
-            .min_token_len
-            .map(|min_len| token.len() < min_len && token.chars().count() < min_len)
+        if self
+            .min_token_len
+            .map(|min_len| token.len() < min_len || token.chars().count() < min_len)
 ...
-        if self
-            .max_token_len
-            .map(|max_len| token.len() > max_len && token.chars().count() > max_len)
+        if self
+            .max_token_len
+            .map(|max_len| token.len() > max_len || token.chars().count() > max_len)
```

This is the same issue flagged in a prior review and is still unresolved.
🧹 Nitpick comments (2)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (2)
`89-98`: **`char_indices()` recomputed inside inner loop – quadratic cost**

`word_cow.char_indices()` is rebuilt for every `n` in `min_ngram..=max_ngram`, turning the prefix generation into O(L²). Cache the char offsets once:

```rust
let boundaries: Vec<usize> = word_cow.char_indices().map(|(i, _)| i).collect();
for n in min_ngram..=max_ngram {
    match boundaries.get(n) {
        Some(&end) => callback(&word_cow[..end]),
        None => {
            callback(&word_cow);
            break;
        }
    }
}
```

This reduces the per-word work to O(L).
`22-27`: **Avoid needless allocations when lower-casing**

`token.to_lowercase()` allocates a new `String` for every token even when the token is already lowercase ASCII. Consider using `to_ascii_lowercase()` as an ASCII-only fast path and only falling back to the full Unicode case-fold when necessary, or defer allocation until a case change is actually detected.

Also applies to: 50-55
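One way to defer the allocation, sketched as a free-standing helper (the name and `Cow`-based signature are invented for illustration, not taken from the codebase):

```rust
use std::borrow::Cow;

// Borrow the token unchanged when every char is already its own lowercase
// form; allocate only when a case change is actually required.
fn lowercase_token(token: &str) -> Cow<'_, str> {
    let already_lower = token
        .chars()
        .all(|c| c.to_lowercase().eq(std::iter::once(c)));
    if already_lower {
        Cow::Borrowed(token)
    } else {
        Cow::Owned(token.to_lowercase())
    }
}

fn main() {
    assert!(matches!(lowercase_token("héllo"), Cow::Borrowed(_)));
    assert!(matches!(lowercase_token("Héllo"), Cow::Owned(_)));
}
```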
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs
(12 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: lint
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consensus-compose
- GitHub Check: test-consistency
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: storage-compat-test
```rust
self.min_token_len.unwrap_or(1),
self.max_token_len.unwrap_or(usize::MAX),
token_filter,
),
```
🛠️ Refactor suggestion
**`min_token_len` reused as n-gram lower bound – semantics clash**

`PrefixTokenizer::tokenize` is called with `self.min_token_len`/`max_token_len` as the n-gram bounds, overloading fields that conceptually limit whole-token size. A document-level length limit need not equal prefix-size requirements and may surprise users. Introduce dedicated `min_ngram`/`max_ngram` fields or pass explicit constants instead.
🤖 Prompt for AI Agents
In lib/segment/src/index/field_index/full_text_index/tokenizers.rs around lines
248 to 251, the code incorrectly reuses min_token_len and max_token_len as
n-gram bounds in PrefixTokenizer::tokenize, causing a semantic mismatch between
whole-token length limits and n-gram size. To fix this, introduce new dedicated
fields such as min_ngram and max_ngram for n-gram bounds or pass explicit
constant values for n-gram size limits instead of reusing min_token_len and
max_token_len.
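As a rough sketch of the suggested separation – all field and type names here are invented for illustration, not taken from the codebase:

```rust
// Hypothetical config split: whole-token limits and prefix (n-gram) bounds
// are independent knobs instead of one overloaded pair of fields.
struct PrefixTokenizer {
    /// Limits applied to whole tokens before indexing.
    min_token_len: Option<usize>,
    max_token_len: Option<usize>,
    /// Bounds for generated prefixes, configured separately.
    min_ngram: usize,
    max_ngram: usize,
}

impl PrefixTokenizer {
    fn tokenize(&self, word: &str, mut callback: impl FnMut(&str)) {
        // Whole-token limits use min/max_token_len...
        let chars = word.chars().count();
        if self.min_token_len.is_some_and(|min| chars < min)
            || self.max_token_len.is_some_and(|max| chars > max)
        {
            return;
        }
        // ...while prefix generation uses the dedicated n-gram bounds.
        let boundaries: Vec<usize> = word.char_indices().map(|(i, _)| i).collect();
        for n in self.min_ngram..=self.max_ngram {
            match boundaries.get(n) {
                Some(&end) => callback(&word[..end]),
                None => {
                    callback(word);
                    break;
                }
            }
        }
    }
}

fn main() {
    let t = PrefixTokenizer {
        min_token_len: Some(2),
        max_token_len: None,
        min_ngram: 1,
        max_ngram: 3,
    };
    let mut out = Vec::new();
    t.tokenize("hello", |p| out.push(p.to_owned()));
    assert_eq!(out, ["h", "he", "hel"]);
}
```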
```rust
impl FromStr for Language {
    type Err = serde_json::Error;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
```
Note that we already have `strum` in our dependency tree. I agree that this is a bit hacky, but I like that it ensures (de)serialization is compatible.
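For reference, a sketch of what the `strum`-based variant could look like, assuming `strum` with its derive feature is available; the variants and the `"es"` alias are illustrative, not the actual `Language` enum:

```rust
use std::str::FromStr;
use strum::EnumString;

// EnumString derives FromStr; serialize_all keeps the string forms lowercase
// (matching the usual serde rename convention), and extra `serialize` entries
// act as parsing aliases.
#[derive(Debug, PartialEq, EnumString)]
#[strum(serialize_all = "lowercase")]
enum Language {
    English,
    #[strum(serialize = "spanish", serialize = "es")]
    Spanish,
}

fn main() {
    assert_eq!(Language::from_str("english"), Ok(Language::English));
    assert_eq!(Language::from_str("es"), Ok(Language::Spanish));
    assert!(Language::from_str("klingon").is_err());
}
```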
```rust
if token.is_empty() {
    continue;
}
```
According to the docs, this already eliminates empty results.
https://doc.rust-lang.org/std/primitive.str.html#method.split_whitespace
```diff
-if token.is_empty() {
-    continue;
-}
```
Or if we want to be safe:
```diff
-if token.is_empty() {
-    continue;
-}
+debug_assert!(!token.is_empty());
```
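A tiny standalone check of the documented guarantee (plain std, nothing project-specific):

```rust
fn main() {
    // split_whitespace never yields empty slices...
    let tokens: Vec<&str> = "  foo \t bar ".split_whitespace().collect();
    assert_eq!(tokens, ["foo", "bar"]);

    // ...whereas split on a single char keeps the empty fragments,
    // which is why the emptiness check existed in the first place.
    let raw: Vec<&str> = "  foo  bar ".split(' ').collect();
    assert!(raw.iter().any(|t| t.is_empty()));
}
```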
lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs
* Add stopwords support
* Update codespell configuration and improve stopword assertion message
* Add stopwords schema to OpenAPI definition
* Rename StopwordsOption to StopwordsInterface
* Add stopwords support for multiple languages
* Update codespell configuration to skip all Rust files in stop_words directory
* Normalize stopwords to lowercase during insertion
* Add language aliases for stopwords
* + grpc
* Unspecified Language
* clippy fmt
* openapi
* Refactor to use BTreeSet
* update tokenizer test to include custom stopwords with contractions
* simplified grpc StopwordsInterface
* alias for languages
* use `lowercase` flag for stop word normalizing
* grpc: enum -> string
* consistent naming
* Remove UnspecifiedLanguage and use From for conversions
* fix conversions
* fmt
* use stopwords inside different tokenizer
* use ahash
* fix tests
* fix no-rocks-db feature
* simplify code and fix test
* update test
* fix clippy
* fmt

---------

Co-authored-by: generall <andrey@vasnetsov.com>
All Submissions:

* Contributions should target the `dev` branch. Did you create your branch from `dev`?

New Feature Submissions:

* Have you formatted your code locally using the `cargo +nightly fmt --all` command prior to submission?
* Have you checked your code using the `cargo clippy --all --all-features` command?

Changes to Core Features: