
Conversation

n0x29a
Contributor

@n0x29a n0x29a commented Jun 4, 2025

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using the cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using the cargo clippy --all --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

@n0x29a n0x29a changed the base branch from master to dev June 4, 2025 13:56
@n0x29a n0x29a force-pushed the stop-words branch 2 times, most recently from f2ede10 to e6bd5d1 on June 4, 2025 14:56
@agourlay agourlay removed their request for review June 5, 2025 09:09
@n0x29a n0x29a marked this pull request as ready for review June 5, 2025 10:48
Contributor

coderabbitai bot commented Jun 5, 2025

📝 Walkthrough

The change adds extensive support for stopwords in full-text indexing. It introduces an optional stopwords field to the TextIndexParams struct and extends API schemas, including gRPC and protobuf, with a new StopwordsSet message and a Language enum listing supported languages. A StopwordsInterface enum allows specifying stopwords as either a single language or a custom set. A StopwordsFilter struct aggregates stopwords from multiple language-specific static arrays, each defined in dedicated modules. Tokenizers are enhanced to apply stopword filtering during document and query tokenization. The GitHub Codespell workflow is updated to skip Rust files in stop_words directories. The change includes new modules for stopwords, tokenizer refactoring, and comprehensive tests and documentation updates.
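
To make the dual interface concrete, here is a minimal Rust sketch of how an untagged stopwords option of this kind can be modeled and deserialized with serde (assuming serde and serde_json as dependencies). The type and variant names follow the walkthrough's terminology, not the actual code, and only two languages are included for brevity.

use std::collections::BTreeSet;

use serde::Deserialize;

// Illustrative subset of the Language enum described above.
#[derive(Debug, Deserialize, PartialEq, Eq, PartialOrd, Ord)]
#[serde(rename_all = "lowercase")]
enum Language {
    English,
    Spanish,
}

// Hypothetical mirror of StopwordsSet: languages plus custom words.
#[derive(Debug, Default, Deserialize, PartialEq)]
struct StopwordsSet {
    #[serde(default)]
    languages: BTreeSet<Language>,
    #[serde(default)]
    custom: BTreeSet<String>,
}

// Untagged: a plain string parses as a language, an object as a full set.
#[derive(Debug, Deserialize, PartialEq)]
#[serde(untagged)]
enum StopwordsInterface {
    Language(Language),
    Set(StopwordsSet),
}

fn main() {
    let simple: StopwordsInterface = serde_json::from_str(r#""english""#).unwrap();
    let full: StopwordsInterface = serde_json::from_str(
        r#"{"languages": ["english", "spanish"], "custom": ["qdrant"]}"#,
    )
    .unwrap();
    println!("{simple:?}\n{full:?}");
}

The first value parses into the Language variant and the second into the Set variant, which is the "single language or custom set" behaviour the walkthrough describes.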

Suggested reviewers

  • coszio

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
lib/segment/src/index/field_index/full_text_index/stop_words.rs (1)

1-11: Consider using wildcard import for better maintainability.

While the explicit imports show all available languages clearly, a wildcard import would be more maintainable as new languages are added.

-use super::stop_words_container::{
-    ARABIC_STOPWORDS, AZERBAIJANI_STOPWORDS, BASQUE_STOPWORDS, BENGALI_STOPWORDS,
-    CATALAN_STOPWORDS, CHINESE_STOPWORDS, DANISH_STOPWORDS, DUTCH_STOPWORDS, ENGLISH_STOPWORDS,
-    FINNISH_STOPWORDS, FRENCH_STOPWORDS, GERMAN_STOPWORDS, GREEK_STOPWORDS, HEBREW_STOPWORDS,
-    HINGLISH_STOPWORDS, HUNGARIAN_STOPWORDS, INDONESIAN_STOPWORDS, ITALIAN_STOPWORDS,
-    KAZAKH_STOPWORDS, NEPALI_STOPWORDS, NORWEGIAN_STOPWORDS, PORTUGUESE_STOPWORDS,
-    ROMANIAN_STOPWORDS, RUSSIAN_STOPWORDS, SLOVENE_STOPWORDS, SPANISH_STOPWORDS, SWEDISH_STOPWORDS,
-    TAJIK_STOPWORDS, TURKISH_STOPWORDS,
-};
+use super::stop_words_container::*;
🧹 Nitpick comments (2)
.github/workflows/codespell.yml (1)

21-25: Suggestion: Extend skip patterns for stopwords files
You’ve added skip: '**/stop_words_container.rs' to avoid Codespell noise. To prevent false positives on domain-specific terms in both modules, consider skipping stop_words.rs as well (the action's skip input takes a single comma-separated string, so extend it rather than turning it into a YAML list).

Example diff:

with:
-  skip: '**/stop_words_container.rs'
+  skip: '**/stop_words_container.rs,**/stop_words.rs'
docs/redoc/master/openapi.json (1)

7670-7732: Verify enum consistency and enhance schema strictness.

  • Ensure the Language enum values here match the Rust Language enum exactly.
  • Consider adding "additionalProperties": false to StopwordsSet and "uniqueItems": true on its arrays for stricter validation.
  • Add description fields to StopwordsInterface, Language, and StopwordsSet to improve API documentation clarity.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc5b131 and 16a3e36.

📒 Files selected for processing (9)
  • .github/workflows/codespell.yml (1 hunks)
  • docs/redoc/master/openapi.json (2 hunks)
  • lib/api/src/grpc/conversions.rs (2 hunks)
  • lib/segment/src/data_types/index.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers.rs (6 hunks)
🔇 Additional comments (21)
lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1)

168-168: Initialize stopwords in test config
Adding stopwords: None ensures the test reflects the updated TextIndexParams API. Good catch.

lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1)

155-155: Update test config for stopwords support
Explicitly setting stopwords: None in the TextIndexParams aligns with the new optional field. Test update is correct.

lib/segment/src/index/field_index/full_text_index/mod.rs (2)

10-10: Add stop_words module
Declaring the stop_words module integrates the new stopword filtering implementation. Approved.


18-18: Add stop_words_container module
The stop_words_container module declaration is required for managing custom stopword sets. Approved.

lib/api/src/grpc/conversions.rs (1)

217-217:

✅ Verification successful

Verify intentional exclusion of stopwords from gRPC conversion.

The stopwords field is destructured but ignored, which means stopwords configuration is lost when converting from internal representation to gRPC. This could lead to data loss if stopwords are configured via other APIs and then processed through gRPC conversions.

Please verify whether this is intentional. If the gRPC API doesn't support stopwords yet, this should be documented. If it should support stopwords, the field needs to be properly handled in the conversion.


🏁 Script executed:

#!/bin/bash
# Verify if gRPC TextIndexParams proto includes stopwords field
echo "Searching for TextIndexParams proto definition..."
fd -e proto -x grep -l "TextIndexParams" {} \;
echo ""
echo "Searching for stopwords in proto files..."
fd -e proto -x grep -n "stopwords" {} \;

Length of output: 349


gRPC TextIndexParams doesn’t include stopwords—conversion is intentional

The protobuf schema in lib/api/src/grpc/proto/collections.proto for TextIndexParams does not define a stopwords field, so the current conversions correctly ignore it.

If you intend to expose stopwords via gRPC in the future:

  • Add stopwords to the proto definition.
  • Update both From and TryFrom implementations in lib/api/src/grpc/conversions.rs to handle it.

Otherwise, please document in the gRPC API that stopwords are not supported over gRPC at this time.

lib/segment/src/index/field_index/full_text_index/tokenizers.rs (6)

4-4: LGTM: Clean import addition.

The import for StopwordsFilter is appropriately placed and follows the existing import structure.


85-85: LGTM: Proper stopwords filter initialization.

Creating the StopwordsFilter from the configuration at the beginning of the closure is the right approach for performance and clarity.


103-105: LGTM: Optimal placement of stopword filtering.

The stopword check is correctly placed after length validation but before case normalization, ensuring consistent behavior regardless of input case while maintaining the existing validation logic.
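
As a rough illustration of that ordering (length check, then stopword lookup, then optional lowercasing), here is a self-contained sketch; the function and parameter names are invented for the example and are not taken from tokenizers.rs.

use std::collections::HashSet;

// Illustrative token pipeline: length filter -> stopword filter -> case normalization.
fn process_token(
    token: &str,
    min_len: Option<usize>,
    lowercase: bool,
    stopwords: &HashSet<String>,
    mut emit: impl FnMut(String),
) {
    // 1. Length validation happens first, so rejected tokens never hit the stopword set.
    if min_len.is_some_and(|min| token.chars().count() < min) {
        return;
    }
    // 2. Stopword check before case normalization; the lookup itself lowercases,
    //    so "The" and "the" are treated the same.
    if stopwords.contains(&token.to_lowercase()) {
        return;
    }
    // 3. Only surviving tokens are (optionally) lowercased and emitted.
    let token = if lowercase { token.to_lowercase() } else { token.to_string() };
    emit(token);
}

fn main() {
    let stopwords: HashSet<String> = ["the", "and"].iter().map(|s| s.to_string()).collect();
    let mut kept = Vec::new();
    for t in ["The", "quick", "and", "fox"] {
        process_token(t, Some(2), true, &stopwords, |tok| kept.push(tok));
    }
    assert_eq!(kept, vec!["quick".to_string(), "fox".to_string()]);
}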


148-148: LGTM: Necessary test imports added.

The additional imports for stopwords types are required for the new test cases.


267-267: LGTM: Backward compatibility maintained.

Adding stopwords: None to existing test configurations ensures backward compatibility and prevents test failures.


282-395: Excellent test coverage for stopword functionality.

The test suite comprehensively covers:

  • Language-based stopwords with English stopword verification
  • Custom stopwords with user-defined terms
  • Mixed configuration combining language and custom stopwords
  • Proper filtering behavior for both stopwords and non-stopwords

The tests validate that the integration works correctly across different stopword configurations.

lib/segment/src/data_types/index.rs (5)

202-203: LGTM: Well-designed stopwords field addition.

The optional stopwords field with skip_serializing_if ensures backward compatibility while enabling the new feature. The field placement and naming follow existing conventions.


206-211: LGTM: Clean interface design following established patterns.

The StopwordsInterface enum with #[serde(untagged)] provides flexible JSON handling, allowing both simple language strings and complex configuration objects. This design aligns with other Interface enums in the codebase.


247-254: LGTM: Flexible stopwords configuration structure.

The StopwordsSet struct allows combining multiple languages with custom stopwords, and the #[serde(default)] attributes ensure clean JSON handling when fields are omitted.


300-356: Excellent serialization test coverage.

The test suite thoroughly validates:

  • Language variant serialization to simple strings
  • Set variant serialization with both languages and custom words
  • Deserialization from various JSON formats
  • Proper handling of the untagged enum structure

These tests ensure robust API compatibility and data integrity.


213-245:

✅ Verification successful

Verify all languages have corresponding stopword data.

The comprehensive Language enum covers a wide range of languages with proper lowercase serialization. Ensure that all listed languages have corresponding stopword arrays in the stop_words_container module.


🏁 Script executed:

#!/bin/bash
# Description: Verify that all Language enum variants have corresponding stopword constants

# Extract language names from the enum
echo "Checking if all Language enum variants have corresponding stopword constants..."

# Get all language variants from this file
ast-grep --pattern 'enum Language {
  $$$
}' | grep -E '^\s*[A-Z]' | sed 's/,$//' | sort > /tmp/enum_languages.txt

# Search for stopword constants in the stop_words_container module
rg -o '[A-Z_]+_STOPWORDS' --type rust | sed 's/_STOPWORDS$//' | sort | uniq > /tmp/constant_languages.txt

echo "Language enum variants:"
cat /tmp/enum_languages.txt

echo -e "\nStopword constants found:"
cat /tmp/constant_languages.txt

echo -e "\nLanguages without stopword constants:"
comm -23 /tmp/enum_languages.txt /tmp/constant_languages.txt

Length of output: 5245


Couldn’t extract enum variants correctly before – let’s re-run with a more reliable script:


🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail

# 1. Pull out the enum variants
sed -n '/enum Language {/,/}/p' lib/segment/src/data_types/index.rs \
  | sed -e '1d' -e '$d' \
  | sed 's/,//g; s/^[[:space:]]*//; s/[[:space:]]*$//' \
  | tr '[:lower:]' '[:upper:]' \
  | sort > /tmp/enum_languages.txt

# 2. Pull out stopword constants
grep -R 'pub const [A-Z_]\+_STOPWORDS' lib/segment/src/index/field_index/full_text_index/stop_words_container.rs \
  | sed 's/.*pub const \([A-Z_]\+\)_STOPWORDS.*/\1/' \
  | sort > /tmp/constant_languages.txt

# 3. Report
echo "Enum variants:"
cat /tmp/enum_languages.txt

echo -e "\nStopword constants:"
cat /tmp/constant_languages.txt

echo -e "\nMissing stopword constants:"
grep -vxFf /tmp/constant_languages.txt /tmp/enum_languages.txt || echo "None"

Length of output: 371


The grep is exiting with a non‐zero status (no matches) and killing the script under set -e. Let’s ignore that failure and re-run:


🏁 Script executed:

#!/usr/bin/env bash
set -u pipefail

# 1. Extract enum variants
sed -n '/enum Language {/,/}/p' lib/segment/src/data_types/index.rs \
  | sed '1d;$d;s/,//g;s/^[[:space:]]*//;s/[[:space:]]*$//' \
  | tr '[:lower:]' '[:upper:]' \
  | sort > /tmp/enum_languages.txt

# 2. Extract stopword constants (ignore “no match”)
grep -R 'pub const [A-Z_]\+_STOPWORDS' lib/segment/src/index/field_index/full_text_index/stop_words_container.rs \
  || true \
  | sed 's/.*pub const \([A-Z_]\+\)_STOPWORDS.*/\1/' \
  | sort > /tmp/constant_languages.txt

# 3. Report
echo "Enum variants:"
cat /tmp/enum_languages.txt

echo -e "\nStopword constants:"
cat /tmp/constant_languages.txt

echo -e "\nMissing stopword constants:"
comm -23 /tmp/enum_languages.txt /tmp/constant_languages.txt || echo "None"

Length of output: 999


Let’s extract both the enum variants and all _STOPWORDS identifiers (from both modules) and diff them using grep (avoiding comm):


🏁 Script executed:

#!/usr/bin/env bash
set -u pipefail

# 1. Extract all Language enum variants
sed -n '/enum Language {/,/}/p' lib/segment/src/data_types/index.rs \
  | sed '1d;$d;s/,//g;s/^[[:space:]]*//;s/[[:space:]]*$//' \
  | tr '[:lower:]' '[:upper:]' \
  | sort > /tmp/enums.txt

# 2. Extract all *_STOPWORDS constants from both modules
rg -h -o '[A-Z_]\+_STOPWORDS' lib/segment/src/index/field_index/full_text_index/stop_words*.rs \
  | sed 's/_STOPWORDS$//' \
  | sort -u > /tmp/constants.txt

# 3. Report
echo "Enum variants:"
cat /tmp/enums.txt

echo -e "\nStopword constants:"
cat /tmp/constants.txt

echo -e "\nLanguages missing stopword constants:"
grep -vxFf /tmp/constants.txt /tmp/enums.txt || echo "None"

Length of output: 10406


Let’s pull constants from both stop_words.rs and stop_words_container.rs (dropping the mistaken -h) and then diff against the enum:


🏁 Script executed:

#!/usr/bin/env bash
set -euo pipefail

# 1. Extract enum variants
sed -n '/enum Language {/,/}/p' lib/segment/src/data_types/index.rs \
  | sed '1d;$d;s/,//g;s/^[[:space:]]*//;s/[[:space:]]*$//' \
  | tr '[:lower:]' '[:upper:]' \
  | sort > /tmp/enums.txt

# 2. Extract STOPWORDS constants from both files
grep -R 'pub const [A-Z_]\+_STOPWORDS' lib/segment/src/index/field_index/full_text_index/stop_words.rs lib/segment/src/index/field_index/full_text_index/stop_words_container.rs \
  | sed 's/.*pub const \([A-Z_]\+\)_STOPWORDS.*/\1/' \
  | sort -u > /tmp/constants.txt

# 3. Diff
echo "Languages missing stopword constants:"
grep -vxFf /tmp/constants.txt /tmp/enums.txt || echo "None"

Length of output: 422


Stopword coverage verified

All 30 Language enum variants have matching _STOPWORDS definitions in both stop_words.rs and stop_words_container.rs. No missing entries were found.

lib/segment/src/index/field_index/full_text_index/stop_words.rs (5)

14-16: LGTM: Efficient data structure choice.

Using HashSet<String> provides O(1) lookup performance for stopword checking, which is optimal for this use case.


18-42: LGTM: Clean constructor with proper configuration handling.

The constructor correctly handles both Language and Set variants of StopwordsInterface, building the stopword set efficiently. The pattern matching is clear and comprehensive.


45-47: LGTM: Simple and efficient stopword checking.

The is_stopword method provides case-insensitive matching by converting to lowercase, which is the expected behavior for stopword filtering.


50-86: LGTM: Comprehensive language mapping with proper error handling.

The exhaustive match ensures all Language enum variants are handled, and the compiler will catch any missing variants if new languages are added. The implementation correctly converts static arrays to owned strings.


89-204: Excellent test coverage for all stopword scenarios.

The test suite comprehensively covers:

  • Empty stopword sets
  • Language-specific stopwords
  • Custom stopwords
  • Mixed language and custom configurations
  • Case insensitivity
  • Coverage of all supported languages

This thorough testing ensures the implementation is robust and handles all expected use cases correctly.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1)

243-243: Remove unnecessary clone() call.

The Language enum likely implements Copy, making the clone() call redundant.

-            let option = Some(StopwordsInterface::Language(language.clone()));
+            let option = Some(StopwordsInterface::Language(language));
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 16a3e36 and 5b3ce2a.

📒 Files selected for processing (33)
  • .github/workflows/codespell.yml (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/README.md (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/arabic.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/azerbaijani.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/basque.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/bengali.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/chinese.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/danish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/dutch.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/english.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/finnish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/french.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/german.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/greek.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/hebrew.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/hinglish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/hungarian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/indonesian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/italian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/kazakh.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/nepali.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/norwegian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/portuguese.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/romanian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/russian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/slovene.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/spanish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/swedish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/tajik.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/turkish.rs (1 hunks)
✅ Files skipped from review due to trivial changes (30)
  • lib/segment/src/index/field_index/full_text_index/stop_words/finnish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/hebrew.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/tajik.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/README.md
  • lib/segment/src/index/field_index/full_text_index/stop_words/turkish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/russian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/basque.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/swedish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/bengali.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/french.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/dutch.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/romanian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/italian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/german.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/indonesian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/norwegian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/english.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/slovene.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/portuguese.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/nepali.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/hungarian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/spanish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/greek.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/hinglish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/danish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/azerbaijani.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/chinese.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/arabic.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/kazakh.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/segment/src/index/field_index/full_text_index/mod.rs
  • .github/workflows/codespell.yml
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: storage-compat-test
  • GitHub Check: lint
  • GitHub Check: test-consistency
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (2)
lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (2)

1-34: Well-organized module structure.

The module declarations and imports are clean and properly organized with each language having its dedicated module.


51-51: ⚠️ Potential issue

Fix the casing inconsistency in the Indonesian stopwords constant name.

The constant name should be INDONESIAN_STOPWORDS to match the naming convention used for all other languages.

-pub use indonesian::INDONESIAN_STOPWORDS;
+pub use indonesian::INDONESIAN_STOPWORDS;

Likely an incorrect or invalid review comment.

@generall
Member

generall commented Jun 5, 2025

gRPC interface is also required

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (6)
lib/api/src/grpc/proto/collections.proto (3)

498-502: Rename custom field for clarity.

The field name custom is ambiguous; renaming to something like custom_stopwords makes its purpose explicit.


503-508: Consider renaming StopwordsInterface to StopwordsConfig.

To align with other config messages (VectorsConfig, SparseVectorConfig, etc.), renaming StopwordsInterface to StopwordsConfig may improve consistency and discoverability.


516-517: Document default behavior when unset.

Add a comment on the stopwords field clarifying that omitting it disables stopword filtering (i.e. default is no stopwords). This prevents any ambiguity for API consumers.

docs/grpc/docs.md (3)

73-74: Fix TOC indentation for new stopwords entries.

The - [StopwordsInterface] and - [StopwordsSet] items are indented with 4 spaces but should use 2 spaces like other second-level entries in the Table of Contents.

Apply this diff:

-    - [StopwordsInterface](#qdrant-StopwordsInterface)
-    - [StopwordsSet](#qdrant-StopwordsSet)
+  - [StopwordsInterface](#qdrant-StopwordsInterface)
+  - [StopwordsSet](#qdrant-StopwordsSet)
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

73-73: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


74-74: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


101-101: Align Language TOC entry with siblings.

The - [Language] entry is over-indented compared to other first-level enum links. It should be moved back to 2 spaces:

-    - [Language](#qdrant-Language)
+  - [Language](#qdrant-Language)
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

101-101: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


1635-1636: Enhance stopwords field description in TextIndexParams.

Consider clarifying default behavior and linking to StopwordsInterface:

-| stopwords       | [StopwordsInterface](#qdrant-StopwordsInterface) | optional | Stopwords for the text index                             |
+| stopwords       | [StopwordsInterface](#qdrant-StopwordsInterface) | optional | Configures language-based or custom stopword filtering; if omitted, no filtering is applied. |
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 901c08f and fb10276.

📒 Files selected for processing (4)
  • docs/grpc/docs.md (5 hunks)
  • lib/api/src/grpc/conversions.rs (5 hunks)
  • lib/api/src/grpc/proto/collections.proto (1 hunks)
  • lib/api/src/grpc/qdrant.rs (3 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/grpc/docs.md

73-73: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


74-74: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


101-101: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

🔇 Additional comments (7)
lib/api/src/grpc/qdrant.rs (3)

821-849: Well-designed protobuf structures for flexible stopwords configuration.

The StopwordsSet and StopwordsInterface structures provide a clean API design:

  • StopwordsSet allows combining multiple languages with custom stopwords
  • StopwordsInterface uses the oneof pattern for simple (single language) vs complex (multiple languages + custom) use cases
  • The structure aligns well with protobuf best practices

This addresses the gRPC interface requirement mentioned in the PR comments.


869-871: Proper integration with existing TextIndexParams.

The optional stopwords field is correctly added to TextIndexParams with:

  • Appropriate protobuf tag "6" for the next available field number
  • Optional type maintaining backward compatibility
  • Clear documentation

1739-1843: Comprehensive language enum with proper protobuf patterns.

The Language enum provides good coverage with 29 languages including major world languages. The implementation follows standard protobuf patterns with:

  • Proper enum value assignments starting from 0
  • Standard as_str_name() and from_str_name() helper methods
  • Comprehensive language coverage (Arabic, Chinese, European languages, etc.)

The language selection appears well-balanced for international use cases.

lib/api/src/grpc/conversions.rs (4)

148-220: Excellent comprehensive language conversion implementation.

The bidirectional conversions between internal and gRPC Language enums are thorough and well-implemented. All supported languages are explicitly mapped with proper error handling for unknown values.


283-333: Proper stopwords conversion from internal to gRPC format.

The implementation correctly handles both Language and Set variants of StopwordsInterface, converting internal representations to appropriate gRPC message formats with proper language enum conversions.


508-568: Robust gRPC to internal stopwords conversion with proper error handling.

The implementation correctly converts gRPC StopwordsInterface to internal format with comprehensive error handling for invalid language values. Both Language and Set variants are handled symmetrically to the From conversion.
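
As a hedged sketch of what such an error-handled enum conversion typically looks like (the actual mapping in conversions.rs covers all supported languages; only two are shown here and all names are illustrative):

// gRPC enums arrive as i32 values; unknown values must be rejected explicitly.
#[derive(Debug, PartialEq)]
enum Language {
    English,
    Spanish,
}

impl TryFrom<i32> for Language {
    type Error = String;

    fn try_from(value: i32) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(Language::English),
            1 => Ok(Language::Spanish),
            other => Err(format!("unknown Language enum value: {other}")),
        }
    }
}

fn main() {
    assert_eq!(Language::try_from(0), Ok(Language::English));
    assert!(Language::try_from(42).is_err());
}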


47-56: Comprehensive stopwords support successfully added to gRPC API.

This implementation resolves the previous limitation where gRPC clients couldn't configure stopwords. The code now provides:

  • Complete bidirectional Language enum conversions
  • Proper StopwordsInterface conversion for both Language and Set variants
  • Comprehensive error handling for invalid language values
  • Full gRPC API parity with other interfaces for stopwords functionality

The gRPC API now fully supports stopwords configuration as intended.

Also applies to: 148-220, 283-333, 508-568

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
docs/grpc/docs.md (4)

73-74: Fix unordered list indentation for new Table of Contents entries
Indentation should be 2 spaces for ToC list items. Please adjust these entries to match surrounding items.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

73-73: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


74-74: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


101-101: Fix unordered list indentation for Language entry
This list item also uses 4 spaces but should use 2 to align with other entries.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

101-101: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


1466-1476: Ensure StopwordsInterface section follows existing Markdown patterns
The new StopwordsInterface heading and table should match the style of other sections (consistent heading depth, spacing before/after, and a brief description). Consider adding a short introductory sentence above the table.


1482-1491: Ensure StopwordsSet section follows existing Markdown patterns
Similar to StopwordsInterface, the StopwordsSet section would benefit from a brief intro and consistent table formatting (spacing, header styling).

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb10276 and 9aa3d92.

📒 Files selected for processing (42)
  • .github/workflows/codespell.yml (1 hunks)
  • docs/grpc/docs.md (5 hunks)
  • docs/redoc/master/openapi.json (2 hunks)
  • lib/api/src/grpc/conversions.rs (5 hunks)
  • lib/api/src/grpc/proto/collections.proto (1 hunks)
  • lib/api/src/grpc/qdrant.rs (3 hunks)
  • lib/segment/src/data_types/index.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/README.md (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/arabic.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/azerbaijani.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/basque.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/bengali.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/chinese.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/danish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/dutch.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/english.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/finnish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/french.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/german.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/greek.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/hebrew.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/hinglish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/hungarian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/indonesian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/italian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/kazakh.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/nepali.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/norwegian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/portuguese.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/romanian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/russian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/slovene.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/spanish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/swedish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/tajik.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/turkish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers.rs (6 hunks)
✅ Files skipped from review due to trivial changes (2)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/arabic.rs
🚧 Files skipped from review as they are similar to previous changes (39)
  • lib/segment/src/index/field_index/full_text_index/mod.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/README.md
  • lib/segment/src/index/field_index/full_text_index/stop_words/swedish.rs
  • lib/segment/src/index/field_index/full_text_index/tests/mod.rs
  • .github/workflows/codespell.yml
  • lib/segment/src/index/field_index/full_text_index/stop_words/basque.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/french.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/english.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/azerbaijani.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/dutch.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/danish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/turkish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/finnish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/hungarian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/indonesian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/greek.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/hebrew.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/nepali.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/tajik.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/russian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/norwegian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/romanian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/portuguese.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/german.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/slovene.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/italian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/kazakh.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/hinglish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/spanish.rs
  • lib/segment/src/index/field_index/full_text_index/tokenizers.rs
  • lib/api/src/grpc/qdrant.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/bengali.rs
  • lib/api/src/grpc/proto/collections.proto
  • lib/segment/src/data_types/index.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/chinese.rs
  • docs/redoc/master/openapi.json
  • lib/api/src/grpc/conversions.rs
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/grpc/docs.md

73-73: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


74-74: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


101-101: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-consistency
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-low-resources
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (2)
docs/grpc/docs.md (2)

1635-1636: Verify gRPC service definitions include the new stopwords field
Please confirm that the TextIndexParams message with its stopwords field is referenced in the RPC method docs (e.g., CreateCollection, UpdateCollection) so clients can actually use this parameter.


1928-1964: Language enumeration documented correctly
The Language enum entries align with the protobuf definitions, follow the same ordering, and include all expected values.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/segment/src/data_types/index.rs (1)

208-213: Consider potential ambiguity with untagged enum.

The #[serde(untagged)] approach works but can be confusing during deserialization: untagged variants are tried in declaration order, so a plain string that matches a language name is always parsed as the Language variant, even if the user intended something else.

However, given the comprehensive language list and aliases, this is likely the intended behavior and the tests demonstrate it works correctly.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 568d8bb and 73554e5.

📒 Files selected for processing (5)
  • docs/redoc/master/openapi.json (2 hunks)
  • lib/api/src/grpc/conversions.rs (5 hunks)
  • lib/segment/src/data_types/index.rs (3 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers.rs (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • lib/segment/src/index/field_index/full_text_index/tokenizers.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs
  • lib/api/src/grpc/conversions.rs
  • docs/redoc/master/openapi.json
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests-consensus
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
🔇 Additional comments (6)
lib/segment/src/data_types/index.rs (6)

1-1: Import addition looks good.

The import of BTreeSet is appropriate for the stopwords collections and provides deterministic ordering.


204-206: Stopwords field integration is well-designed.

The optional stopwords field with skip_serializing_if = "Option::is_none" ensures backward compatibility as discussed in past reviews.


215-278: Excellent language coverage with ISO aliases.

The Language enum addresses the previous review suggestion about ISO abbreviations as aliases. The comprehensive coverage of languages with their standard ISO codes will provide good user experience.

The UnspecifiedLanguage variant with its "unspecified" alias draws a clear semantic distinction between None (no stopwords configured at all) and a stopword configuration whose language is left unspecified.


280-324: Custom deserialization provides good user experience.

The custom deserialization implementation handles both full language names and ISO aliases case-insensitively, which addresses the concern from past reviews about better error messages. The error message format is clear and actionable.
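
A minimal sketch of that kind of case-insensitive, alias-aware parsing, with an invented two-language subset and error message (not the actual deserializer):

#[derive(Debug, PartialEq)]
enum Language {
    English,
    Spanish,
}

// Accepts full names and ISO codes in any case; rejects everything else with a clear error.
fn parse_language(input: &str) -> Result<Language, String> {
    match input.to_lowercase().as_str() {
        "english" | "en" => Ok(Language::English),
        "spanish" | "es" => Ok(Language::Spanish),
        other => Err(format!("unsupported stopwords language: {other}")),
    }
}

fn main() {
    assert_eq!(parse_language("EN"), Ok(Language::English));
    assert_eq!(parse_language("Spanish"), Ok(Language::Spanish));
    assert!(parse_language("klingon").is_err());
}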


326-336: StopwordsSet design allows flexible configuration.

The struct design supporting both singular language and plural languages fields plus custom stopwords provides good flexibility. The use of BTreeSet for both languages and custom stopwords ensures deterministic ordering and efficient lookups, addressing previous review suggestions.

The serialization skipping for empty collections maintains clean JSON output.


382-533: Comprehensive test coverage validates the implementation.

The test suite thoroughly covers:

  • Language and Set variant serialization/deserialization
  • ISO alias handling for individual languages and in sets
  • Error handling for unsupported languages
  • Various JSON format combinations

This addresses potential edge cases and ensures the API works as expected.

@n0x29a n0x29a requested review from generall and coszio June 9, 2025 06:39
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
lib/api/src/grpc/conversions.rs (1)

47-53: Addresses previous gRPC stopwords support concern.

The addition of Language and StopwordsSet imports and the comprehensive conversion implementations in this file successfully address the previous review comment about incomplete gRPC stopwords support. gRPC clients can now fully configure stopwords through the API.

🧹 Nitpick comments (2)
docs/grpc/docs.md (2)

73-73: Fix TOC indentation for StopwordsSet entry.
The bullet for StopwordsSet is indented with 4 spaces but the surrounding list uses 2-space indentation.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

73-73: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


100-100: Fix TOC indentation for Language entry.
The bullet for Language is indented with 4 spaces but the surrounding list uses 2-space indentation.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

100-100: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73fbf5f and 26f5181.

📒 Files selected for processing (4)
  • docs/grpc/docs.md (5 hunks)
  • lib/api/src/grpc/conversions.rs (5 hunks)
  • lib/api/src/grpc/proto/collections.proto (1 hunks)
  • lib/api/src/grpc/qdrant.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/api/src/grpc/proto/collections.proto
  • lib/api/src/grpc/qdrant.rs
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/grpc/docs.md

73-73: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


100-100: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-low-resources
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: storage-compat-test
  • GitHub Check: integration-tests-consensus
  • GitHub Check: lint
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (6)
lib/api/src/grpc/conversions.rs (3)

147-225: Excellent implementation of Language enum conversions.

The bidirectional conversion between internal and gRPC Language enums is well-implemented with:

  • Complete mapping of all language variants
  • Proper error handling for the UnspecifiedLanguage variant
  • Clear and descriptive error messages
  • Consistent with Rust conversion patterns

298-320: Well-implemented stopwords conversion from internal to gRPC types.

The conversion correctly handles both StopwordsInterface variants:

  • Language variant maps to single-language StopwordsSet
  • Set variant properly extracts primary language and custom words
  • All data is preserved during conversion

521-567: Comprehensive stopwords conversion with good optimization for common cases.

The conversion logic correctly handles all stopwords configuration scenarios:

  • Empty configurations optimized to None
  • Single-language configurations optimized to Language variant
  • Multi-language/custom configurations use Set variant with proper primary/secondary language handling

The comment about primary language being available for future lemmatizer use is helpful context.
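
The normalization described above (empty configuration → None, a single language with no custom words → Language, everything else → Set) can be sketched roughly like this; the simplified types and the helper name are invented for illustration:

#[derive(Debug, PartialEq)]
enum Language {
    English,
    Spanish,
}

#[derive(Debug, PartialEq)]
enum StopwordsInterface {
    Language(Language),
    Set { languages: Vec<Language>, custom: Vec<String> },
}

// Collapse a gRPC-style "languages + custom" pair into the simplest internal variant.
fn normalize(mut languages: Vec<Language>, custom: Vec<String>) -> Option<StopwordsInterface> {
    match (languages.len(), custom.is_empty()) {
        (0, true) => None,                                                     // nothing configured
        (1, true) => Some(StopwordsInterface::Language(languages.remove(0))),  // single language
        _ => Some(StopwordsInterface::Set { languages, custom }),              // general case
    }
}

fn main() {
    assert_eq!(normalize(vec![], vec![]), None);
    assert_eq!(
        normalize(vec![Language::English], vec![]),
        Some(StopwordsInterface::Language(Language::English))
    );
    assert!(matches!(
        normalize(vec![Language::English, Language::Spanish], vec!["foo".into()]),
        Some(StopwordsInterface::Set { .. })
    ));
}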

docs/grpc/docs.md (3)

1465-1475: Approve the StopwordsSet documentation.
The new StopwordsSet section follows the established pattern for message definitions, with a clear table of fields for languages and custom.


1618-1618: Approve stopwords field in TextIndexParams.
The optional stopwords field in TextIndexParams is correctly introduced with the appropriate link and description.


1911-1918: Approve the Language enum addition.
The Language enum and its anchor are correctly added, enumerating supported languages for stopword filtering.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
lib/segment/src/data_types/index.rs (1)

217-280: Address past review feedback about UnspecifiedLanguage variant.

Based on past review comments, there was consensus to remove the UnspecifiedLanguage variant from internal data structures and only keep it in the gRPC API layer. The current implementation still includes this variant in the internal Language enum.

Consider removing UnspecifiedLanguage from this internal enum and handling the conversion at the gRPC boundary as previously discussed.

🧹 Nitpick comments (1)
docs/grpc/docs.md (1)

73-73: Fix unordered list indentation
Adjust the indentation to match the rest of the Table of Contents (2 spaces).
Apply this diff:

-    - [StopwordsSet](#qdrant-StopwordsSet)
+  - [StopwordsSet](#qdrant-StopwordsSet)
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

73-73: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26f5181 and 996bcb1.

📒 Files selected for processing (8)
  • docs/grpc/docs.md (3 hunks)
  • docs/redoc/master/openapi.json (2 hunks)
  • lib/api/src/grpc/conversions.rs (5 hunks)
  • lib/api/src/grpc/proto/collections.proto (1 hunks)
  • lib/api/src/grpc/qdrant.rs (2 hunks)
  • lib/segment/src/data_types/index.rs (3 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers.rs (6 hunks)
✅ Files skipped from review due to trivial changes (3)
  • lib/api/src/grpc/proto/collections.proto
  • lib/api/src/grpc/qdrant.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/redoc/master/openapi.json
  • lib/api/src/grpc/conversions.rs
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/grpc/docs.md

73-73: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-low-resources
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: lint
  • GitHub Check: integration-tests-consensus
  • GitHub Check: integration-tests
🔇 Additional comments (13)
docs/grpc/docs.md (2)

1464-1478: Document the new StopwordsSet message
The StopwordsSet anchor, heading, and property table are correctly defined and align with the added protobuf schema.


1617-1617: Add stopwords to TextIndexParams
The optional stopwords field is now documented alongside other text-index parameters.

lib/segment/src/index/field_index/full_text_index/tokenizers.rs (5)

4-4: LGTM: Import addition for stopword filtering.

The import of StopwordsFilter is correctly placed and necessary for the new stopword filtering functionality.


85-86: LGTM: Well-designed stopword filtering integration.

The stopword filtering is correctly integrated into the token processing pipeline:

  • Filter creation uses appropriate configuration parameters
  • Filtering occurs after length validation but before case normalization
  • Early return pattern efficiently skips stopwords

This ensures consistent behavior across different tokenizer configurations.

Also applies to: 104-106


148-151: LGTM: Necessary test imports.

The additional imports are required for the comprehensive stopword filtering test cases.


270-270: LGTM: Backward compatibility maintained.

Adding stopwords: None to the existing test ensures backward compatibility with the new optional field.


285-558: LGTM: Comprehensive test coverage for stopword filtering.

The test suite thoroughly validates stopword filtering functionality across multiple scenarios:

  • Language-specific stopwords (English, Spanish combinations)
  • Custom stopword sets
  • Mixed language and custom configurations
  • Case sensitivity handling
  • Edge cases with punctuation and contractions

Each test properly validates both stopword removal and preservation of non-stopwords, ensuring robust functionality.

lib/segment/src/data_types/index.rs (6)

1-4: LGTM: Necessary imports for stopword functionality.

The additional imports support the new data structures and custom trait implementations.


206-208: LGTM: Clean addition of stopwords field.

The optional stopwords field is properly integrated into TextIndexParams with appropriate serde annotations for backward compatibility.


210-215: LGTM: Flexible interface design.

The untagged StopwordsInterface enum provides an intuitive API allowing both simple language strings and complex configurations.


282-326: LGTM: Robust custom deserialization.

The custom deserialization implementation provides excellent user experience:

  • Case-insensitive language matching
  • Support for both full names and ISO codes
  • Clear error messages for unsupported languages
  • Comprehensive coverage of supported language variants

328-341: LGTM: Consistent trait implementations.

The Display and FromStr implementations cleverly reuse the JSON serialization logic, ensuring consistency with the serde configuration.


343-376: Consider field naming consistency.

The StopwordsSet structure only defines a languages field, but some tests (e.g., line 473) suggest that a language field should also work. Based on past review comments, there was discussion about supporting both singular and plural forms.

The current implementation with custom deserialization handles single/multiple values but doesn't address the field naming flexibility that was previously discussed.

@coderabbitai coderabbitai bot left a comment (Contributor)

Actionable comments posted: 0

🔭 Outside diff range comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)

301-312: 🛠️ Refactor suggestion

⚠️ Potential issue

Unit test fails due to unexpected lower-casing

test_word_tokenizer invokes the tokenizer with lowercase = true, but the expected token list contains "Привет" (capital ‘П’).
The tokenizer lower-cases the token to "привет", causing the assertion to fail.

Fix either the call or the expectation; the minimal change is to run the tokenizer in case-sensitive mode:

-        WordTokenizer::tokenize(text, true, &StopwordsFilter::default(), |token| {
+        WordTokenizer::tokenize(text, false, &StopwordsFilter::default(), |token| {
♻️ Duplicate comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)

207-218: Multi-byte length checks still use logical AND – bug persists

The doc_token_filter keeps tokens when both the byte length and the char count satisfy the limit.
For UTF-8 strings this lets single-character multi-byte tokens slip through the min_token_len filter (and vice-versa for max_token_len).

This exact issue was raised in a previous review and hasn’t been fixed.

- .map(|min_len| token.len() < min_len && token.chars().count() < min_len)
+ .map(|min_len| token.len() < min_len || token.chars().count() < min_len)
...
- .map(|max_len| token.len() > max_len && token.chars().count() > max_len)
+ .map(|max_len| token.len() > max_len || token.chars().count() > max_len)
🧹 Nitpick comments (3)
lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs (1)

107-112: Use .. to future-proof the struct pattern

The test had to be updated solely because a new tokenizer field was added to MutableFullTextIndex.
Capturing only the fields you actually need and ignoring the rest with .. keeps the pattern resilient to further internal changes and avoids extra churn.

-                let MutableFullTextIndex {
-                    storage,
-                    inverted_index: _,
-                    config,
-                    tokenizer: _,
-                } = index;
+                let MutableFullTextIndex {
+                    storage,
+                    config,
+                    ..
+                } = index;
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (2)

248-250: min_token_len / max_token_len conflated with n-gram size

PrefixTokenizer::tokenize is called with self.min_token_len & self.max_token_len as the ngram range.
These fields are documented/used elsewhere as whole-token length limits, so re-using them here overloads semantics and may break callers that set token-length limits unrelated to prefix size.

Consider adding dedicated min_prefix / max_prefix parameters in TextIndexParams to avoid semantic leakage.
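
A minimal sketch of what such dedicated parameters could look like; the min_ngram/max_ngram (or min_prefix/max_prefix) field names are hypothetical and not part of the current TextIndexParams:

// Hypothetical parameters separating whole-token length limits from prefix n-gram bounds.
struct PrefixTokenizerParams {
    /// Whole-token length limits, applied before any prefix expansion.
    min_token_len: Option<usize>,
    max_token_len: Option<usize>,
    /// Dedicated prefix n-gram bounds, so changing token-length limits
    /// no longer changes which prefixes get indexed.
    min_ngram: usize,
    max_ngram: usize,
}

impl Default for PrefixTokenizerParams {
    fn default() -> Self {
        Self {
            min_token_len: None,
            max_token_len: None,
            min_ngram: 1,
            max_ngram: 4,
        }
    }
}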


89-97: Quadratic scanning in prefix generation

Inside every n loop you recreate the iterator and traverse char_indices() up to n (nth(n)), resulting in O(L²) work per word.
Pre-computing the offsets once avoids the repeated walk:

-                for n in min_ngram..=max_ngram {
-                    let ngram = word_cow.char_indices().map(|(i, _)| i).nth(n);
+                let offsets: Vec<usize> = word_cow.char_indices().map(|(i, _)| i).collect();
+                for n in min_ngram..=max_ngram {
+                    let ngram = offsets.get(n).copied();

This keeps complexity at O(L) and benefits long tokens.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31fa52b and 7ded9a5.

📒 Files selected for processing (46)
  • .github/workflows/codespell.yml (1 hunks)
  • docs/grpc/docs.md (3 hunks)
  • docs/redoc/master/openapi.json (2 hunks)
  • lib/api/src/grpc/conversions.rs (6 hunks)
  • lib/api/src/grpc/proto/collections.proto (1 hunks)
  • lib/api/src/grpc/qdrant.rs (2 hunks)
  • lib/segment/src/data_types/index.rs (3 hunks)
  • lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (3 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (5 hunks)
  • lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (5 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/README.md (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/arabic.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/azerbaijani.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/basque.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/bengali.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/chinese.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/danish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/dutch.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/english.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/finnish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/french.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/german.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/greek.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/hebrew.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/hinglish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/hungarian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/indonesian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/italian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/kazakh.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/nepali.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/norwegian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/portuguese.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/romanian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/russian.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/slovene.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/spanish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/swedish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/tajik.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/turkish.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/text_index.rs (7 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers.rs (13 hunks)
✅ Files skipped from review due to trivial changes (3)
  • lib/segment/src/index/field_index/full_text_index/stop_words/nepali.rs
  • lib/api/src/grpc/qdrant.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/portuguese.rs
🚧 Files skipped from review as they are similar to previous changes (40)
  • lib/segment/src/index/field_index/full_text_index/tests/mod.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/README.md
  • .github/workflows/codespell.yml
  • lib/segment/src/index/field_index/full_text_index/stop_words/turkish.rs
  • lib/segment/src/index/field_index/full_text_index/mod.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/dutch.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/basque.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/german.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/french.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/azerbaijani.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/danish.rs
  • lib/api/src/grpc/proto/collections.proto
  • lib/segment/src/index/field_index/full_text_index/stop_words/hebrew.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/bengali.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/english.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/hungarian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/indonesian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/tajik.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/romanian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/italian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/spanish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/hinglish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/arabic.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/finnish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/slovene.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/swedish.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/chinese.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/norwegian.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/kazakh.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/greek.rs
  • lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
  • lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs
  • lib/api/src/grpc/conversions.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/russian.rs
  • lib/segment/src/index/field_index/full_text_index/text_index.rs
  • docs/redoc/master/openapi.json
  • lib/segment/src/data_types/index.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (3)
lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (1)
  • new (93-103)
lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1)
  • new (71-103)
lib/segment/src/data_types/index.rs (2)
  • new_custom (219-224)
  • new_set (232-237)
🪛 markdownlint-cli2 (0.17.2)
docs/grpc/docs.md

73-73: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: integration-tests
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
🔇 Additional comments (3)
docs/grpc/docs.md (3)

73-73: Approve: TOC updated with StopwordsSet

The new StopwordsSet entry is correctly inserted in the Table of Contents and follows the existing indentation and ordering.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

73-73: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


1464-1474: Approve: StopwordsSet documentation added

The StopwordsSet message definition and its field descriptions align with the style of other message docs and clearly explain both languages and custom stopword options.


1617-1617: Approve: TextIndexParams extended with stopwords

The optional stopwords field has been added to TextIndexParams with a clear description, enabling users to configure stopword filtering in the text index.

@coderabbitai coderabbitai bot left a comment (Contributor)

Actionable comments posted: 0

♻️ Duplicate comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)

207-218: ⚠️ Potential issue

Fix logical error in min_token_len / max_token_len checks – use ||, not &&

Using a logical AND means the filter is triggered only when both the byte‐length and the char count violate the limit.
For multi-byte characters this allows out-of-spec tokens to slip through (e.g. "é" has len() == 2, chars().count() == 1).

-                .map(|min_len| token.len() < min_len && token.chars().count() < min_len)
+                .map(|min_len| token.len() < min_len || token.chars().count() < min_len)
-                .map(|max_len| token.len() > max_len && token.chars().count() > max_len)
+                .map(|max_len| token.len() > max_len || token.chars().count() > max_len)

This enforces the bounds consistently for both ASCII and UTF-8 multibyte tokens.

🧹 Nitpick comments (2)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (2)

89-98: Avoid O(n²) behaviour when generating n-grams

char_indices() is recomputed for every n in the loop, walking the word repeatedly.
Pre-compute the char boundaries once and reuse them – this cuts the complexity from O(L·N) to O(L + N) and allocates at most one small Vec.

Pseudocode sketch:

let boundaries: Vec<usize> = word_cow.char_indices().map(|(i, _)| i).collect();
for idx in min_ngram..=max_ngram {
    if let Some(&end) = boundaries.get(idx) {
        callback(&word_cow[..end]);
    } else {
        callback(&word_cow);
        break;
    }
}

45-56: Keep contractions intact in WordTokenizer

Splitting on every non-alphanumeric char breaks common contractions (i'd, t'ha) into multiple tokens, but your tests treat them as single units and stopwords can contain apostrophes.

-        for token in text.split(|c| !char::is_alphanumeric(c)) {
+        for token in text.split(|c| !(char::is_alphanumeric(c) || c == '\'')) {

This preserves English/Italian contractions without affecting other language tokens.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ded9a5 and af89175.

📒 Files selected for processing (2)
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers.rs (12 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: test-low-resources
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: lint
🔇 Additional comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)

22-31: Lower-casing before the stopword check is now consistent – good job

The previous double-handling bug is gone; the filter and probe are both case-normalised once.
Implementation looks correct.

@coderabbitai coderabbitai bot left a comment (Contributor)

Actionable comments posted: 1

🧹 Nitpick comments (2)
lib/segment/src/index/field_index/full_text_index/text_index.rs (2)

161-167: Method is fine but could be simplified with as_ref()

The get_tokenizer helper works, yet the triple-match cascades duplicate logic already present in many other helpers.
A tiny ergonomic win is to rely on as_ref() on self and pattern-match only once, e.g.:

-    fn get_tokenizer(&self) -> &Tokenizer {
-        match self {
-            Self::Mutable(index) => &index.tokenizer,
-            Self::Immutable(index) => &index.tokenizer,
-            Self::Mmap(index) => &index.tokenizer,
-        }
-    }
+    fn get_tokenizer(&self) -> &Tokenizer {
+        match self {
+            FullTextIndex::Mutable(idx)
+            | FullTextIndex::Immutable(idx) => &idx.tokenizer,
+            FullTextIndex::Mmap(idx) => &idx.tokenizer,
+        }
+    }

Not critical, just trims three identical arms.


245-276: Keep test-only helpers private to tests

serialize_token_set / deserialize_token_set are pub(super) but only used inside the #[cfg(test)] module.
To avoid the symbol leaking in non-test builds you can scope them to the tests module:

-#[cfg(test)]
-pub(super) fn serialize_token_set(...)
+#[cfg(test)]
+fn serialize_token_set(...)

Same for deserialize_token_set.
Minor hygiene, no functional impact.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af89175 and 4144f75.

📒 Files selected for processing (3)
  • lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (3 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (5 hunks)
  • lib/segment/src/index/field_index/full_text_index/text_index.rs (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs
  • lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: lint
🔇 Additional comments (1)
lib/segment/src/index/field_index/full_text_index/text_index.rs (1)

323-347: Good switch to instance tokenizer

Replacing the static Tokenizer::tokenize_* calls with the instance obtained via get_tokenizer() correctly respects per-index stop-word configuration. ✔️

@coderabbitai coderabbitai bot left a comment (Contributor)

Actionable comments posted: 1

♻️ Duplicate comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)

207-218: ⚠️ Potential issue

Incorrect logical operator lets too-short / too-long tokens slip through

Both min- and max-length checks use logical AND on byte- and char-lengths, meaning the filter triggers only when both representations violate the bound.
For UTF-8 multibyte tokens this is wrong: a 1-character, 4-byte emoji ("🦀") with min_token_len = 2 is shorter in chars but longer in bytes, so it bypasses the guard.

- .map(|min_len| token.len() < min_len && token.chars().count() < min_len)
+ .map(|min_len| token.len() < min_len || token.chars().count() < min_len)
...
- .map(|max_len| token.len() > max_len && token.chars().count() > max_len)
+ .map(|max_len| token.len() > max_len || token.chars().count() > max_len)

Same change for the max-length branch guarantees the predicate blocks a token as soon as either representation violates the configured limits.
(This was flagged previously and is still unfixed.)

🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)

89-98: O(n²) byte-walk in prefix n-gram loop

Inside each word you recompute char_indices() for every n in the range, resulting in quadratic work for long tokens.
Capture the boundary offsets once and reuse them:

let boundaries: Vec<usize> = word_cow.char_indices().map(|(i, _)| i).collect();
for n in min_ngram..=max_ngram {
    match boundaries.get(n) {
        Some(&end) => callback(&word_cow[..end]),
        None => { callback(&word_cow); break; }
    }
}

Minor, but it keeps the tokenizer linear in input size.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4144f75 and 64c322e.

📒 Files selected for processing (3)
  • lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers.rs (12 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/segment/src/index/field_index/full_text_index/stop_words/catalan.rs
  • lib/segment/src/index/field_index/full_text_index/stop_words/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: lint
  • GitHub Check: storage-compat-test

@coderabbitai coderabbitai bot left a comment (Contributor)

Actionable comments posted: 1

♻️ Duplicate comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)

207-218: ⚠️ Potential issue

Length-filter logic still wrong for multi-byte tokens

The predicate still uses logical AND for both byte and char length checks, letting UTF-8 tokens slip through the limits (e.g. "é" passes min_len = 2). Switch to OR so a violation on either representation rejects the token.

-            if self
-                .min_token_len
-                .map(|min_len| token.len() < min_len && token.chars().count() < min_len)
+            if self
+                .min_token_len
+                .map(|min_len| token.len() < min_len || token.chars().count() < min_len)
...
-            if self
-                .max_token_len
-                .map(|max_len| token.len() > max_len && token.chars().count() > max_len)
+            if self
+                .max_token_len
+                .map(|max_len| token.len() > max_len || token.chars().count() > max_len)

This is the same issue flagged in a prior review and is still unresolved.

🧹 Nitpick comments (2)
lib/segment/src/index/field_index/full_text_index/tokenizers.rs (2)

89-98: char_indices() recomputed inside inner loop – quadratic cost

word_cow.char_indices() is rebuilt for every n in min_ngram..=max_ngram, turning the prefix generation into O(L²). Cache the char offsets once:

let boundaries: Vec<usize> = word_cow.char_indices().map(|(i, _)| i).collect();
for n in min_ngram..=max_ngram {
    match boundaries.get(n) {
        Some(&end) => callback(&word_cow[..end]),
        None => {
            callback(&word_cow);
            break;
        }
    }
}

This reduces the per-word work to O(L).


22-27: Avoid needless allocations when lower-casing

token.to_lowercase() allocates a new String for every token even when the token is already lowercase ASCII. Consider using to_ascii_lowercase() for ASCII-only fast-path and only fall back to the full Unicode case-fold when necessary, or defer allocation until a case change is actually detected.

Also applies to: 50-55
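
One possible shape for that fast path, returning a borrowed token when no case change is needed (a sketch, not the current implementation):

use std::borrow::Cow;

/// Lowercase a token only when it actually contains uppercase characters,
/// avoiding a String allocation for tokens that are already lowercase.
fn lowercase_if_needed(token: &str) -> Cow<'_, str> {
    if token.chars().any(char::is_uppercase) {
        Cow::Owned(token.to_lowercase())
    } else {
        Cow::Borrowed(token)
    }
}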

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c48efb and b853c71.

📒 Files selected for processing (1)
  • lib/segment/src/index/field_index/full_text_index/tokenizers.rs (12 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: lint
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-consistency
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: storage-compat-test

Comment on lines +248 to 251
self.min_token_len.unwrap_or(1),
self.max_token_len.unwrap_or(usize::MAX),
token_filter,
),

🛠️ Refactor suggestion

min_token_len reused as n-gram lower bound – semantics clash

PrefixTokenizer::tokenize is called with self.min_token_len/max_token_len as the n-gram bounds, overloading a field that conceptually limits whole-token size. A document-level length limit need not equal prefix size requirements and may surprise users. Introduce dedicated min_ngram / max_ngram fields or pass explicit constants instead.

🤖 Prompt for AI Agents
In lib/segment/src/index/field_index/full_text_index/tokenizers.rs around lines
248 to 251, the code incorrectly reuses min_token_len and max_token_len as
n-gram bounds in PrefixTokenizer::tokenize, causing a semantic mismatch between
whole-token length limits and n-gram size. To fix this, introduce new dedicated
fields such as min_ngram and max_ngram for n-gram bounds or pass explicit
constant values for n-gram size limits instead of reusing min_token_len and
max_token_len.

impl FromStr for Language {
    type Err = serde_json::Error;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
Member

Note that we already have strum in our dependency tree.

I agree that this is a bit hacky, but I like that it ensures (de)serialization is compatible.
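
For comparison, a sketch of the same mapping expressed with strum derives instead of round-tripping through serde_json (assumes the strum crate with its derive feature; the variant list is illustrative):

use strum::{Display, EnumString};

#[derive(Debug, Clone, Copy, Display, EnumString)]
#[strum(ascii_case_insensitive)]
enum Language {
    #[strum(to_string = "english", serialize = "en")]
    English,
    #[strum(to_string = "spanish", serialize = "es")]
    Spanish,
}

fn main() {
    // Accepts full names and ISO-style aliases, case-insensitively.
    let lang: Language = "EN".parse().unwrap();
    assert_eq!(lang.to_string(), "english");
}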

Comment on lines +18 to +21
if token.is_empty() {
    continue;
}

Member

According to the docs, this already eliminates empty results.

https://doc.rust-lang.org/std/primitive.str.html#method.split_whitespace

Example: https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=89b6012aff7a45372cf8e3dfd7f76c62

Suggested change
- if token.is_empty() {
-     continue;
- }

Or if we want to be safe:

Suggested change
- if token.is_empty() {
-     continue;
- }
+ debug_assert!(!token.is_empty());

@generall generall merged commit 5dfe4b1 into dev Jun 16, 2025
18 checks passed
@generall generall deleted the stop-words branch June 16, 2025 10:48
@generall generall added this to the Stopwords milestone Jul 17, 2025
generall added a commit that referenced this pull request Jul 17, 2025
* Add stopwords support

* Update codespell configuration and improve stopword assertion message

* Add stopwords schema to OpenAPI definition

* Rename StopwordsOption to StopwordsInterface

* Add stopwords support for multiple languages

* Update codespell configuration to skip all Rust files in stop_words directory

* Normalize stopwords to lowercase during insertion

* Add language aliases for stopwords

* + grpc

* Unspecified Language

* clippy fmt

* openapi

* Refactor to use BTreeSet

* update tokenizer test to include custom stopwords with contractions

* simplified grpc StopwordsInterface

* alias for languages

* use `lowercase` flag for stop word normalizing

* grpc: enum -> string

* consistent naming

* Remove UnspecifiedLanguage and use From for conversions

* fix conversions

* fmt

* use stopwords inside different tokenizer

* use ahash

* fix tests

* fix no-rocks-db feature

* simplify code and fix test

* update test

* fix clippy

* fmt

---------

Co-authored-by: generall <andrey@vasnetsov.com>
This was referenced Jul 21, 2025