Conversation

@JojiiOfficial (Contributor) commented Jul 25, 2025

Supersedes #6893 and #6904
Depends on #6891
Implements BM25 by parsing the inference object's options.

Request example

A possible request could look like this:

{
  "query": {
    "model": "qdrant/bm25",  // Must be set but value can be arbitrary.
    "text": "Some input to embedd with bm25",
    "options": {
      "use_bm25": true,        // Required to parse this options as BM25 configuration.
      "k": 1.0,                // Optional hyperparameters. Same for `b` and `avg_len`.
      "tokenizer": "word",     // Optional 
      "lowercase": true       // Optional
    }
  },
  "using": "sparse"
}
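Internally, these options are deserialized into a BM25 configuration. A minimal sketch of that idea, assuming serde derives; the type, field, and variant names below are illustrative and mirror the request example, not the PR's actual definitions:

use serde::Deserialize;
use serde_json::Value;

// Illustrative only: the variants match the tokenizers named in the
// error message under "Error handling" below.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
enum Tokenizer {
    Prefix,
    Whitespace,
    Word,
    Multilingual,
}

#[derive(Debug, Deserialize)]
struct Bm25Config {
    use_bm25: bool,
    k: Option<f32>,
    b: Option<f32>,
    avg_len: Option<f32>,
    tokenizer: Option<Tokenizer>,
    lowercase: Option<bool>,
}

fn parse_bm25(options: &Value) -> Result<Bm25Config, serde_json::Error> {
    // An unknown tokenizer such as "ord" fails deserialization with an
    // "unknown variant" error like the one shown below.
    serde_json::from_value(options.clone())
}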

Error handling

If the option "use_bm25" is enabled but the configuration is invalid, a detailed error is returned:

{
  "error": "Wrong input: Invalid BM25 config: Error(\"unknown variant `ord`, expected one of `prefix`, `whitespace`, `word`, `multilingual`\", line: 0, column: 0)"
}

Test coverage

There are also tests for the new BM25 and remote-inference logic that cover almost all of infer() and >97% of bm25.rs. "Almost", because error-returning paths are not covered.
These tests mock an inference server and ensure that we correctly merge BM25 and remote items together.

.collect();

// Create an HTTP mock
let mock = server
Contributor Author (@JojiiOfficial):

Mocking an inference server here to fully end-to-end test the logic of this PR.
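A hedged sketch of that pattern, assuming the httpmock crate (plus reqwest with the blocking and json features for the client side); the endpoint path and response shape are illustrative, not the PR's actual test code:

use httpmock::prelude::*;
use serde_json::json;

#[test]
fn inference_server_can_be_mocked() {
    // Stand-in for the remote inference service.
    let server = MockServer::start();

    // Any POST to this (illustrative) path returns a fixed embedding.
    let embed_mock = server.mock(|when, then| {
        when.method(POST).path("/embed");
        then.status(200)
            .json_body(json!({ "embeddings": [[0.1, 0.2, 0.3]] }));
    });

    // A real test would point the inference client at server.base_url(),
    // run infer(), and assert that locally computed BM25 vectors and the
    // mocked remote results are merged in the right order.
    let response = reqwest::blocking::Client::new()
        .post(server.url("/embed"))
        .json(&json!({ "text": "some input" }))
        .send()
        .unwrap();

    assert_eq!(response.status(), 200);
    embed_mock.assert(); // the mocked endpoint was hit exactly once
}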

return Ok(None);
};

if options.get("use_bm25") != Some(&Value::Bool(true))
Member:

Why is this here, and why does it have hard-coded strings?

Contributor Author (@JojiiOfficial, Jul 30, 2025):

You're right, it should be a constant instead of being hard-coded inline.

I think it's better to explicitly state that we want to use BM25, instead of "guessing" and always trying to parse the options into Bm25Config. That's why we have this additional check when reading the BM25 configuration from the input's options.

Alternatively, we could rename it to something like "custom_model" or "local_model" and check the value against "bm25". That would make it more extensible for other 'local' model implementations in the future.
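A minimal sketch of the constant-based variant discussed here; the constant and function names are illustrative:

use serde_json::{Map, Value};

// Illustrative constant replacing the hard-coded string.
const USE_BM25_KEY: &str = "use_bm25";

/// Returns true only if the options explicitly opt in to BM25.
fn is_bm25_requested(options: &Map<String, Value>) -> bool {
    options.get(USE_BM25_KEY) == Some(&Value::Bool(true))
}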

return VectorPersisted::empty_sparse();
}

let indices: Vec<u32> = tokens
Member:

DimId?

Contributor Author (@JojiiOfficial):

We don't have access to DimId and DimWeight in src/common, and I don't think we should add a dependency just for a type alias.
If you prefer having some sort of alias here, I can add a new one, but I think the semantics are already clear from the variable names and the function's context.

The same applies to the other place in line 103.
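For illustration, a purely local alias as discussed would be a two-liner; this assumes the conventional definitions rather than importing them:

// Hypothetical local aliases, mirroring the segment-side types.
type DimId = u32;
type DimWeight = f32;

// The quoted snippets would then read:
//   let indices: Vec<DimId> = ...;
//   let values: Vec<DimWeight> = vec![1.0; indices.len()];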

.unique()
.collect();

let values: Vec<f32> = vec![1.0; indices.len()];
Member:

DimWeight ?

return Ok(None);
};

let Some(local_model_field) = options.get(LOCAL_MODEL_KEY) else {
Member:

I still don't understand why this is needed

Contributor Author (@JojiiOfficial, Jul 30, 2025):

If we add a different local 'model' next to BM25 that also has tokenizer configs, and a user only specifies those (so no BM25-specific parameters), we can't distinguish which model should be used.
This requires the user to specify which local model to use.

We could use a dedicated model name, e.g. 'local/bm25', to specify which local model to use. However, I thought hard-coding a specific model name could be confusing, especially if something like 'local/bm25' works but 'bm25' throws an error about some 'address' not being found. If you prefer that approach though, I'm totally fine with changing it.
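The commit list below notes that an enum eventually replaced multiple constants ("use enum instead of multiple constants"); a minimal sketch of that shape, with illustrative names:

use serde_json::{Map, Value};

const LOCAL_MODEL_KEY: &str = "local_model";

// Illustrative enum for dispatching between local model implementations;
// new local models become additional variants.
#[derive(Debug, PartialEq)]
enum LocalModel {
    Bm25,
}

/// Reads which local model, if any, the options explicitly request.
fn parse_local_model(options: &Map<String, Value>) -> Option<LocalModel> {
    match options.get(LOCAL_MODEL_KEY)?.as_str()? {
        "bm25" => Some(LocalModel::Bm25),
        _ => None,
    }
}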

@JojiiOfficial force-pushed the bm25_execution_impl_2 branch from 0689e42 to 1bcb324 on July 30, 2025 15:56
@JojiiOfficial requested a review from generall on July 30, 2025 15:57
@JojiiOfficial force-pushed the bm25_execution_impl_2 branch from 0c38fa9 to 82a78e5 on August 1, 2025 07:12
@generall merged commit c9f3626 into uncouple_tokenizer_and_textindex_params on Aug 3, 2025
15 checks passed
@generall deleted the bm25_execution_impl_2 branch on August 3, 2025 16:41
generall pushed a commit that referenced this pull request Aug 3, 2025
* Add Bm25

* Execute BM25 if config available

* cargo format

* Add mocked tests for inference and bm25

* Properly apply InferenceType

* Fix tests

* Review remarks

* Review remarks

* Add overwritten model name fix again

* use enum instead of multiple constants

* ensure handling all fields of InferenceInput in infer_local

---------

Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
generall added a commit that referenced this pull request Aug 4, 2025
* Uncouple tokenizer and TextIndexParams

* Refactor token processing in `TokenizerConfig` (#6912)

* Refactor token processing in TokenizerConfig

* fix max length checking

* [Bm25] Execution implementation (#6939)

* Add Bm25

* Execute BM25 if config available

* cargo format

* Add mocked tests for inference and bm25

* Properly apply InferenceType

* Fix tests

* Review remarks

* Review remarks

* Add overwritten model name fix again

* use enum instead of multiple constants

* ensure handling all fields of InferenceInput in infer_local

---------

Co-authored-by: Luis Cossío <luis.cossio@outlook.com>

* review fixes

* fmt

* spell-check

* deduplicate code

---------

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>
Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
Co-authored-by: Andrey Vasnetsov <andrey@vasnetsov.com>
timvisee pushed a commit that referenced this pull request Aug 11, 2025 (same commit message as the Aug 4 commit above)