Skip to content

Conversation

generall
Copy link
Member

@generall generall commented Aug 13, 2025

  • Make stopwords, lowercase and stemmer enable by default.
  • Introduce language param, which is used to create default stopwords and stemmer

@generall generall changed the base branch from master to dev August 13, 2025 19:33
@generall generall marked this pull request as ready for review August 13, 2025 22:57
Comment on lines +111 to +128
let lowercase = lowercase.unwrap_or(true);

let language = language.unwrap_or_else(|| DEFAULT_LANGUAGE.to_string());

let stemmer = match stemmer {
None => Stemmer::try_default_from_language(&language),
Some(stemmer_algorithm) => Some(Stemmer::from_algorithm(&stemmer_algorithm)),
};

let stopwords_config = match stopwords {
None => {
// Try to create from the language
Language::from_str(&language)
.ok()
.map(StopwordsInterface::Language)
}
Some(stopwords_interface) => Some(stopwords_interface),
};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main change is here

fn hash<H: Hasher>(&self, state: &mut H) {
let options = match self {
DocumentOptions::Common(options) => Cow::Borrowed(options),
DocumentOptions::Bm25(bm25) => Cow::Owned(bm25.to_options()),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't expect Bm25 branch to ever be used, as we don't deserialize into Bm25

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

};

// Order of keys in the map should not affect the hash
let mut keys: Vec<_> = options.keys().collect();
Copy link
Contributor

@JojiiOfficial JojiiOfficial Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe just collect the options into a BTreeMap instead? I think this custom hashing approach might cause hard to detect bugs in future.
Alternatively we could just replace the initial HashMap with a BTreeMap, the APIs should be the same.

Copy link
Member

@timvisee timvisee Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this?

// Order of keys in the map should not affect the hash
let mut keys: Vec<(_, _)> = options.iter().collect();
keys.sort_by_key(|(key, _)| *key);
keys.hash(state);

I'm not too worried about it because it does not have to be portable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll merge this now so I can roll the release. We can handle this one separately if we like.

coderabbitai[bot]

This comment was marked as resolved.

@timvisee timvisee merged commit 4a3008e into dev Aug 14, 2025
16 checks passed
@timvisee timvisee deleted the fix-bm25-params branch August 14, 2025 12:23
timvisee pushed a commit that referenced this pull request Aug 14, 2025
* make bm25 parameters compatible with default fastembed params

* bm25 params in openapi schema

* fmt

* improve bm25 config deserialization

* more explicit docstring for disabling english
@timvisee timvisee mentioned this pull request Aug 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants