-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Fix bm25 params #7056
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bm25 params #7056
Conversation
let lowercase = lowercase.unwrap_or(true); | ||
|
||
let language = language.unwrap_or_else(|| DEFAULT_LANGUAGE.to_string()); | ||
|
||
let stemmer = match stemmer { | ||
None => Stemmer::try_default_from_language(&language), | ||
Some(stemmer_algorithm) => Some(Stemmer::from_algorithm(&stemmer_algorithm)), | ||
}; | ||
|
||
let stopwords_config = match stopwords { | ||
None => { | ||
// Try to create from the language | ||
Language::from_str(&language) | ||
.ok() | ||
.map(StopwordsInterface::Language) | ||
} | ||
Some(stopwords_interface) => Some(stopwords_interface), | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main change is here
fn hash<H: Hasher>(&self, state: &mut H) { | ||
let options = match self { | ||
DocumentOptions::Common(options) => Cow::Borrowed(options), | ||
DocumentOptions::Bm25(bm25) => Cow::Owned(bm25.to_options()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't expect Bm25
branch to ever be used, as we don't deserialize into Bm25
This comment was marked as resolved.
This comment was marked as resolved.
}; | ||
|
||
// Order of keys in the map should not affect the hash | ||
let mut keys: Vec<_> = options.keys().collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we maybe just collect the options into a BTreeMap
instead? I think this custom hashing approach might cause hard to detect bugs in future.
Alternatively we could just replace the initial HashMap
with a BTreeMap
, the APIs should be the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this?
// Order of keys in the map should not affect the hash
let mut keys: Vec<(_, _)> = options.iter().collect();
keys.sort_by_key(|(key, _)| *key);
keys.hash(state);
I'm not too worried about it because it does not have to be portable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll merge this now so I can roll the release. We can handle this one separately if we like.
* make bm25 parameters compatible with default fastembed params * bm25 params in openapi schema * fmt * improve bm25 config deserialization * more explicit docstring for disabling english
language
param, which is used to create default stopwords and stemmer