use indexed vectors instead of available points for IDF computation #6739
Fixes: #6735
Some details for the bug:
To compute IDF, we were using the number of available points as the total number of documents and the length of the posting list as the number of documents containing the token.
In our implementation, posting lists are immutable on delete (and in the case of the in-RAM index, they are re-created on load), so the two numbers passed into the IDF formula could be inconsistent with each other.
This PR uses the number of indexed vectors from the vector index instead, which is as immutable as the posting lists. So after deleting some points, we still get consistent scores.
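
To illustrate the inconsistency, here is a minimal sketch, not the actual implementation. It assumes a common BM25-style IDF formula, `ln((N - n + 0.5) / (n + 0.5) + 1)`; the function name `idf` and all the counts are hypothetical and chosen only for the example.

```python
import math


def idf(total_docs: int, docs_with_token: int) -> float:
    """BM25-style IDF (assumed for illustration, not taken from this PR)."""
    return math.log(
        (total_docs - docs_with_token + 0.5) / (docs_with_token + 0.5) + 1.0
    )


# Hypothetical segment: 10 indexed vectors, the token occurs in 8 of them.
indexed_vectors = 10
posting_list_len = 8

# Delete 6 points. The posting list is not shrunk on delete,
# but the number of available points drops.
deleted_points = 6
available_points = indexed_vectors - deleted_points

# Old behaviour: total docs (available points) can fall below the
# posting list length, so the two inputs no longer describe the same set.
idf_with_available_points = idf(available_points, posting_list_len)

# New behaviour: both inputs come from structures that ignore deletes.
idf_with_indexed_vectors = idf(indexed_vectors, posting_list_len)

print(f"available points: {idf_with_available_points:.3f}")
print(f"indexed vectors:  {idf_with_indexed_vectors:.3f}")
```

Under these assumptions, the first value is negative because the "total" is smaller than the number of documents with the token, while the second stays positive and does not change when points are deleted.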