-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Optimize Text payload matcher #6766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c4f4006
to
16ae35a
Compare
This comment was marked as resolved.
This comment was marked as resolved.
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
Outdated
Show resolved
Hide resolved
lib/segment/src/index/field_index/full_text_index/text_index.rs
Outdated
Show resolved
Hide resolved
8ca10ad
to
3a99f45
Compare
3a99f45
to
4f3e5ce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Benchmarks are controversial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your comment you mention we lazily compute a list of points 3 times. Could you point to the exact place where we do this? That is not immediately clear to me from this PR.
// Build the set intersection of point ids from the query tokens | ||
let points_for_token = full_text_index | ||
.filter_query(parsed_query.clone(), &hw_counter) | ||
.collect::<AHashSet<_>>(); | ||
|
||
Some(Box::new(move |point_id: PointOffsetType| { | ||
full_text_index.check_match(&parsed_query, point_id, &hw_counter) | ||
points_for_token.contains(&point_id) | ||
})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naive question: if we can do this here, can't we just do the same inside full_text_index.check_match(..)
?
1b0c81c
to
fced9bb
Compare
The optimization is working fine when a lot of points are being tested in the matcher. Otherwise, the cost of of preparing the intersection up front could be more expensive than the savings. To make the matter worse, we actually build the
Creating a single But making this change is not trivial so I will close the current approach as it will never work as long as |
The initial goal of this PR was to fix an over measurement of
payload_index_io_read
for theText
index.In practice, the final fix is a performance optimization decreasing the hardware counter as a side effect.
In my tests, the
payload_index_io_read
value forText
matching is almost 3x higher than what is reported by theread_bytes
of the process.In case of high cardinality
Text
matching, the search is not driven directly by the payload index.Meaning a potentially large number of points need to be tested against the payload index to check if the condition matches.
The current approach in
dev
lazily recomputes the set of points which match the query tokens for each point repeatedly.This set is a constant for all points tested.
This inflates the number of
payload_index_io_read
of ops artificially as the mmap slice is cached.The proposed fix is to pre-calculate the intersection of points for the query tokens to reuse it for each point test.
I was able to show locally that this change:
payload_index_io_read
to be much closer toread_bytes
The trade-off is to use more memory once to avoid repeated IO.
The set of points kept in memory is the intersection of the tokens posting lists, therefore it should not grow out of proportion when adding tokens to the query.
Future work
Phrase matching does not leverage the optimization yet.
Hopefully we get to follow a similar patterns where positions are handled properly.