-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Fix enormous memory usage in Distance Matrix API on high sample size #6640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c30253d
to
df24f78
Compare
df24f78
to
2beb70e
Compare
2beb70e
to
8e0cbf0
Compare
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Can you please share script to repro the bug/results? :)
@@ -2608,7 +2609,7 @@ impl From<JsonPath> for IsEmptyCondition { | |||
#[derive(Debug, Deserialize, Serialize, JsonSchema, Clone, PartialEq, Eq)] | |||
pub struct HasIdCondition { | |||
#[schemars(schema_with = "HashSet::<PointIdType>::json_schema")] | |||
pub has_id: AHashSet<PointIdType>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not always have an Arc
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the answer lies here:
/// Threshold determining when to use an `Arc` in `HasIdCondition` if the condition includes many points.
/// Since we're cloning filters quite a lot, using an Arc for larger conditions reduces risk of memory leaks
/// and potentially improves performance in some places.
const HAS_ID_CONDITION_ARC_THRESHOLD: usize = 1_000;
meaning, a balance between performance and memory usage.
Though, I'm also curious how always using an Arc behaves. @JojiiOfficial Would it be possible to benchmark it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a comparison between cloning raw+arc and creating an arc using Arc::new()
of the same hashset used in each iteration.
I suggest that we set the HAS_ID_CONDITION_ARC_THRESHOLD
to something around 60
(probably 64 due to power of 2 allocation logic of rust) to not introduce overhead in places where we might create a lot of small filters without much cloning. Otherwise we might introduce an unexpected performance regression.
Wdyt?
Arc(Arc<T>), | ||
} | ||
|
||
impl<T> MaybeArc<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nice to have would be to choose arc/no_arc automatically based on the size_of_value()
of the inner value as a constructor function of this type.
Based on the provided chart, and considering ExtendedPointId
is 24 bytes, the threshold could be around 1536 bytes (24 * 64)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size_of_value
doesn't work with HashSet though. When using MaybeArc
wrapped around HashMaps or Vec, size_of_value
would only return the stack size of the type, which is always constant regardless of the amount of items.
Therefore we can only apply this heuristic at each usage of MaybeArc
instead inside a constructor, sadly.
…6640) * Fix memory leak in distance API * Add tests
Fixes extremely high memory usage in Distance Matrix API on certain datasets with a large sample rate (>= ~50k).
Dev:

PR:
