Skip to content

Conversation

charleskorn
Copy link
Contributor

@charleskorn charleskorn commented Oct 11, 2024

What this PR does

This PR is an iteration on #8677. It changes the behaviour of instant vector selectors to not return the same FloatHistogram instance for successive points if they all select the same histogram.

This preserves the main optimisation of #8677 (avoiding creating copies of histograms in sum) without the complexity and associated bugs of needing to ensure that shared FloatHistogram instances are modified safely.

For example, I just found yet another issue related to this that was introduced in #9507, where shared FloatHistogram instances could be left in the unused portion of the histogram slice here and then corrupted if the HPoint slice was later used for a range vector selector.

I believe similar corruption could occur if a HPoint slice was used for an instant vector selector that returned the same FloatHistogram for successive points, and then that slice was reused for a range vector selector: the range vector selector would reuse the FloatHistogram instances when filling the buffer here and so all points would have the same value as the last histogram, rather than their expected values.

The performance impact of this change is minimal in our benchmarks: less than 1% degradation or less than 5% reduction in latency in all cases. In some cases things get slightly slower due to the extra copies created, and in some cases things get slightly faster due to the simpler logic. I think the simpler logic and far reduced likelihood of bugs make this change worthwhile.

In the real world, where lookback is more common, we'll only need to make copies (and therefore pay the price of those copies) if the same point is used for successive output samples, which would usually only happen if the query step is smaller than the scrape interval. Given most users run with 15s or 60s scrape intervals, and most queries tend to be either instant queries, I don't think this is likely to become a huge issue.

Which issue(s) this PR fixes or relates to

(none)

Checklist

  • Tests updated.
  • [n/a] Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • [n/a] about-versioning.md updated with experimental features.

@charleskorn charleskorn marked this pull request as ready for review October 11, 2024 02:28
@charleskorn charleskorn requested a review from a team as a code owner October 11, 2024 02:28
Copy link
Contributor

@jhesketh jhesketh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the rationale of this. I do wonder though if we can measure the real-world impact. Should we have a metric on number of copied samples? We can see if that rises too high if we should reconsider reimplementing such an optimisation down the road (although it may be a bigger risk to do later)

@charleskorn
Copy link
Contributor Author

Should we have a metric on number of copied samples?

We should be able to see this on profiles, including the continuous profiling we run on Mimir instances in Grafana Cloud.

@charleskorn charleskorn requested a review from jhesketh October 14, 2024 01:07
Copy link
Contributor

@jhesketh jhesketh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Are we adequately testing the series merging to catch the latest clear?

@charleskorn
Copy link
Contributor Author

Makes sense to me. Are we adequately testing the series merging to catch the latest clear?

I've added a test to ensure clear is called when we expect, and the existing tests should be sufficient to catch any other regressions.

@charleskorn charleskorn merged commit dd18bcb into main Oct 14, 2024
29 checks passed
@charleskorn charleskorn deleted the charleskorn/mqe-nh-simplification branch October 14, 2024 02:20
charleskorn added a commit that referenced this pull request Oct 14, 2024
* Fix another possible source of `FloatHistogram` corruption

* Add test
charleskorn added a commit that referenced this pull request Oct 14, 2024
* Fix another possible source of `FloatHistogram` corruption

* Add test
charleskorn added a commit that referenced this pull request Oct 14, 2024
* Fix another possible source of `FloatHistogram` corruption

* Add test
charleskorn added a commit that referenced this pull request Oct 14, 2024
* Fix another possible source of `FloatHistogram` corruption

* Add test
charleskorn added a commit that referenced this pull request Oct 14, 2024
* Fix another possible source of `FloatHistogram` corruption

* Add test
@jhesketh jhesketh mentioned this pull request Dec 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants