
Conversation

MichaHoffmann
Contributor

This PR extends the `histogram_fraction` function to also work with classic bucket histograms. This is beneficial because it allows expressions like `sum(increase(my_bucket{le="0.5"}[10m]))/sum(increase(my_total[10m]))` to be written without knowing the actual values of the `le` label, easing the transition to native histograms later on.
It also feels natural since `histogram_quantile` can also deal with classic histograms.

The code was cobbled together from the parts that let `histogram_quantile` work with buckets and native histograms, and from parts of the `HistogramFraction` functions adjusted for classic histograms. Being cobbled together like this, it involves a bit of code duplication that could probably be refactored. I wanted to open the PR already to see whether this is reasonable in the first place, though!
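For illustration, the rewrite hinted at above could look roughly like this (metric names are taken from the example; the exact invocation for classic histograms should be checked against the documentation added in this PR):

```promql
# Before: pinned to a concrete bucket boundary via the "le" label.
sum(increase(my_bucket{le="0.5"}[10m])) / sum(increase(my_total[10m]))

# After: the same fraction without referencing "le" values directly,
# analogous to how histogram_quantile is used with classic histograms.
histogram_fraction(-Inf, 0.5, sum by (le) (rate(my_bucket[10m])))
```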

@MichaHoffmann MichaHoffmann force-pushed the mhoffmann/histogram_fraction_for_bucket_histograms branch from aff63b6 to cebd907 Compare March 1, 2025 09:29
@SuperQ SuperQ requested a review from beorn7 March 1, 2025 09:30
@MichaHoffmann MichaHoffmann force-pushed the mhoffmann/histogram_fraction_for_bucket_histograms branch from cebd907 to d35d324 Compare March 1, 2025 09:31
@bwplotka
Member

bwplotka commented Mar 3, 2025

Nice! I believe the only question is -- do we (and users) accept the kind of errors you will get from calculating this from fixed buckets? E.g. if you only have le=0.1, le=10, and le=100 and you do histogram_fraction(1, 15, rate(http_request_duration_seconds_bucket[1h])), the errors will make the result pretty useless (no?). We don't have a good way to hint at which start and end make sense given the buckets. Also, does using the _bucket series make sense here? Will we use sum/count?

Anyway, I think it makes sense to extrapolate some usefulness here, with the right documentation and maybe errors if the given start and end do not make sense? 🤔 I'm sure @beorn7 has opinions on this already too.

@MichaHoffmann
Contributor Author

> Nice! I believe the only question is -- do we (and users) accept the kind of errors you will get from calculating this from fixed buckets? E.g. if you only have le=0.1, le=10, and le=100 and you do histogram_fraction(1, 15, rate(http_request_duration_seconds_bucket[1h])), the errors will make the result pretty useless (no?). We don't have a good way to hint at which start and end make sense given the buckets. Also, does using the _bucket series make sense here? Will we use sum/count?
>
> Anyway, I think it makes sense to extrapolate some usefulness here, with the right documentation and maybe errors if the given start and end do not make sense? 🤔 I'm sure @beorn7 has opinions on this already too.

We have to use the _bucket series to get the "le" label, at least. The calculation is essentially the same as for native histograms; it just depends more on "good" choices for bucket boundaries (I wonder if linear interpolation is the right choice, though!). If we agree that such a function would be valuable, then we should document that it is an approximation whenever the upper/lower parameters do not align with any bucket boundary - I think that is sensible!

@beorn7
Member

beorn7 commented Mar 4, 2025

Nitty-gritty technical thoughts and answers:

  • Linear interpolation makes the most sense, mostly for consistency, because we have used linear interpolation for quantile estimation with classic histograms forever. We have even decided to keep it that way for NHCB. Given how users can set all kinds of bucket layouts, it's hard to do something that is generally useful, so we might as well use the simplest method, which is at least easy to reason about. (Maybe a slightly better rationale is that one of the remaining use cases to prefer NHCB over standard exponential buckets are distributions that don't map nicely to the exponential buckets, and those might often be more-or-less linear distributions – like stuff that is already logarithmic (dB, star brightness, …) or "artificial" units that are designed to approach linear distributions ("scores" or "grades" of any kind: SpamAssassin, school grades, star ratings).)
  • It would complicate things a lot if we needed the count and sum series (because they are completely separate series with classic histograms). Luckily, we have the +Inf bucket, which is by definition the same as the count (note that this is different for native histograms, because NaN observations go into no bucket, but that's just a weird edge case). This also means that we require the +Inf bucket. The implementation in this PR already handles this correctly (by returning NaN if it isn't there), but obviously, this needs to be documented explicitly in the end.
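Sketched in Python (the PR itself is Go code; the function name and bucket representation here are invented for illustration), the scheme from the two bullets above — linear interpolation within a bucket, and the +Inf bucket required and doubling as the count — might look like:

```python
import math

def classic_histogram_fraction(lower, upper, buckets):
    """Estimate the fraction of observations falling into [lower, upper].

    `buckets` maps each "le" upper bound to its cumulative count, e.g.
    {0.1: 10, 10: 90, math.inf: 100}. The +Inf bucket is required: it
    doubles as the total count, so without it we return NaN.
    """
    if math.inf not in buckets or buckets[math.inf] == 0:
        return math.nan
    total = buckets[math.inf]
    bounds = sorted(buckets)

    def rank(x):
        # Cumulative count of observations <= x, interpolating linearly
        # inside the bucket that contains x. The lowest bucket is assumed
        # to start at 0 (i.e. non-negative observations).
        if x >= bounds[-1]:
            return total
        prev_bound, prev_count = 0.0, 0.0
        for b in bounds:
            if x <= b:
                if b == math.inf or b == prev_bound:
                    return buckets[b]
                frac = (x - prev_bound) / (b - prev_bound)
                return prev_count + frac * (buckets[b] - prev_count)
            prev_bound, prev_count = b, buckets[b]
        return total

    return (rank(upper) - rank(lower)) / total
```

With the coarse layout from the earlier comment (le=0.1, 10, 100), asking for the fraction in [1, 15] forces an interpolation across the wide 10–100 bucket, which is exactly where the large errors discussed above come from.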

@beorn7
Member

beorn7 commented Mar 4, 2025

My general thoughts:

tl;dr: If a sufficient number of people desire this sufficiently strongly, I guess we can add it. There is not much technical downside to adding it (the implementation is not very complicated; it doesn't add "yet another function" to PromQL, it just modifies the behavior of an existing function). I'm not convinced, though, that this is something that people should desire…

I haven't really bought into the two prime incentives for this yet:

  1. "easing the transition to native histograms later on": You still have to change the query when moving from classic histograms to native histograms. Yes, it is a smaller change. But you also have to change from whatever you have now to histogram_fraction, after merging this PR. So why not migrate to native histograms (could be NHCB) normally and change the query only once?
  2. "without knowing the actual values of the 'le' label": I see this more as a problem than a benefit. Given that classic histograms are usually very low-res and in particular only cover a limited range of observed value, not knowing what your le values are is a recipe for disaster. (This echos @bwplotka's concern above.)

histogram_fraction was introduced for native histograms because there is no way of accessing individual buckets, and because of the dynamic nature of the buckets. (Or in other words: The reason was not that histogram_fraction was considered a better interface than the direct access of buckets possible with classic histograms. More the contrary…)

@beorn7
Member

beorn7 commented Mar 4, 2025

Conclusion: If you left it to me to decide, I would say no. But whoever feels strongly that we should have this, please say so. You don't even have to convince me. You just have to state that you are convinced. If there is critical mass for this (and no objections against it otherwise), I will not oppose it.

@SuperQ
Member

SuperQ commented Mar 4, 2025

> So why not migrate to native histograms (could be NHCB) normally and change the query only once?

The problem is that this work is not "instant".

We have thousands of Prometheus instances for thousands of services. It's going to take us months to migrate and cross-coordinate tooling with teams. And we might need to roll out and roll back native histograms.

Having the option to pre-convert queries from classic to native allows us to stagger the changes across the various teams.

So, I feel fairly strongly that this is something very useful for us to support. It allows for flexibility forward and backward in mixed deployment environments.

@beorn7
Member

beorn7 commented Mar 5, 2025

> Having the option to pre-convert queries from classic to native allows us to stagger the changes across the various teams.

But the query is still different, isn't it? Or are you doing something like histogram_fraction(1, 15, rate({__name__=~"http_request_duration_seconds.*"}[1h])) ?

@beorn7
Member

beorn7 commented Mar 12, 2025

Straw poll on Slack got a few thumbs up and no thumbs down. So let's do this. I'll review this PR ASAP.

@MichaHoffmann could you update the documentation in the meantime? It should be part of this PR.

Member

@beorn7 beorn7 left a comment

Generally looks good. A bit more testing would be good, and also reducing the code duplication where it is easy. (But don't sweat it. I think a bit of duplication is better than convoluted contraptions to make the code as DRY as possible.)

And as said before, we need an update for the documentation.

@MichaHoffmann
Contributor Author

> Generally looks good. A bit more testing would be good, and also reducing the code duplication where it is easy. (But don't sweat it. I think a bit of duplication is better than convoluted contraptions to make the code as DRY as possible.)
>
> And as said before, we need an update for the documentation.

Thank you! I'll address the comments over the weekend!

@MichaHoffmann MichaHoffmann requested a review from beorn7 March 18, 2025 08:28
@MichaHoffmann MichaHoffmann force-pushed the mhoffmann/histogram_fraction_for_bucket_histograms branch from 1535c77 to 7ae848c Compare March 18, 2025 14:39
Member

@beorn7 beorn7 left a comment

Thank you very much.

Code looks good, let's just improve documentation a bit more as described in the comment.

@beorn7
Member

beorn7 commented Apr 10, 2025

@MichaHoffmann are you still working on this? As said, we are only missing some documentation improvements.

@MichaHoffmann
Contributor Author

> @MichaHoffmann are you still working on this? As said, we are only missing some documentation improvements.

Yeah! I will add your suggestions - I just haven't found much energy and time lately to do it! I'll get to it over the weekend.

This PR extends the histogram_fraction function to also work with classic bucket histograms. This is beneficial because it allows expressions like sum(increase(my_bucket{le="0.5"}[10m]))/sum(increase(my_total[10m])) to be written without knowing the actual values of the "le" label, easing the transition to native histograms later on.
It also feels natural since histogram_quantile also can deal with classic histograms.

Signed-off-by: Michael Hoffmann <mhoffmann@cloudflare.com>
@MichaHoffmann MichaHoffmann force-pushed the mhoffmann/histogram_fraction_for_bucket_histograms branch from 7ae848c to 61c0cf8 Compare April 13, 2025 12:03
* Add documentation and reduce code duplication
* Fix a bug in linear interpolation between bucket boundaries
* Add more PromQL tests

Signed-off-by: Michael Hoffmann <mhoffmann@cloudflare.com>
@MichaHoffmann MichaHoffmann force-pushed the mhoffmann/histogram_fraction_for_bucket_histograms branch from 61c0cf8 to 02cfc46 Compare April 13, 2025 12:06
@MichaHoffmann MichaHoffmann requested a review from beorn7 April 13, 2025 14:31
Member

@beorn7 beorn7 left a comment

Thanks.

What do you think about my refinement?

Co-authored-by: Björn Rabenstein <github@rabenste.in>
Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>
@MichaHoffmann
Contributor Author

> Thanks.
>
> What do you think about my refinement?

Yeah, reads well to me; I committed the suggestion!

@MichaHoffmann MichaHoffmann requested a review from beorn7 April 21, 2025 07:52
Member

@beorn7 beorn7 left a comment

Thank you.

@beorn7 beorn7 merged commit d6d9f97 into prometheus:main Apr 22, 2025
27 checks passed
charleskorn added a commit to grafana/mimir that referenced this pull request May 7, 2025
charleskorn added a commit to grafana/mimir that referenced this pull request May 16, 2025
* Upgrade mimir-prometheus

* Fix breaking change in `NewAPI`

* Sync upstream PromQL engine test cases

* Modify `histogram_stddev` and `histogram_stdvar` to match new behaviour introduced in prometheus/prometheus#16444

* Simplify code

* Remove outdated comment

This issue was fixed in prometheus/prometheus#15828.

* Remove unnecessary variable

* Rename operator and files

* Extract `histogram_quantile`-specific logic

* Add support for classic histograms in `histogram_fraction`

See prometheus/prometheus#16095

* Bring in changes from grafana/mimir-prometheus#871

* Invert build tag logic to match prometheus/prometheus#16436

* Sync query engine tests with upstream

* Fix breaking changes

* Regenerate OTLP code

4 participants