Skip to content

Conversation

krajorama
Copy link
Member

Fixes: #12984

For full explanation see the related issue. The le and quantile labels
are formatted as float with trailing .0 for whole number values when
native histograms is enabled, e.g. 10.0. This changes the resulting series
in Prometheus if previously we scraped the whole number itself, e.g. 10
over the text format.

Fixes: prometheus#12984

For full explanation see the related issue. The le and quantile labels
are formatted as float with trailing .0 for whole number values when
native histograms is enabled, e.g. 10.0. This changes the resulting series
in Prometheus if previously we scraped the whole number itself, e.g. 10
over the text format.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Copy link
Contributor

@matthiasr matthiasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add documentation how to relabel forward (so people can move towards the new behavior)? Or would we leave that to a feature flag that enables the normalization for all scrape formats pre-3.0?

@beorn7
Copy link
Member

beorn7 commented Oct 18, 2023

I think forward-relabeling would also be cool so that people could proactively prepare for switching on native histograms.

@beorn7
Copy link
Member

beorn7 commented Oct 19, 2023

I think forward-relabeling would also be cool so that people could proactively prepare for switching on native histograms.

I now have second thoughts on this. This section is already getting quite long, and it is only about the native histogram feature flag.

But the problem isn't even about native histograms directly, it's more a problem with the migration to OpenMetrics. So I would say we should have a migration document for OpenMetrics somewhere, and that's a place where we can be comprehensive about all the tricks you can do.

@krajorama
Copy link
Member Author

I think forward-relabeling would also be cool so that people could proactively prepare for switching on native histograms.

I now have second thoughts on this. This section is already getting quite long, and it is only about the native histogram feature flag.

But the problem isn't even about native histograms directly, it's more a problem with the migration to OpenMetrics. So I would say we should have a migration document for OpenMetrics somewhere, and that's a place where we can be comprehensive about all the tricks you can do.

Yeah, it looks weird here for sure.
I am also a little bit worried how this will go down with users. When you have a million metrics and have to figure out what to apply the workarounds to , it gets hairy.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama
Copy link
Member Author

@beorn7 ready for another round of review. One thing that is missing I think is to give some recommendation on which path to take. I'd like to recommend the first path, which is to weather the storm and do nothing other then updating alerts/rules/dashboards that would become completely broken. Maybe even mention upcoming release?

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for a bit more involved comments. I have remembered that we have active mitigation at least in histogram_quantile, so we should change the tone from "wrong results" to "less accurate" and avoid creating unnecessary panic.

_Note about the format of `le` and `quantile` label values:_

In certain situations, the protobuf parsing changes the number formatting of
the `le` labels of all conventional histograms and the `quantile` labels of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave out the "all".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is conveying that not only native histogram instrumented histograms are effected. I'm removing this, keeping the problem in mind.

ranges spanning over the transition period. For example the quantile query
`histogram_quantile(0.95, sum by (le) (rate(histogram_bucket)))` will be wrong
because `rate` will see a drop to stale in some buckets, while it will see some
completely new buckets. Time ranges not including the transition will be fine.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for only remembering now. Since this problem is actually not new (as discussed, happened the 1st time when moving from Prom v1 to Prom v2 many years ago, and happened the 2nd time when OpenMetrics got introduced), there was an effort to make histogram_quantile handle it: #5158

The problem is, however, that the rate calculation happens first, and it does it separately for differently formatted le labels. As long as there are many samples within the range, the error should be small. With just a few samples, you'll run into problems that a rate calculation might not be possible for one format or both, and you lose more data even if it is possible. The latter happens typically with a short range, which luckily also means that there will be a very short time where you are actually suffering from the error. While the former happens for long ranges, where you will have a long time of suffering from the error, but the error will also be small.

So in sum, the problem isn't that bad.

We definitely cannot complain all the details here, but I would refrain from plainly calling the result "wrong". More something like: "Aggregation by the le and quantile labels for vectors that contain the old and new formatting will lead to unexpected results, and range vectors that span the transition between the different formatting will contain additional series. The most common use case for both is the quantile calculation via histogram_quantile, e.g. histogram_quantile(0.95, sum by (le) (rate(histogram_bucket[10m]))). The histogram_quantile function already tries to mitigate the effects to some extent, but there will be inaccuracies, in particular for shorter ranges that cover only a few samples."

krajorama and others added 2 commits October 25, 2023 16:35
Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@beorn7
Copy link
Member

beorn7 commented Oct 25, 2023

One thing that is missing I think is to give some recommendation on which path to take. I'd like to recommend the first path, which is to weather the storm and do nothing other then updating alerts/rules/dashboards that would become completely broken. Maybe even mention upcoming release?

Yeah, I think we should mention that the OM formatting is the desired one, and everyone should move to it eventually (and then maybe mention that Prometheus 3 will always do the kind of normalization you get now with the histogram feature flag anyway). So the relabeling could be seen as either delaying the migration to a more suitable moment, or it is something that you have to apply forever…

Of course, we need something less wordy than the above. :)

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama
Copy link
Member Author

One thing that is missing I think is to give some recommendation on which path to take. I'd like to recommend the first path, which is to weather the storm and do nothing other then updating alerts/rules/dashboards that would become completely broken. Maybe even mention upcoming release?

Yeah, I think we should mention that the OM formatting is the desired one, and everyone should move to it eventually (and then maybe mention that Prometheus 3 will always do the kind of normalization you get now with the histogram feature flag anyway). So the relabeling could be seen as either delaying the migration to a more suitable moment, or it is something that you have to apply forever…

Of course, we need something less wordy than the above. :)

I've tried to give a short recommendation without going into details.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Just needs a few more backticks, then we can merge.

Add testcase and update test so that it can test native histograms as well.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama
Copy link
Member Author

Fixed merge conflict

@krajorama
Copy link
Member Author

And removed stray commit from another job :)

…ock"

This reverts commit b7eb422.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama krajorama force-pushed the native-histograms-vs-labels branch from 46e9af2 to d7d3826 Compare November 1, 2023 16:21
@krajorama
Copy link
Member Author

Fix DCO I hope

@beorn7 beorn7 merged commit e399395 into prometheus:main Nov 1, 2023
@beorn7
Copy link
Member

beorn7 commented Nov 1, 2023

Thank you very much.

krajorama added a commit to krajorama/prometheus that referenced this pull request Nov 28, 2023
* Document le and quantile label transition due to native histograms

Fixes: prometheus#12984

For full explanation see the related issue. The le and quantile labels
are formatted as float with trailing .0 for whole number values when
native histograms is enabled, e.g. 10.0. This changes the resulting series
in Prometheus if previously we scraped the whole number itself, e.g. 10
over the text format.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

---------

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>
beorn7 added a commit that referenced this pull request Dec 4, 2023
[2.48] Backport: Native histograms vs labels (#13005)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

scrape: make le and quantile normalization configurable
3 participants