Native histograms vs labels #13005

krajorama · 2023-10-18T12:01:25Z

For full explanation see the related issue. The le and quantile labels
are formatted as float with trailing .0 for whole number values when
native histograms is enabled, e.g. 10.0. This changes the resulting series
in Prometheus if previously we scraped the whole number itself, e.g. 10
over the text format.

Fixes: prometheus#12984 For full explanation see the related issue. The le and quantile labels are formatted as float with trailing .0 for whole number values when native histograms is enabled, e.g. 10.0. This changes the resulting series in Prometheus if previously we scraped the whole number itself, e.g. 10 over the text format. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

matthiasr

Could you also add documentation how to relabel forward (so people can move towards the new behavior)? Or would we leave that to a feature flag that enables the normalization for all scrape formats pre-3.0?

beorn7 · 2023-10-18T14:49:46Z

I think forward-relabeling would also be cool so that people could proactively prepare for switching on native histograms.

docs/feature_flags.md

beorn7 · 2023-10-19T13:20:39Z

I think forward-relabeling would also be cool so that people could proactively prepare for switching on native histograms.

I now have second thoughts on this. This section is already getting quite long, and it is only about the native histogram feature flag.

But the problem isn't even about native histograms directly, it's more a problem with the migration to OpenMetrics. So I would say we should have a migration document for OpenMetrics somewhere, and that's a place where we can be comprehensive about all the tricks you can do.

krajorama · 2023-10-24T12:52:49Z

I think forward-relabeling would also be cool so that people could proactively prepare for switching on native histograms.

I now have second thoughts on this. This section is already getting quite long, and it is only about the native histogram feature flag.

But the problem isn't even about native histograms directly, it's more a problem with the migration to OpenMetrics. So I would say we should have a migration document for OpenMetrics somewhere, and that's a place where we can be comprehensive about all the tricks you can do.

Yeah, it looks weird here for sure.
I am also a little bit worried how this will go down with users. When you have a million metrics and have to figure out what to apply the workarounds to , it gets hairy.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

krajorama · 2023-10-25T07:37:16Z

@beorn7 ready for another round of review. One thing that is missing I think is to give some recommendation on which path to take. I'd like to recommend the first path, which is to weather the storm and do nothing other then updating alerts/rules/dashboards that would become completely broken. Maybe even mention upcoming release?

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

docs/feature_flags.md

beorn7

Sorry for a bit more involved comments. I have remembered that we have active mitigation at least in histogram_quantile, so we should change the tone from "wrong results" to "less accurate" and avoid creating unnecessary panic.

beorn7 · 2023-10-25T12:27:53Z

docs/feature_flags.md

+_Note about the format of `le` and `quantile` label values:_
+
+In certain situations, the protobuf parsing changes the number formatting of
+the `le` labels of all conventional histograms and the `quantile` labels of


I would leave out the "all".

The problem is conveying that not only native histogram instrumented histograms are effected. I'm removing this, keeping the problem in mind.

docs/feature_flags.md

beorn7 · 2023-10-25T12:58:00Z

docs/feature_flags.md

+ranges spanning over the transition period. For example the quantile query
+`histogram_quantile(0.95, sum by (le) (rate(histogram_bucket)))` will be wrong
+because `rate` will see a drop to stale in some buckets, while it will see some
+completely new buckets. Time ranges not including the transition will be fine.


Sorry for only remembering now. Since this problem is actually not new (as discussed, happened the 1st time when moving from Prom v1 to Prom v2 many years ago, and happened the 2nd time when OpenMetrics got introduced), there was an effort to make histogram_quantile handle it: #5158

The problem is, however, that the rate calculation happens first, and it does it separately for differently formatted le labels. As long as there are many samples within the range, the error should be small. With just a few samples, you'll run into problems that a rate calculation might not be possible for one format or both, and you lose more data even if it is possible. The latter happens typically with a short range, which luckily also means that there will be a very short time where you are actually suffering from the error. While the former happens for long ranges, where you will have a long time of suffering from the error, but the error will also be small.

So in sum, the problem isn't that bad.

We definitely cannot complain all the details here, but I would refrain from plainly calling the result "wrong". More something like: "Aggregation by the le and quantile labels for vectors that contain the old and new formatting will lead to unexpected results, and range vectors that span the transition between the different formatting will contain additional series. The most common use case for both is the quantile calculation via histogram_quantile, e.g. histogram_quantile(0.95, sum by (le) (rate(histogram_bucket[10m]))). The histogram_quantile function already tries to mitigate the effects to some extent, but there will be inaccuracies, in particular for shorter ranges that cover only a few samples."

docs/feature_flags.md

Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

beorn7 · 2023-10-25T14:55:26Z

One thing that is missing I think is to give some recommendation on which path to take. I'd like to recommend the first path, which is to weather the storm and do nothing other then updating alerts/rules/dashboards that would become completely broken. Maybe even mention upcoming release?

Yeah, I think we should mention that the OM formatting is the desired one, and everyone should move to it eventually (and then maybe mention that Prometheus 3 will always do the kind of normalization you get now with the histogram feature flag anyway). So the relabeling could be seen as either delaying the migration to a more suitable moment, or it is something that you have to apply forever…

Of course, we need something less wordy than the above. :)

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

krajorama · 2023-10-26T11:39:18Z

One thing that is missing I think is to give some recommendation on which path to take. I'd like to recommend the first path, which is to weather the storm and do nothing other then updating alerts/rules/dashboards that would become completely broken. Maybe even mention upcoming release?

Yeah, I think we should mention that the OM formatting is the desired one, and everyone should move to it eventually (and then maybe mention that Prometheus 3 will always do the kind of normalization you get now with the histogram feature flag anyway). So the relabeling could be seen as either delaying the migration to a more suitable moment, or it is something that you have to apply forever…

Of course, we need something less wordy than the above. :)

I've tried to give a short recommendation without going into details.

beorn7

Great. Just needs a few more backticks, then we can merge.

docs/feature_flags.md

Add testcase and update test so that it can test native histograms as well. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

krajorama · 2023-11-01T16:19:21Z

Fixed merge conflict

krajorama · 2023-11-01T16:21:05Z

And removed stray commit from another job :)

…ock" This reverts commit b7eb422. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

krajorama · 2023-11-01T16:22:05Z

Fix DCO I hope

beorn7 · 2023-11-01T17:30:45Z

Thank you very much.

* Document le and quantile label transition due to native histograms Fixes: prometheus#12984 For full explanation see the related issue. The le and quantile labels are formatted as float with trailing .0 for whole number values when native histograms is enabled, e.g. 10.0. This changes the resulting series in Prometheus if previously we scraped the whole number itself, e.g. 10 over the text format. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com> --------- Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com> Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>

[2.48] Backport: Native histograms vs labels (#13005)

krajorama added 3 commits October 18, 2023 11:41

Unit test histogram le label conversion

3a91360

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

Add summary quantile test.

e28f35c

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

matthiasr reviewed Oct 18, 2023

View reviewed changes

beorn7 requested changes Oct 19, 2023

View reviewed changes

krajorama added 2 commits October 24, 2023 18:01

Updates from review comments

66e2671

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

Do not callback, be more explicit.

ac79b7c

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

krajorama added 2 commits October 25, 2023 10:06

Merge branch 'main' into native-histograms-vs-labels

18d3d31

Fix test due to merge main

46fdde5

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

krajorama commented Oct 25, 2023

View reviewed changes

docs/feature_flags.md Outdated Show resolved Hide resolved

beorn7 requested changes Oct 25, 2023

View reviewed changes

krajorama commented Oct 25, 2023

View reviewed changes

docs/feature_flags.md Outdated Show resolved Hide resolved

krajorama commented Oct 25, 2023

View reviewed changes

docs/feature_flags.md Outdated Show resolved Hide resolved

krajorama and others added 2 commits October 25, 2023 16:35

Apply suggestions from code review

52426f0

Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>

Updates from review

30315fd

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

Try to give a short recommendation.

e7463f6

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

beorn7 requested changes Oct 31, 2023

View reviewed changes

docs/feature_flags.md Outdated Show resolved Hide resolved

docs/feature_flags.md Outdated Show resolved Hide resolved

docs/feature_flags.md Outdated Show resolved Hide resolved

krajorama added 2 commits November 1, 2023 15:52

tsdb/compact_test.go: test mixed typed series with PopulateBlock

b7eb422

Add testcase and update test so that it can test native histograms as well. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

Consistent formatting.

fdaea8e

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

beorn7 approved these changes Nov 1, 2023

View reviewed changes

Merge branch 'main' into native-histograms-vs-labels

9eec3ab

krajorama requested a review from jesusvazquez as a code owner November 1, 2023 16:17

Revert "tsdb/compact_test.go: test mixed typed series with PopulateBl…

d7d3826

…ock" This reverts commit b7eb422. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

krajorama force-pushed the native-histograms-vs-labels branch from 46e9af2 to d7d3826 Compare November 1, 2023 16:21

beorn7 merged commit e399395 into prometheus:main Nov 1, 2023

beorn7 mentioned this pull request Nov 28, 2023

[2.48] Backport: Native histograms vs labels (#13005) #13202

Merged

beorn7 added a commit that referenced this pull request Dec 4, 2023

Merge pull request #13202 from krajorama/backport-13005-to-2.28

0bd4e73

[2.48] Backport: Native histograms vs labels (#13005)

Native histograms vs labels #13005

Native histograms vs labels #13005

Uh oh!

Conversation

krajorama commented Oct 18, 2023

Uh oh!

matthiasr left a comment

Choose a reason for hiding this comment

Uh oh!

beorn7 commented Oct 18, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

beorn7 commented Oct 19, 2023

Uh oh!

krajorama commented Oct 24, 2023

Uh oh!

krajorama commented Oct 25, 2023

Uh oh!

Uh oh!

beorn7 left a comment

Choose a reason for hiding this comment

Uh oh!

beorn7 Oct 25, 2023

Choose a reason for hiding this comment

Uh oh!

krajorama Oct 25, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

beorn7 Oct 25, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

beorn7 commented Oct 25, 2023

Uh oh!

krajorama commented Oct 26, 2023

Uh oh!

beorn7 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

krajorama commented Nov 1, 2023

Uh oh!

krajorama commented Nov 1, 2023

Uh oh!

krajorama commented Nov 1, 2023

Uh oh!

beorn7 commented Nov 1, 2023

Uh oh!

Uh oh!