-
Notifications
You must be signed in to change notification settings - Fork 9.8k
Native histograms vs labels #13005
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Native histograms vs labels #13005
Conversation
Fixes: prometheus#12984 For full explanation see the related issue. The le and quantile labels are formatted as float with trailing .0 for whole number values when native histograms is enabled, e.g. 10.0. This changes the resulting series in Prometheus if previously we scraped the whole number itself, e.g. 10 over the text format. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add documentation how to relabel forward (so people can move towards the new behavior)? Or would we leave that to a feature flag that enables the normalization for all scrape formats pre-3.0?
I think forward-relabeling would also be cool so that people could proactively prepare for switching on native histograms. |
I now have second thoughts on this. This section is already getting quite long, and it is only about the native histogram feature flag. But the problem isn't even about native histograms directly, it's more a problem with the migration to OpenMetrics. So I would say we should have a migration document for OpenMetrics somewhere, and that's a place where we can be comprehensive about all the tricks you can do. |
Yeah, it looks weird here for sure. |
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@beorn7 ready for another round of review. One thing that is missing I think is to give some recommendation on which path to take. I'd like to recommend the first path, which is to weather the storm and do nothing other then updating alerts/rules/dashboards that would become completely broken. Maybe even mention upcoming release? |
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for a bit more involved comments. I have remembered that we have active mitigation at least in histogram_quantile
, so we should change the tone from "wrong results" to "less accurate" and avoid creating unnecessary panic.
docs/feature_flags.md
Outdated
_Note about the format of `le` and `quantile` label values:_ | ||
|
||
In certain situations, the protobuf parsing changes the number formatting of | ||
the `le` labels of all conventional histograms and the `quantile` labels of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave out the "all".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is conveying that not only native histogram instrumented histograms are effected. I'm removing this, keeping the problem in mind.
docs/feature_flags.md
Outdated
ranges spanning over the transition period. For example the quantile query | ||
`histogram_quantile(0.95, sum by (le) (rate(histogram_bucket)))` will be wrong | ||
because `rate` will see a drop to stale in some buckets, while it will see some | ||
completely new buckets. Time ranges not including the transition will be fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for only remembering now. Since this problem is actually not new (as discussed, happened the 1st time when moving from Prom v1 to Prom v2 many years ago, and happened the 2nd time when OpenMetrics got introduced), there was an effort to make histogram_quantile
handle it: #5158
The problem is, however, that the rate
calculation happens first, and it does it separately for differently formatted le
labels. As long as there are many samples within the range, the error should be small. With just a few samples, you'll run into problems that a rate calculation might not be possible for one format or both, and you lose more data even if it is possible. The latter happens typically with a short range, which luckily also means that there will be a very short time where you are actually suffering from the error. While the former happens for long ranges, where you will have a long time of suffering from the error, but the error will also be small.
So in sum, the problem isn't that bad.
We definitely cannot complain all the details here, but I would refrain from plainly calling the result "wrong". More something like: "Aggregation by the le
and quantile
labels for vectors that contain the old and new formatting will lead to unexpected results, and range vectors that span the transition between the different formatting will contain additional series. The most common use case for both is the quantile calculation via histogram_quantile
, e.g. histogram_quantile(0.95, sum by (le) (rate(histogram_bucket[10m])))
. The histogram_quantile
function already tries to mitigate the effects to some extent, but there will be inaccuracies, in particular for shorter ranges that cover only a few samples."
Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Yeah, I think we should mention that the OM formatting is the desired one, and everyone should move to it eventually (and then maybe mention that Prometheus 3 will always do the kind of normalization you get now with the histogram feature flag anyway). So the relabeling could be seen as either delaying the migration to a more suitable moment, or it is something that you have to apply forever… Of course, we need something less wordy than the above. :) |
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
I've tried to give a short recommendation without going into details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. Just needs a few more backticks, then we can merge.
Add testcase and update test so that it can test native histograms as well. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Fixed merge conflict |
And removed stray commit from another job :) |
…ock" This reverts commit b7eb422. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
46e9af2
to
d7d3826
Compare
Fix DCO I hope |
Thank you very much. |
* Document le and quantile label transition due to native histograms Fixes: prometheus#12984 For full explanation see the related issue. The le and quantile labels are formatted as float with trailing .0 for whole number values when native histograms is enabled, e.g. 10.0. This changes the resulting series in Prometheus if previously we scraped the whole number itself, e.g. 10 over the text format. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com> --------- Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com> Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>
[2.48] Backport: Native histograms vs labels (#13005)
Fixes: #12984
For full explanation see the related issue. The le and quantile labels
are formatted as float with trailing .0 for whole number values when
native histograms is enabled, e.g. 10.0. This changes the resulting series
in Prometheus if previously we scraped the whole number itself, e.g. 10
over the text format.