fix(promql): histogram_count and histogram_sum inconsistent #16682

krajorama · 2025-06-02T13:20:29Z

The problem is in the counter reset detection. The code that loads the
samples is matrixIterSlice which uses the typed Buffer iterator, which
will preload the integer histogram samples, however the last sample is
always(!) loaded as a float histogram sample in matrixIterSlice and the
optimized iterator fails to detect counter resets in that case.

This means that the error happens fairly rarely:

query needs to have rate/irate/increase
query needs to have histogram_count, histogram_sum, histogram_avg around the rate
last sample in the range should trigger reset from previous sample

Also the iterator does not reset lastH, lastFH properly.

In light of this bug, I wonder if we should rewrite the iterator to always return float histograms. Or maybe add a version that does that. On the other hand maybe it's better to rewrite the engine to use if for distinguishing between float samples and other samples and use generics for the "other" samples.

Fixes: #16681

krajorama · 2025-06-02T13:36:57Z

Note the test does not fail if I disable the count/sum optimization:

			case "histogram_count", "histogram_sum", "histogram_avg":
				n.SkipHistogramBuckets = false  // instead of true

krajorama · 2025-06-02T13:39:46Z

Python script used to convert user's data acquired by range vector selector on query API into test data:

import json
import math
from string import Template
import sys



def convert_json_to_promql_unittest(json_file_path):
    with open(json_file_path, 'r') as json_file:
        data = json.load(json_file)

    if "status" not in data or data["status"] != "success":
        print("Invalid JSON format: 'status' key is missing or not 'success'")
        return

    result = data.get("data", {}).get("result", [])
    if not result:
        print("No results found in the JSON data.")
        return

    for item in result:
        convertMetric(item)


def convertMetric(item):
    metric = item.get("metric", {})
    if not metric:
        print("No metric found in the item.")
        return
    metric_name = metric.get("__name__", "unknown_metric")
    labels = ["{}=\"{}\"".format(k, v) for k, v in metric.items() if k != "__name__"]
    series = metric_name + ",".join(labels)

    
    loadValues = []

    histograms = item.get("histograms", [])
    schema = 0
    prevTs = 0
    mockTs = 0
    idx = 0
    for histogram in histograms:
        Ts = int(histogram[0])
        # print("dTs: {}".format(Ts - prevTs))
        # if prevTs != 0:
        #     dTs = Ts - prevTs
        #     if dTs > 29:
        #         loadValues.append("NaN")
        #         for i in range(2, int(dTs/15)):
        #             loadValues.append("_")
        FH = histogram[1]
        schema = getSchema(FH, schema)

        if prevTs != 0:
            mockTs += Ts-prevTs

        printHistogram(FH, schema, mockTs, idx)
        idx += 1
        
        prevTs = Ts

    # series += " " + " ".join(loadValues)
    # print(series)
    # print("len: {}".format(len(loadValues)))

def getSchema(FH, schema):
    if "buckets" not in FH:
        return schema
    b = None
    for bucket in FH["buckets"]:
        l = float(bucket[1])
        h = float(bucket[2])
        if l < 0 and h > 0:
            # Zero bucket
            continue
        l=math.fabs(l)
        h=math.fabs(h)
        f=h/l
        ex=round(math.log2(math.log2(f)), 0)
        return int(-ex)
    return schema

def printHistogram(FH, schema, timestamp, idx):
    offset, deltas = convertHistogram(FH, schema)

    # t = Template('''h$idx = prompb.Histogram{
	# 	Count: &prompb.Histogram_CountInt{CountInt: $count},
	# 	Sum: $sum,
	# 	Schema: $schema,
	# 	ZeroThreshold: 0.0001,
	# 	ZeroCount: &prompb.Histogram_ZeroCountInt{ZeroCountInt: 0},
	# 	NegativeSpans: nil,
	# 	PositiveSpans: []prompb.BucketSpan{
	# 		{Offset: $offset, Length: $length},
	# 	},
	# 	NegativeDeltas: nil,
	# 	NegativeCounts: nil,
	# 	PositiveDeltas: []int64{$positiveDeltas},
	# 	PositiveCounts: nil,
	# 	ResetHint: prompb.Histogram_UNKNOWN,
	# 	Timestamp: $timestamp,
	# }''')
    # tZero = Template('''h$idx = prompb.Histogram{
    #     Count: &prompb.Histogram_CountInt{CountInt: $count},
	# 	Sum: $sum,
	# 	Schema: $schema,
	# 	ZeroThreshold: 0.0001,
	# 	ZeroCount: &prompb.Histogram_ZeroCountInt{ZeroCountInt: 0},
	# 	NegativeSpans: nil,
	# 	PositiveSpans: nil,
	# 	NegativeDeltas: nil,
	# 	NegativeCounts: nil,
	# 	PositiveDeltas: nil,
	# 	PositiveCounts: nil,
	# 	ResetHint: prompb.Histogram_UNKNOWN,
	# 	Timestamp: $timestamp,
	# }''')
    t = Template('''{Ts: $timestamp, H: &histogram.Histogram{
    CounterResetHint: histogram.UnknownCounterReset,
    Schema: $schema,
    ZeroThreshold: 0.0001,
    ZeroCount: 0,
    Count: $count,
    Sum: $sum,
    PositiveSpans: []histogram.Span{
	    {Offset: $offset, Length: $length},
    },
    NegativeSpans: nil,
    PositiveBuckets: []int64{
        $positiveDeltas,
    },
    NegativeBuckets: nil,
    CustomValues: nil,
}},
''')
    tZero = Template('''{Ts: $timestamp, H: &histogram.Histogram{
                         CounterResetHint: histogram.UnknownCounterReset,
    Schema: $schema,
    ZeroThreshold: 0.0001,
    ZeroCount: 0,
    Count: $count,
    Sum: $sum,
    PositiveSpans: nil,
    NegativeSpans: nil,
    PositiveBuckets: nil,
    NegativeBuckets: nil,
    CustomValues: nil,
}},
''')
    if len(deltas) == 0:
        print(tZero.substitute(
                idx=idx,
                schema=schema,
                count=FH["count"],
                sum=FH["sum"],
                timestamp=timestamp,
        ))
        return
    print(t.substitute(
                idx=idx,
                schema=schema,
                count=FH["count"],
                sum=FH["sum"],
                offset=offset,
                length=len(deltas),
                positiveDeltas=",".join([str(x) for x in deltas]),
                timestamp=timestamp,
    ))

def convertHistogram(FH, schema):
    count = FH["count"]
    sum = FH["sum"]
    value = "{{" + "schema:{} count:{} sum:{}".format(schema, count, sum)
    base = math.pow(2, math.pow(2, -schema))
    offset = None
    prevOffset = None
    positives = []
    for bucket in FH["buckets"]:
        l = float(bucket[1])
        h = float(bucket[2])
        if l < 0 and h > 0:
            # Zero bucket
            continue
        l=math.fabs(l)
        h=math.fabs(h)
        delta = h - l
        thisOffset = int(round(math.log(delta, base), 0))
        if offset is None:
            offset = thisOffset
            prevOffset = thisOffset
        else:
            for i in range(prevOffset, thisOffset):
                positives.append(0)
            prevOffset = thisOffset
        positives.append(int(bucket[3]))

    deltas = []
    if len(positives) > 0:
        deltas.append(positives[0])
        for i in range(1, len(positives)):
            deltas.append(positives[i] - positives[i - 1])

    return (offset, deltas)



convert_json_to_promql_unittest(sys.argv[1])

krajorama · 2025-06-03T06:48:35Z

I think the problem is in the counter reset detection. The code that loads the samples matrixIterSlice uses the typed Buffer iterator, which will preload the integer histogram samples, however the last sample is always(!) loaded as a float histogram sample and the optimized iterator fails to detect counter resets in that case.

Proof: I can hack func (f *histogramStatsIterator) getFloatResetHint(hint histogram.CounterResetHint) histogram.CounterResetHint to make the test pass here.

krajorama · 2025-06-03T09:37:58Z

Also the bug explains why I couldn't reproduce with the unit test: the unit test uses float histograms for everything, this transition from integer to float does not happen.

Ref: #16681 Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

The problem is in the counter reset detection. The code that loads the samples is matrixIterSlice which uses the typed Buffer iterator, which will preload the integer histogram samples, however the last sample is always(!) loaded as a float histogram sample in matrixIterSlice and the optimized iterator fails to detect counter resets in that case. Also the iterator does not reset lastH, lastFH properly. Ref: #16681 Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

krajorama · 2025-06-03T11:01:37Z

cc @fpetkovski

krajorama · 2025-06-04T08:05:49Z

cc @charleskorn

…tect Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

beorn7

Epic bug hunt.

Thank you very much.

fpetkovski · 2025-06-04T12:00:25Z

Thank you @krajorama!

krajorama force-pushed the krajo-fix-histogram-count-bug-main branch 2 times, most recently from 35765d1 to 696fbbb Compare June 2, 2025 13:25

krajorama changed the title ~~fix(promql): histogram_count inconsistent~~ fix(promql): histogram_count and histogram_sum inconsistent Jun 2, 2025

krajorama force-pushed the krajo-fix-histogram-count-bug-main branch from 696fbbb to 867e4a8 Compare June 3, 2025 10:04

test(promql): histogram_count inconsistent

e6b8383

Ref: #16681 Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

krajorama force-pushed the krajo-fix-histogram-count-bug-main branch 2 times, most recently from 4d29962 to acb6f9d Compare June 3, 2025 10:25

krajorama force-pushed the krajo-fix-histogram-count-bug-main branch from acb6f9d to df94ece Compare June 3, 2025 10:31

krajorama marked this pull request as ready for review June 3, 2025 10:56

krajorama requested a review from roidelapluie as a code owner June 3, 2025 10:56

krajorama requested a review from beorn7 June 3, 2025 10:58

krajorama mentioned this pull request Jun 3, 2025

fix(promql): histogram_count returns wrong value grafana/mimir#11601

Closed

4 tasks

krajorama requested a review from bwplotka June 4, 2025 08:05

krajorama marked this pull request as draft June 4, 2025 08:07

fix(promql): do not loose information about buckets when doing the de…

bccb1d6

…tect Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

krajorama marked this pull request as ready for review June 4, 2025 08:31

beorn7 approved these changes Jun 4, 2025

View reviewed changes

krajorama merged commit ecd99bd into main Jun 4, 2025
45 checks passed

krajorama deleted the krajo-fix-histogram-count-bug-main branch June 4, 2025 12:40

beorn7 mentioned this pull request Jun 5, 2025

promql: reuse histogramStatsIterator where possible, and expose it for other implementations to use #16686

Merged

This was referenced Jun 11, 2025

Sync prometheus at 19848bb grafana/mimir-prometheus#888

Merged

Sync upstream prometheus @ 94d5e0f41cc2adf238e596353578d30b0c8dbf53 grafana/mimir-prometheus#889

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(promql): histogram_count and histogram_sum inconsistent #16682

fix(promql): histogram_count and histogram_sum inconsistent #16682

Uh oh!

krajorama commented Jun 2, 2025 •

edited

Loading

Uh oh!

krajorama commented Jun 2, 2025

Uh oh!

krajorama commented Jun 2, 2025 •

edited

Loading

Uh oh!

krajorama commented Jun 3, 2025

Uh oh!

krajorama commented Jun 3, 2025

Uh oh!

krajorama commented Jun 3, 2025

Uh oh!

krajorama commented Jun 4, 2025

Uh oh!

beorn7 left a comment

Uh oh!

fpetkovski commented Jun 4, 2025

Uh oh!

Uh oh!

Uh oh!

fix(promql): histogram_count and histogram_sum inconsistent #16682

fix(promql): histogram_count and histogram_sum inconsistent #16682

Uh oh!

Conversation

krajorama commented Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

krajorama commented Jun 2, 2025

Uh oh!

krajorama commented Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

krajorama commented Jun 3, 2025

Uh oh!

krajorama commented Jun 3, 2025

Uh oh!

krajorama commented Jun 3, 2025

Uh oh!

krajorama commented Jun 4, 2025

Uh oh!

beorn7 left a comment

Choose a reason for hiding this comment

Uh oh!

fpetkovski commented Jun 4, 2025

Uh oh!

Uh oh!

Uh oh!

krajorama commented Jun 2, 2025 •

edited

Loading

krajorama commented Jun 2, 2025 •

edited

Loading