Skip to content

Conversation

krajorama
Copy link
Member

@krajorama krajorama commented Jun 2, 2025

The problem is in the counter reset detection. The code that loads the
samples is matrixIterSlice which uses the typed Buffer iterator, which
will preload the integer histogram samples, however the last sample is
always(!) loaded as a float histogram sample in matrixIterSlice and the
optimized iterator fails to detect counter resets in that case.

This means that the error happens fairly rarely:

  • query needs to have rate/irate/increase
  • query needs to have histogram_count, histogram_sum, histogram_avg around the rate
  • last sample in the range should trigger reset from previous sample

Also the iterator does not reset lastH, lastFH properly.

In light of this bug, I wonder if we should rewrite the iterator to always return float histograms. Or maybe add a version that does that. On the other hand maybe it's better to rewrite the engine to use if for distinguishing between float samples and other samples and use generics for the "other" samples.

Fixes: #16681

@krajorama krajorama force-pushed the krajo-fix-histogram-count-bug-main branch 2 times, most recently from 35765d1 to 696fbbb Compare June 2, 2025 13:25
@krajorama
Copy link
Member Author

Note the test does not fail if I disable the count/sum optimization:

			case "histogram_count", "histogram_sum", "histogram_avg":
				n.SkipHistogramBuckets = false  // instead of true

@krajorama
Copy link
Member Author

krajorama commented Jun 2, 2025

Python script used to convert user's data acquired by range vector selector on query API into test data:

import json
import math
from string import Template
import sys



def convert_json_to_promql_unittest(json_file_path):
    with open(json_file_path, 'r') as json_file:
        data = json.load(json_file)

    if "status" not in data or data["status"] != "success":
        print("Invalid JSON format: 'status' key is missing or not 'success'")
        return

    result = data.get("data", {}).get("result", [])
    if not result:
        print("No results found in the JSON data.")
        return

    for item in result:
        convertMetric(item)


def convertMetric(item):
    metric = item.get("metric", {})
    if not metric:
        print("No metric found in the item.")
        return
    metric_name = metric.get("__name__", "unknown_metric")
    labels = ["{}=\"{}\"".format(k, v) for k, v in metric.items() if k != "__name__"]
    series = metric_name + ",".join(labels)

    
    loadValues = []

    histograms = item.get("histograms", [])
    schema = 0
    prevTs = 0
    mockTs = 0
    idx = 0
    for histogram in histograms:
        Ts = int(histogram[0])
        # print("dTs: {}".format(Ts - prevTs))
        # if prevTs != 0:
        #     dTs = Ts - prevTs
        #     if dTs > 29:
        #         loadValues.append("NaN")
        #         for i in range(2, int(dTs/15)):
        #             loadValues.append("_")
        FH = histogram[1]
        schema = getSchema(FH, schema)

        if prevTs != 0:
            mockTs += Ts-prevTs

        printHistogram(FH, schema, mockTs, idx)
        idx += 1
        
        prevTs = Ts

    # series += " " + " ".join(loadValues)
    # print(series)
    # print("len: {}".format(len(loadValues)))

def getSchema(FH, schema):
    if "buckets" not in FH:
        return schema
    b = None
    for bucket in FH["buckets"]:
        l = float(bucket[1])
        h = float(bucket[2])
        if l < 0 and h > 0:
            # Zero bucket
            continue
        l=math.fabs(l)
        h=math.fabs(h)
        f=h/l
        ex=round(math.log2(math.log2(f)), 0)
        return int(-ex)
    return schema

def printHistogram(FH, schema, timestamp, idx):
    offset, deltas = convertHistogram(FH, schema)

    # t = Template('''h$idx = prompb.Histogram{
	# 	Count: &prompb.Histogram_CountInt{CountInt: $count},
	# 	Sum: $sum,
	# 	Schema: $schema,
	# 	ZeroThreshold: 0.0001,
	# 	ZeroCount: &prompb.Histogram_ZeroCountInt{ZeroCountInt: 0},
	# 	NegativeSpans: nil,
	# 	PositiveSpans: []prompb.BucketSpan{
	# 		{Offset: $offset, Length: $length},
	# 	},
	# 	NegativeDeltas: nil,
	# 	NegativeCounts: nil,
	# 	PositiveDeltas: []int64{$positiveDeltas},
	# 	PositiveCounts: nil,
	# 	ResetHint: prompb.Histogram_UNKNOWN,
	# 	Timestamp: $timestamp,
	# }''')
    # tZero = Template('''h$idx = prompb.Histogram{
    #     Count: &prompb.Histogram_CountInt{CountInt: $count},
	# 	Sum: $sum,
	# 	Schema: $schema,
	# 	ZeroThreshold: 0.0001,
	# 	ZeroCount: &prompb.Histogram_ZeroCountInt{ZeroCountInt: 0},
	# 	NegativeSpans: nil,
	# 	PositiveSpans: nil,
	# 	NegativeDeltas: nil,
	# 	NegativeCounts: nil,
	# 	PositiveDeltas: nil,
	# 	PositiveCounts: nil,
	# 	ResetHint: prompb.Histogram_UNKNOWN,
	# 	Timestamp: $timestamp,
	# }''')
    t = Template('''{Ts: $timestamp, H: &histogram.Histogram{
    CounterResetHint: histogram.UnknownCounterReset,
    Schema: $schema,
    ZeroThreshold: 0.0001,
    ZeroCount: 0,
    Count: $count,
    Sum: $sum,
    PositiveSpans: []histogram.Span{
	    {Offset: $offset, Length: $length},
    },
    NegativeSpans: nil,
    PositiveBuckets: []int64{
        $positiveDeltas,
    },
    NegativeBuckets: nil,
    CustomValues: nil,
}},
''')
    tZero = Template('''{Ts: $timestamp, H: &histogram.Histogram{
                         CounterResetHint: histogram.UnknownCounterReset,
    Schema: $schema,
    ZeroThreshold: 0.0001,
    ZeroCount: 0,
    Count: $count,
    Sum: $sum,
    PositiveSpans: nil,
    NegativeSpans: nil,
    PositiveBuckets: nil,
    NegativeBuckets: nil,
    CustomValues: nil,
}},
''')
    if len(deltas) == 0:
        print(tZero.substitute(
                idx=idx,
                schema=schema,
                count=FH["count"],
                sum=FH["sum"],
                timestamp=timestamp,
        ))
        return
    print(t.substitute(
                idx=idx,
                schema=schema,
                count=FH["count"],
                sum=FH["sum"],
                offset=offset,
                length=len(deltas),
                positiveDeltas=",".join([str(x) for x in deltas]),
                timestamp=timestamp,
    ))

def convertHistogram(FH, schema):
    count = FH["count"]
    sum = FH["sum"]
    value = "{{" + "schema:{} count:{} sum:{}".format(schema, count, sum)
    base = math.pow(2, math.pow(2, -schema))
    offset = None
    prevOffset = None
    positives = []
    for bucket in FH["buckets"]:
        l = float(bucket[1])
        h = float(bucket[2])
        if l < 0 and h > 0:
            # Zero bucket
            continue
        l=math.fabs(l)
        h=math.fabs(h)
        delta = h - l
        thisOffset = int(round(math.log(delta, base), 0))
        if offset is None:
            offset = thisOffset
            prevOffset = thisOffset
        else:
            for i in range(prevOffset, thisOffset):
                positives.append(0)
            prevOffset = thisOffset
        positives.append(int(bucket[3]))

    deltas = []
    if len(positives) > 0:
        deltas.append(positives[0])
        for i in range(1, len(positives)):
            deltas.append(positives[i] - positives[i - 1])

    return (offset, deltas)



convert_json_to_promql_unittest(sys.argv[1])

@krajorama krajorama changed the title fix(promql): histogram_count inconsistent fix(promql): histogram_count and histogram_sum inconsistent Jun 2, 2025
@krajorama
Copy link
Member Author

I think the problem is in the counter reset detection. The code that loads the samples matrixIterSlice uses the typed Buffer iterator, which will preload the integer histogram samples, however the last sample is always(!) loaded as a float histogram sample and the optimized iterator fails to detect counter resets in that case.

Proof: I can hack func (f *histogramStatsIterator) getFloatResetHint(hint histogram.CounterResetHint) histogram.CounterResetHint to make the test pass here.

@krajorama
Copy link
Member Author

Also the bug explains why I couldn't reproduce with the unit test: the unit test uses float histograms for everything, this transition from integer to float does not happen.

@krajorama krajorama force-pushed the krajo-fix-histogram-count-bug-main branch from 696fbbb to 867e4a8 Compare June 3, 2025 10:04
Ref: #16681

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama krajorama force-pushed the krajo-fix-histogram-count-bug-main branch 2 times, most recently from 4d29962 to acb6f9d Compare June 3, 2025 10:25
The problem is in the counter reset detection. The code that loads the
samples is matrixIterSlice which uses the typed Buffer iterator, which
will preload the integer histogram samples, however the last sample is
always(!) loaded as a float histogram sample in matrixIterSlice and the
optimized iterator fails to detect counter resets in that case.

Also the iterator does not reset lastH, lastFH properly.

Ref: #16681

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama krajorama force-pushed the krajo-fix-histogram-count-bug-main branch from acb6f9d to df94ece Compare June 3, 2025 10:31
@krajorama krajorama marked this pull request as ready for review June 3, 2025 10:56
@krajorama krajorama requested a review from roidelapluie as a code owner June 3, 2025 10:56
@krajorama krajorama requested a review from beorn7 June 3, 2025 10:58
@krajorama
Copy link
Member Author

cc @fpetkovski

@krajorama
Copy link
Member Author

cc @charleskorn

@krajorama krajorama marked this pull request as draft June 4, 2025 08:07
…tect

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama krajorama marked this pull request as ready for review June 4, 2025 08:31
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Epic bug hunt.

Thank you very much.

@fpetkovski
Copy link
Contributor

Thank you @krajorama!

@krajorama krajorama merged commit ecd99bd into main Jun 4, 2025
45 checks passed
@krajorama krajorama deleted the krajo-fix-histogram-count-bug-main branch June 4, 2025 12:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

native histograms: histogram_count and histogram_sum wrong result
3 participants