
Conversation

harry671003
Contributor

@harry671003 harry671003 commented Jul 16, 2025

Notes

While testing the Thanos PromQL engine, I found a very subtle bug in the Prometheus engine. The native histogram (NH) samples returned from the subquery's iterator are not copied, which causes the Prometheus engine to return incorrect query results.
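To make the failure mode concrete, here is a self-contained sketch (toy types, not the actual Prometheus code; `reusingIter` and `collect` are hypothetical names) of this bug class: an iterator that decodes into a single reused buffer, and a caller that keeps the returned pointers without copying, so every buffered sample ends up aliasing the last value.

```go
// Hedged sketch of the "iterator reuses its buffer, caller must copy" bug class.
package main

import "fmt"

type hist struct{ count float64 }

// reusingIter mimics an iterator that decodes every sample into one reused object.
type reusingIter struct {
	data []float64
	i    int
	buf  hist
}

func (it *reusingIter) Next() bool { it.i++; return it.i <= len(it.data) }

// At returns a pointer into the iterator's reused buffer.
func (it *reusingIter) At() *hist {
	it.buf.count = it.data[it.i-1]
	return &it.buf
}

// collect buffers all samples; with copySamples=false it keeps the raw
// pointers (the bug), with copySamples=true it copies first (the fix).
func collect(it *reusingIter, copySamples bool) []float64 {
	var out []*hist
	for it.Next() {
		h := it.At()
		if copySamples {
			c := *h // copy before buffering
			h = &c
		}
		out = append(out, h)
	}
	counts := make([]float64, len(out))
	for i, h := range out {
		counts[i] = h.count
	}
	return counts
}

func main() {
	// Without copying, all buffered pointers alias the iterator's buffer,
	// so every element reflects the last decoded sample.
	fmt.Println(collect(&reusingIter{data: []float64{1, 2, 3}}, false)) // [3 3 3]
	fmt.Println(collect(&reusingIter{data: []float64{1, 2, 3}}, true))  // [1 2 3]
}
```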

Reproducing the bug

I originally discovered this bug by comparing the results against the Thanos engine, and I've added a test in this PR.
The bug occurs when a subquery is run before other queries.

In the test below, when the subquery min_over_time({__name__="http_request_duration_seconds"}[1h:1m]) is included, the results of rate() and delta() are wrong. I suspect this is because of array reuse in matrixIterSlice() (and possibly other places).

func TestQueriesAgainstThanosEngine(t *testing.T) {
	start := time.Unix(0, 0)
	end := time.Unix(144, 0)
	step := time.Second * 3
	// Negative offset and at modifier are enabled by default
	// since Prometheus v2.33.0 so we also enable them.
	opts := promql.EngineOpts{
		Timeout:              1 * time.Hour,
		MaxSamples:           1e10,
		EnableNegativeOffset: true,
		EnableAtModifier:     true,
	}

	load := `load 2m
			http_request_duration_seconds{pod="nginx-1"} {{schema:0 count:110 sum:818.00 buckets:[1 14 95]}}+{{schema:0 count:110 buckets:[1 14 95]}}x20
			http_request_duration_seconds{pod="nginx-2"} {{schema:0 count:210 sum:1598.00 buckets:[1 19 190]}}+{{schema:0 count:210 buckets:[1 19 190]}}x30`

	queryStrings := []string{
		`min_over_time({__name__="http_request_duration_seconds"}[1h:1m])`,
		`rate({__name__="http_request_duration_seconds"}[3m] offset -16s)`,
		`delta({__name__="http_request_duration_seconds"}[3m] offset -16s)`,
	}

	expected := make([]*promql.Result, len(queryStrings))

	thanosEngine := engine.New(engine.Opts{EngineOpts: opts})
	queryable := promqltest.LoadedStorage(t, load)
	t.Cleanup(func() {
		queryable.Close()
	})

	ctx := context.Background()
	for j, qs := range queryStrings {
		thanosQuery, err := thanosEngine.NewRangeQuery(ctx, queryable, nil, qs, start, end, step)
		testutil.Ok(t, err)
		thanosResult := thanosQuery.Exec(ctx)
		expected[j] = thanosResult
		t.Cleanup(func() {
			thanosQuery.Close()
		})
	}

	for i := 0; i < 1000; i++ {
		for j, qs := range queryStrings {
			t.Run(fmt.Sprintf("%s_%d_%d", "run", i, j), func(t *testing.T) {
				queryable2 := promqltest.LoadedStorage(t, load)
				defer queryable2.Close()
				promEngine := promql.NewEngine(opts)
				promQuery, err := promEngine.NewRangeQuery(ctx, queryable2, nil, qs, start, end, step)
				testutil.Ok(t, err)
				defer promQuery.Close()
				promResult := promQuery.Exec(ctx)

				testutil.WithGoCmp(comparer).Equals(t, promResult, expected[j])
			})
		}
	}
}

Failure

Diff:

        --- Expected
        +++ Actual
        @@ -1,31 +1,31 @@
         (*promql.Result)({pod="nginx-1"} =>
        -{count:0, sum:0} @[105000]
        -{count:0, sum:0} @[108000]
        -{count:0, sum:0} @[111000]
        -{count:0, sum:0} @[114000]
        -{count:0, sum:0} @[117000]
        -{count:0, sum:0} @[120000]
        -{count:0, sum:0} @[123000]
        +{count:0.9166666666666666, sum:0, (0.5,1]:0.008333333333333333, (1,2]:0.11666666666666667, (2,4]:0.7916666666666666} @[105000]
        +{count:0.9166666666666666, sum:0, (0.5,1]:0.008333333333333333, (1,2]:0.11666666666666667, (2,4]:0.7916666666666666} @[108000]
        +{count:0.9166666666666666, sum:0, (0.5,1]:0.008333333333333333, (1,2]:0.11666666666666667, (2,4]:0.7916666666666666} @[111000]
        +{count:0.9166666666666666, sum:0, (0.5,1]:0.008333333333333333, (1,2]:0.11666666666666667, (2,4]:0.7916666666666666} @[114000]
        +{count:0...(output trimmed)

Signed-off-by: 🌲 Harry 🌊 John 🏔 <johrry@amazon.com>
@harry671003 harry671003 force-pushed the fix_series_iterator branch from df388e6 to ae8ed03 Compare July 16, 2025 03:54
@harry671003
Contributor Author

Hi @beorn7 @fpetkovski,
Since you both last worked on this part of the code, I wanted to check if you could kindly take a look at this PR. I’d appreciate any feedback when you get a chance. Thanks!

@beorn7 beorn7 requested review from krajorama and removed request for roidelapluie July 19, 2025 17:30
@beorn7
Member

beorn7 commented Jul 19, 2025

Thanks for catching this. We suspected there are more cases where we don't copy enough; see #16581.

But it needs some investigation whether this PR is the right way to solve this particular instance.

@krajorama might have context on this (more than I do, actually). I'll preliminarily assign this to him.

I'm out sick, so it will take some time before I can look at this myself.

@krajorama
Member

I literally made the same patch more than a year ago, for a different reason: https://github.com/prometheus/prometheus/pull/13659/files 😞

I don't remember exactly why we abandoned that PR, but I'm sure we thought we could sort out the copy/don't-copy problem later, as described in the spec and #16581.

To me, copying is always safe, and the code is the same as mine, so I'm basically just reviewing the test part.

@krajorama
Member

I've used the debugger to step through subquery evaluation, and the histograms returned from the storageSeriesIterator are indeed going directly into the pool used by matrixIterSlice, which means the pool is poisoned.

We could also fix the issue in matrixIterSlice by checking whether the pointer the iterator returned is the same as or different from the pooled one. But that's not the current pattern, and I'd rather stay consistent than complicate the code further. Using the pool also means this fix won't add much overhead.
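The "poisoned pool" described above can be shown with a self-contained sketch (toy types, not Prometheus code; `demoPoisoning` is a hypothetical name): an object that is still referenced by the query results is also Put back into a sync.Pool, so the next Get hands out memory someone else is still reading.

```go
// Minimal sketch of sync.Pool poisoning: Put an object that is still reachable,
// then watch a later consumer's write show through the "finished" result.
package main

import (
	"fmt"
	"sync"
)

// sample stands in for a pooled histogram buffer.
type sample struct{ v float64 }

var pool = sync.Pool{New: func() any { return &sample{} }}

// demoPoisoning returns the value observed through the buffered result
// after the pool recycled its memory for a new consumer.
func demoPoisoning() float64 {
	s := pool.Get().(*sample)
	s.v = 1
	results := []*sample{s} // the result set still references s

	pool.Put(s) // bug: s is Put back while still reachable via results

	next := pool.Get().(*sample) // single goroutine: the pool hands s back
	next.v = 42                  // silently rewrites the buffered result

	return results[0].v
}

func main() {
	fmt.Println(demoPoisoning()) // 42: the earlier result was corrupted
}
```

This also hints at why the behavior depends on garbage collection: sync.Pool may drop items across GC cycles, in which case Get returns a fresh object and the corruption never surfaces.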

@krajorama
Member

I've spent half a day trying to find a simpler test that triggers the error, but the failure isn't as simple as setting the same pointer twice in the matrixSelectorHPool pool in matrixIterSlice.

As far as I understand, the problem really starts in MemoizedSeriesIterator, which is used when evaluating the subquery. This iterator calls into chunkenc.FloatHistogram.AtFloatHistogram, which does a shallow copy of the histogram. That means the result of the subquery contains histograms that share the same physical spans and buckets, so setting any of the spans/buckets sets them in all of the histograms.
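The shallow-copy hazard can be demonstrated with a toy stand-in (this is not Prometheus code; `floatHistogram`, `shallowCopy`, and `deepCopy` are hypothetical names): copying the struct copies only the slice headers, so both copies still point at the same backing arrays.

```go
// Sketch of shallow vs. deep copy of a histogram-like struct with slice fields.
package main

import "fmt"

// floatHistogram is a stand-in for histogram.FloatHistogram, which also
// holds its spans and buckets as slices.
type floatHistogram struct {
	Count   float64
	Buckets []float64
}

// shallowCopy copies the struct header only; Buckets still points at the
// same backing array, so mutations are visible through both copies.
func shallowCopy(src *floatHistogram) *floatHistogram {
	dst := *src // copies Count and the slice header, not the elements
	return &dst
}

// deepCopy also clones the bucket slice, which is what the fix amounts to.
func deepCopy(src *floatHistogram) *floatHistogram {
	dst := *src
	dst.Buckets = append([]float64(nil), src.Buckets...)
	return &dst
}

func main() {
	orig := &floatHistogram{Count: 110, Buckets: []float64{1, 14, 95}}

	shallow := shallowCopy(orig)
	shallow.Buckets[0] = -1      // writes through to orig's backing array
	fmt.Println(orig.Buckets[0]) // -1: orig was corrupted

	orig.Buckets[0] = 1
	deep := deepCopy(orig)
	deep.Buckets[0] = -1
	fmt.Println(orig.Buckets[0]) // 1: orig unaffected
}
```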

I've tested an alternative fix where I patch the MemoizedSeriesIterator:

diff --git a/storage/memoized_iterator.go b/storage/memoized_iterator.go
index 273b3caa1..11b8be7da 100644
--- a/storage/memoized_iterator.go
+++ b/storage/memoized_iterator.go
@@ -133,7 +133,8 @@ func (b *MemoizedSeriesIterator) At() (int64, float64) {
 
 // AtFloatHistogram returns the current float-histogram element of the iterator.
 func (b *MemoizedSeriesIterator) AtFloatHistogram() (int64, *histogram.FloatHistogram) {
-       return b.it.AtFloatHistogram(nil)
+       fh := &histogram.FloatHistogram{}
+       return b.it.AtFloatHistogram(fh)
 }
 
 // AtT returns the timestamp of the current element of the iterator.

Anyway, this is not as efficient as your fix, so I've added it here for demonstration only.

What really threw me was that the error isn't deterministic, even though the code looks deterministic. The only thing running in parallel is the GC, which I thought we didn't really exercise because of the pooling. But setting GOGC=off makes the problem much worse: one iteration of the subquery is enough to trigger it. Even then the bug is non-deterministic, because sync.Pool still does its own garbage collection. Only if I replace sync.Pool with a dumb slice-based implementation do I get a deterministic failure right away.

And the failure is that the pool contains histograms with shared spans/buckets, so when we evaluate the rate, the normal copy takes place from the TSDB and data is copied multiple times into the same memory. The last one wins.

On the other hand, if sync.Pool is faster at letting go of the unused pool item, then we reach the rate with a clean pool and the problem doesn't come up.

I don't really see how we could make the test deterministic, unless we make the use of the pool optional. That could actually make sense in some scenarios, when we don't have to make a copy after all.
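A "dumb" slice-based pool like the one mentioned above might look like this (a hedged sketch, not the code actually used; `slicePool` is a hypothetical name). Unlike sync.Pool, it never drops items, so a poisoned entry is guaranteed to be handed out again, which makes the failure reproduce deterministically.

```go
// Deterministic slice-backed pool: Get always returns the most recently
// Put item if one exists, and items are never discarded by the runtime.
package main

import "fmt"

type slicePool[T any] struct {
	items   []*T
	newItem func() *T
}

// Get pops the last Put item, or allocates a fresh one when empty.
func (p *slicePool[T]) Get() *T {
	if n := len(p.items); n > 0 {
		it := p.items[n-1]
		p.items = p.items[:n-1]
		return it
	}
	return p.newItem()
}

// Put stores the item for reuse; nothing is ever dropped.
func (p *slicePool[T]) Put(it *T) { p.items = append(p.items, it) }

func main() {
	pool := &slicePool[int]{newItem: func() *int { return new(int) }}
	a := pool.Get()
	pool.Put(a)
	b := pool.Get()
	fmt.Println(a == b) // true: reuse is deterministic, unlike sync.Pool
}
```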

Member

@krajorama krajorama left a comment


LGTM. I tried to come up with a simpler test, but it turned out to be too hard, as explained in #16879 (comment).

@krajorama krajorama merged commit b09cf6b into prometheus:main Jul 21, 2025
27 checks passed
@krajorama
Member

krajorama commented Jul 21, 2025

@bboreham suggested checking whether https://pkg.go.dev/testing/synctest has some way of getting rid of the non-determinism in sync.Pool for the test. If not, it could be worth filing an issue on Go.

I've put this on my backlog, but @harry671003, if you have some time, feel free to check it yourself. Note: Prometheus needs to update its Go version first, as this is a new feature since Go 1.24.

@harry671003
Contributor Author

Thanks for the reviews and merge! Appreciate all the insights - learned a lot through this.

Regarding the synctest suggestion, I’ll keep this in mind and once Prometheus upgrades to Go 1.24, I’ll explore whether synctest can help stabilize the test.

tcp13equals2 pushed a commit to tcp13equals2/prometheus that referenced this pull request Aug 18, 2025
…metheus#16879)

Signed-off-by: 🌲 Harry 🌊 John 🏔 <johrry@amazon.com>
Signed-off-by: Andrew Hall <andrew.hall@grafana.com>