
Conversation

harry671003
Contributor

@harry671003 harry671003 commented Jul 16, 2025

Notes

While testing the Thanos PromQL engine, I found a very subtle bug in the Prometheus engine. The native histogram (NH) samples returned from the subquery's iterator are not copied, which causes the Prometheus engine to return incorrect query results.
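To make the failure mode concrete, here is a self-contained sketch (toy types, not the actual Prometheus code; `reusingIter` and `collect` are hypothetical names) of this bug class: an iterator that decodes into a single reused buffer, and a caller that keeps the returned pointers without copying, so every buffered sample ends up aliasing the last value.

```go
// Hedged sketch of the "iterator reuses its buffer, caller must copy" bug class.
package main

import "fmt"

type hist struct{ count float64 }

// reusingIter mimics an iterator that decodes every sample into one reused object.
type reusingIter struct {
	data []float64
	i    int
	buf  hist
}

func (it *reusingIter) Next() bool { it.i++; return it.i <= len(it.data) }

// At returns a pointer into the iterator's reused buffer.
func (it *reusingIter) At() *hist {
	it.buf.count = it.data[it.i-1]
	return &it.buf
}

// collect buffers all samples; with copySamples=false it keeps the raw
// pointers (the bug), with copySamples=true it copies first (the fix).
func collect(it *reusingIter, copySamples bool) []float64 {
	var out []*hist
	for it.Next() {
		h := it.At()
		if copySamples {
			c := *h // copy before buffering
			h = &c
		}
		out = append(out, h)
	}
	counts := make([]float64, len(out))
	for i, h := range out {
		counts[i] = h.count
	}
	return counts
}

func main() {
	// Without copying, all buffered pointers alias the iterator's buffer,
	// so every element reflects the last decoded sample.
	fmt.Println(collect(&reusingIter{data: []float64{1, 2, 3}}, false)) // [3 3 3]
	fmt.Println(collect(&reusingIter{data: []float64{1, 2, 3}}, true))  // [1 2 3]
}
```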

Reproducing the bug

I originally discovered this bug by comparing the results against the Thanos engine, and I've added a test in this PR.
The bug occurs when a subquery is run before other queries.

In the test below, when the subquery min_over_time({__name__="http_request_duration_seconds"}[1h:1m]) is included, the results of rate() and delta() are wrong. I suspect this is because of array reuse in matrixIterSlice() (and possibly other places).

func TestQueriesAgainstThanosEngine(t *testing.T) {
	start := time.Unix(0, 0)
	end := time.Unix(144, 0)
	step := time.Second * 3
	// Negative offset and at modifier are enabled by default
	// since Prometheus v2.33.0 so we also enable them.
	opts := promql.EngineOpts{
		Timeout:              1 * time.Hour,
		MaxSamples:           1e10,
		EnableNegativeOffset: true,
		EnableAtModifier:     true,
	}

	load := `load 2m
			http_request_duration_seconds{pod="nginx-1"} {{schema:0 count:110 sum:818.00 buckets:[1 14 95]}}+{{schema:0 count:110 buckets:[1 14 95]}}x20
			http_request_duration_seconds{pod="nginx-2"} {{schema:0 count:210 sum:1598.00 buckets:[1 19 190]}}+{{schema:0 count:210 buckets:[1 19 190]}}x30`

	queryStrings := []string{
		`min_over_time({__name__="http_request_duration_seconds"}[1h:1m])`,
		`rate({__name__="http_request_duration_seconds"}[3m] offset -16s)`,
		`delta({__name__="http_request_duration_seconds"}[3m] offset -16s)`,
	}

	expected := make([]*promql.Result, len(queryStrings))

	thanosEngine := engine.New(engine.Opts{EngineOpts: opts})
	queryable := promqltest.LoadedStorage(t, load)
	t.Cleanup(func() {
		queryable.Close()
	})

	ctx := context.Background()
	for j, qs := range queryStrings {
		thanosQuery, err := thanosEngine.NewRangeQuery(ctx, queryable, nil, qs, start, end, step)
		testutil.Ok(t, err)
		thanosResult := thanosQuery.Exec(ctx)
		expected[j] = thanosResult
		t.Cleanup(func() {
			thanosQuery.Close()
		})
	}

	for i := 0; i < 1000; i++ {
		for j, qs := range queryStrings {
			t.Run(fmt.Sprintf("%s_%d_%d", "run", i, j), func(t *testing.T) {
				queryable2 := promqltest.LoadedStorage(t, load)
				defer queryable2.Close()
				promEngine := promql.NewEngine(opts)
				promQuery, err := promEngine.NewRangeQuery(ctx, queryable2, nil, qs, start, end, step)
				testutil.Ok(t, err)
				defer promQuery.Close()
				promResult := promQuery.Exec(ctx)

				testutil.WithGoCmp(comparer).Equals(t, promResult, expected[j])
			})
		}
	}
}

Failure

Diff:

        --- Expected
        +++ Actual
        @@ -1,31 +1,31 @@
         (*promql.Result)({pod="nginx-1"} =>
        -{count:0, sum:0} @[105000]
        -{count:0, sum:0} @[108000]
        -{count:0, sum:0} @[111000]
        -{count:0, sum:0} @[114000]
        -{count:0, sum:0} @[117000]
        -{count:0, sum:0} @[120000]
        -{count:0, sum:0} @[123000]
        +{count:0.9166666666666666, sum:0, (0.5,1]:0.008333333333333333, (1,2]:0.11666666666666667, (2,4]:0.7916666666666666} @[105000]
        +{count:0.9166666666666666, sum:0, (0.5,1]:0.008333333333333333, (1,2]:0.11666666666666667, (2,4]:0.7916666666666666} @[108000]
        +{count:0.9166666666666666, sum:0, (0.5,1]:0.008333333333333333, (1,2]:0.11666666666666667, (2,4]:0.7916666666666666} @[111000]
        +{count:0.9166666666666666, sum:0, (0.5,1]:0.008333333333333333, (1,2]:0.11666666666666667, (2,4]:0.7916666666666666} @[114000]
        +{count:0...(output trimmed)

Signed-off-by: 🌲 Harry 🌊 John 🏔 <johrry@amazon.com>
@harry671003 harry671003 force-pushed the fix_series_iterator branch from df388e6 to ae8ed03 Compare July 16, 2025 03:54
@harry671003
Contributor Author

Hi @beorn7 @fpetkovski,
Since you both last worked on this part of the code, I wanted to check if you could kindly take a look at this PR. I’d appreciate any feedback when you get a chance. Thanks!

@beorn7 beorn7 requested review from krajorama and removed request for roidelapluie July 19, 2025 17:30
@beorn7
Member

beorn7 commented Jul 19, 2025

Thanks for catching this. We suspected there are more cases where we don't copy enough; see #16581.

But it needs some investigation whether this PR is the right way to solve this particular instance.

@krajorama might have context on this (more than I do, actually). I'll preliminarily assign this to him.

I'm out sick, so it will take some time before I can look at this myself.

@krajorama
Member

I literally made the same patch more than a year ago, for a different reason: https://github.com/prometheus/prometheus/pull/13659/files 😞

I don't remember exactly why we abandoned that PR, but I'm sure we thought we could sort out the copy/don't-copy problem later, as described in the spec and #16581.

To me, copying is always safe, and the code is the same as mine, so I'm basically just reviewing the test part.

@krajorama
Member

I've used the debugger to step through subquery evaluation, and the histograms returned from the storageSeriesIterator are indeed going directly into the pool used by matrixIterSlice, which means the pool is poisoned.

We could also fix the issue in matrixIterSlice by checking whether the pointer the iterator returned is the same as or different from the pooled one. But that's not the current pattern, and I'd rather stay consistent than complicate the code further. Using the pool also means this fix won't add much overhead.
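The "poisoned pool" described above can be shown with a self-contained sketch (toy types, not Prometheus code; `demoPoisoning` is a hypothetical name): an object that is still referenced by the query results is also Put back into a sync.Pool, so the next Get hands out memory someone else is still reading.

```go
// Minimal sketch of sync.Pool poisoning: Put an object that is still reachable,
// then watch a later consumer's write show through the "finished" result.
package main

import (
	"fmt"
	"sync"
)

// sample stands in for a pooled histogram buffer.
type sample struct{ v float64 }

var pool = sync.Pool{New: func() any { return &sample{} }}

// demoPoisoning returns the value observed through the buffered result
// after the pool recycled its memory for a new consumer.
func demoPoisoning() float64 {
	s := pool.Get().(*sample)
	s.v = 1
	results := []*sample{s} // the result set still references s

	pool.Put(s) // bug: s is Put back while still reachable via results

	next := pool.Get().(*sample) // single goroutine: the pool hands s back
	next.v = 42                  // silently rewrites the buffered result

	return results[0].v
}

func main() {
	fmt.Println(demoPoisoning()) // 42: the earlier result was corrupted
}
```

This also hints at why the behavior depends on garbage collection: sync.Pool may drop items across GC cycles, in which case Get returns a fresh object and the corruption never surfaces.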

@krajorama
Member

I've spent half a day trying to find a simpler test that triggers the error, but the failure isn't as simple as setting the same pointer twice in the matrixSelectorHPool pool in matrixIterSlice.

As far as I understand, the problem really starts in MemoizedSeriesIterator, which is used when evaluating the subquery. This iterator calls into chunkenc.FloatHistogram.AtFloatHistogram, which does a shallow copy of the histogram. That means the result of the subquery contains histograms that share the same physical spans and buckets, so setting any of the spans/buckets sets them in all of the histograms.
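The shallow-copy hazard can be demonstrated with a toy stand-in (this is not Prometheus code; `floatHistogram`, `shallowCopy`, and `deepCopy` are hypothetical names): copying the struct copies only the slice headers, so both copies still point at the same backing arrays.

```go
// Sketch of shallow vs. deep copy of a histogram-like struct with slice fields.
package main

import "fmt"

// floatHistogram is a stand-in for histogram.FloatHistogram, which also
// holds its spans and buckets as slices.
type floatHistogram struct {
	Count   float64
	Buckets []float64
}

// shallowCopy copies the struct header only; Buckets still points at the
// same backing array, so mutations are visible through both copies.
func shallowCopy(src *floatHistogram) *floatHistogram {
	dst := *src // copies Count and the slice header, not the elements
	return &dst
}

// deepCopy also clones the bucket slice, which is what the fix amounts to.
func deepCopy(src *floatHistogram) *floatHistogram {
	dst := *src
	dst.Buckets = append([]float64(nil), src.Buckets...)
	return &dst
}

func main() {
	orig := &floatHistogram{Count: 110, Buckets: []float64{1, 14, 95}}

	shallow := shallowCopy(orig)
	shallow.Buckets[0] = -1      // writes through to orig's backing array
	fmt.Println(orig.Buckets[0]) // -1: orig was corrupted

	orig.Buckets[0] = 1
	deep := deepCopy(orig)
	deep.Buckets[0] = -1
	fmt.Println(orig.Buckets[0]) // 1: orig unaffected
}
```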

I've tested an alternative fix where I patch the MemoizedSeriesIterator:

diff --git a/storage/memoized_iterator.go b/storage/memoized_iterator.go
index 273b3caa1..11b8be7da 100644
--- a/storage/memoized_iterator.go
+++ b/storage/memoized_iterator.go
@@ -133,7 +133,8 @@ func (b *MemoizedSeriesIterator) At() (int64, float64) {
 
 // AtFloatHistogram returns the current float-histogram element of the iterator.
 func (b *MemoizedSeriesIterator) AtFloatHistogram() (int64, *histogram.FloatHistogram) {
-       return b.it.AtFloatHistogram(nil)
+       fh := &histogram.FloatHistogram{}
+       return b.it.AtFloatHistogram(fh)
 }
 
 // AtT returns the timestamp of the current element of the iterator.

Anyway, this is not as efficient as your fix, so I've added it here for demonstration only.

What really threw me was that the error isn't deterministic, even though the code looks deterministic. The only thing running in parallel is the GC, which I thought we didn't really exercise because of the pooling. But setting GOGC=off makes the problem much worse: one iteration of the subquery is enough to trigger it. Even then the bug is non-deterministic, because sync.Pool still does its own garbage collection. Only if I replace sync.Pool with a dumb slice-based implementation do I get a deterministic failure right away.

And the failure is that the pool contains histograms with shared spans/buckets, so when we evaluate the rate, the normal copy takes place from the TSDB and data is copied multiple times into the same memory. The last one wins.

On the other hand, if sync.Pool is faster at letting go of the unused pool item, then we reach the rate with a clean pool and the problem doesn't come up.

I don't really see how we could make the test deterministic, unless we make the use of the pool optional. That could actually make sense in some scenarios, when we don't have to make a copy after all.
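A "dumb" slice-based pool like the one mentioned above might look like this (a hedged sketch, not the code actually used; `slicePool` is a hypothetical name). Unlike sync.Pool, it never drops items, so a poisoned entry is guaranteed to be handed out again, which makes the failure reproduce deterministically.

```go
// Deterministic slice-backed pool: Get always returns the most recently
// Put item if one exists, and items are never discarded by the runtime.
package main

import "fmt"

type slicePool[T any] struct {
	items   []*T
	newItem func() *T
}

// Get pops the last Put item, or allocates a fresh one when empty.
func (p *slicePool[T]) Get() *T {
	if n := len(p.items); n > 0 {
		it := p.items[n-1]
		p.items = p.items[:n-1]
		return it
	}
	return p.newItem()
}

// Put stores the item for reuse; nothing is ever dropped.
func (p *slicePool[T]) Put(it *T) { p.items = append(p.items, it) }

func main() {
	pool := &slicePool[int]{newItem: func() *int { return new(int) }}
	a := pool.Get()
	pool.Put(a)
	b := pool.Get()
	fmt.Println(a == b) // true: reuse is deterministic, unlike sync.Pool
}
```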

Member

@krajorama krajorama left a comment


LGTM. I tried to come up with a simpler test, but it turned out to be too hard, as explained in #16879 (comment).

@krajorama krajorama merged commit b09cf6b into prometheus:main Jul 21, 2025
27 checks passed
@krajorama
Member

krajorama commented Jul 21, 2025

@bboreham suggested checking whether https://pkg.go.dev/testing/synctest has some way of getting rid of the non-determinism in sync.Pool for the test. If not, it could be worth filing an issue on Go.

I've put this on my backlog, but @harry671003, if you have some time, feel free to check it yourself. Note: Prometheus needs to update its Go version first, as this is a new feature since Go 1.24.

@harry671003
Contributor Author

Thanks for the reviews and merge! Appreciate all the insights - learned a lot through this.

Regarding the synctest suggestion, I’ll keep this in mind and once Prometheus upgrades to Go 1.24, I’ll explore whether synctest can help stabilize the test.

tcp13equals2 pushed a commit to tcp13equals2/prometheus that referenced this pull request Aug 18, 2025
…metheus#16879)

Signed-off-by: 🌲 Harry 🌊 John 🏔 <johrry@amazon.com>
Signed-off-by: Andrew Hall <andrew.hall@grafana.com>