Skip to content

Conversation

krajorama
Copy link
Member

@krajorama krajorama commented Jul 26, 2024

Description of problem

Fix bug in compaction of overlapping chunks: if you have two identical histogram chunks A and B , the merge over them produces a single chunk C that has the same data as A and B.

Now add one sample to chunk B without creating a new chunk and call it B'. The merge produces a list of samples that contains all common elements plus the new one at the end.

And you would think that the merge produces a single chunk equal to B' , however that's not true in all cases!

If the extra histogram added to B has a new bucket, then B is recoded into B' to contain the new bucket. Which means that iterating over B' yields samples that have en extra empty bucket.

When we merge the chunks A, B' the histogram samples can come from either A or B' randomly(*) and if we get a sample from B' and then from A, then the sample from A is now not appendable to the result and in fact causes a new chunk and a counter reset!

The real case where I've seen it is a bit more complicated, but the upshot is that we see spikes in the rate/increase of the data after compaction.

Solution

Accept this case in appendable: previous histogram has extra empty buckets. This should be appendable and append should just fill in the empty buckets in the incoming histogram.

I ended up making two functions: expandIntSpansAndBuckets which is a merge of counterResetInAnyBucket and expandSpansForward AND expandFloatSpansAndBuckets which is a merge of counterResetInAnyFloatBucket and expandSpansForward . The merge was pretty straightforward, however it didn't allow for a single generic function. Also the new functions have repeated code, but I couldn't inline it without destroying performance.

Alternate solution discarded 1

I was thinking that maybe selecting the source chunk wisely in a merge to always choose the sample with most buckets would be a fix, but that's not going to work. Imagine chunk B' from time T1 to T3 that was recoded and has "big" histograms due to a "big" sample at time T3 (big=has more buckets). However for whatever reason it lacks a sample at time T2, which another chunk A has and that sample will be "small" , causing the counter reset on merge. Since B' doesn't have a sample at T2, there's no way to choose the bigger sample.

Alternate solution discarded 2

I'm not 100% sure we should discard this.

When returning histograms from a chunk iterator, do not return empty buckets. So basically sanitize the buckets. This means adding extra iteration and check on every readout :(

Alternate solution discarded 3

Add h = h.Copy().Compact(0) before line

t, h = seriesIter.AtHistogram(nil)
. This would mean making a copy for every histogram. Doubles the compaction time pretty much.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama krajorama requested a review from jesusvazquez as a code owner July 26, 2024 08:00
@krajorama krajorama changed the title Native hisotgram: fix spurios counter reset when merging recoded chunk to normal chunk Native histograms: fix spurios counter reset when merging recoded chunk to normal chunk Jul 26, 2024
@krajorama krajorama marked this pull request as draft July 26, 2024 08:10
Allow appending to chunks when the histogram to be added is missing
some buckets, but the missing buckets are empty in the chunk.
For example bucket at index 5 is present in the chunk, but its value
is 0 and the new histogram doesn't have a bucket at index 5.

This fixes an issue of merging chunks where one chunk was recoded to
retroactively have some empty buckets in all the histograms and we are
merging in a histogram that doesn't have the empty bucket (because it
was not recoded yet).

The operation alters the histogram that is being added, however this has
already been the case when appending gauge histograms. Thus the test
TestHistogramSeriesToChunks in storage package is changed to explicitly
test what happened to the appended histogram - Compact(0) call is removed.

The new expandIntSpansAndBuckets and expandFloatSpansAndBuckets functions
are a merge of expandSpansForward and counterResetInAnyBucket and
counterResetInAnyFloatBucket.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama krajorama force-pushed the fix-histogram-overlap-compaction-bug branch from 493c5fb to 265655d Compare July 28, 2024 13:33
@krajorama krajorama marked this pull request as ready for review July 28, 2024 13:36
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama added a commit to grafana/mimir-prometheus that referenced this pull request Jul 29, 2024
prometheus/prometheus#14513

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama added a commit to grafana/mimir that referenced this pull request Jul 29, 2024
prometheus/prometheus#14513

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama krajorama closed this Jul 30, 2024
@krajorama krajorama reopened this Jul 30, 2024
Copy link
Member

@jesusvazquez jesusvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I did not see anything that was clearly wrong and I appreciate the effort in writing the extensive description of the pull request. The topic is a bit complex and that guided me a bit.

If @fionaliao @carrieedwards or @zenador have a chance to have a look I'd appreciate that as well since they are more familiar with this than I am and also because maybe this also affects the out of order implementation.

@@ -1368,3 +1447,50 @@ func TestHistogramAppendOnlyErrors(t *testing.T) {
require.EqualError(t, err, "histogram counter reset")
})
}

func BenchmarkAppendable(b *testing.B) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you post these benchmark results?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

             │ /tmp/old.bench │           /tmp/new.bench            │
             │     sec/op     │   sec/op     vs base                │
Appendable-8      173.1µ ± 2%   146.1µ ± 1%  -15.61% (p=0.000 n=10)

             │ /tmp/old.bench │           /tmp/new.bench           │
             │      B/op      │    B/op     vs base                │
Appendable-8       172.0 ± 1%   145.0 ± 1%  -15.70% (p=0.000 n=10)

             │ /tmp/old.bench │         /tmp/new.bench         │
             │   allocs/op    │ allocs/op   vs base            │
Appendable-8       0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

Copy link
Contributor

@fionaliao fionaliao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +358 to +359
var aInter Insert
var bInter Insert
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these called aInter/bInter rather than aInsert/bInsert?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historical reasons:

No idea, but I wanted to keep the resemblance to the original

@krajorama krajorama merged commit 00ab05c into prometheus:main Aug 1, 2024
45 checks passed
krajorama added a commit to grafana/mimir that referenced this pull request Aug 1, 2024
For prometheus/prometheus#14513

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama added a commit to grafana/mimir that referenced this pull request Aug 1, 2024
For prometheus/prometheus#14513

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama added a commit that referenced this pull request Jul 24, 2025
* test(chunkenc): appending histograms with empty buckets and gaps

Append such native histograms that have empty buckets and gaps
between the indexes of those buckets.

There is a special case for appending counter native histograms to a chunk in TSDB: if we append a histogram that is missing some buckets that are already in chunk, then usually that's a counter reset. However if the missing bucket is empty, meaning its value is 0, then we don't consider it missing.

For this case to trigger , we need to write empty buckets into the chunk. Normally native histograms are compacted when we emit them , so this is very rare and compact make sure that there are no multiple continuous empty buckets with gaps between them.

The code that I've added in #14513 did not take into account that you can bypass compact and write histograms with many empty buckets, with gaps between them. These are still valid, so the code has to account for them.

Main fix in the expandIntSpansAndBuckets and expandFloatSpansAndBuckets function. I've also refactored them for clarity. Consequently needed to fix insert and adjustForInserts to also allow gaps between inserts.

I've added some new test cases (data driven test would be nice here, too many cases). And removed the deprecated old function that didn't deal with empty buckets at all.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>
Co-authored-by: Björn Rabenstein <beorn@grafana.com>
thc1006 added a commit to thc1006/prometheus that referenced this pull request Jul 27, 2025
)

* test(chunkenc): appending histograms with empty buckets and gaps

Append such native histograms that have empty buckets and gaps
between the indexes of those buckets.

There is a special case for appending counter native histograms to a chunk in TSDB: if we append a histogram that is missing some buckets that are already in chunk, then usually that's a counter reset. However if the missing bucket is empty, meaning its value is 0, then we don't consider it missing.

For this case to trigger , we need to write empty buckets into the chunk. Normally native histograms are compacted when we emit them , so this is very rare and compact make sure that there are no multiple continuous empty buckets with gaps between them.

The code that I've added in prometheus#14513 did not take into account that you can bypass compact and write histograms with many empty buckets, with gaps between them. These are still valid, so the code has to account for them.

Main fix in the expandIntSpansAndBuckets and expandFloatSpansAndBuckets function. I've also refactored them for clarity. Consequently needed to fix insert and adjustForInserts to also allow gaps between inserts.

I've added some new test cases (data driven test would be nice here, too many cases). And removed the deprecated old function that didn't deal with empty buckets at all.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>
Co-authored-by: Björn Rabenstein <beorn@grafana.com>
tcp13equals2 pushed a commit to tcp13equals2/prometheus that referenced this pull request Aug 18, 2025
)

* test(chunkenc): appending histograms with empty buckets and gaps

Append such native histograms that have empty buckets and gaps
between the indexes of those buckets.

There is a special case for appending counter native histograms to a chunk in TSDB: if we append a histogram that is missing some buckets that are already in chunk, then usually that's a counter reset. However if the missing bucket is empty, meaning its value is 0, then we don't consider it missing.

For this case to trigger , we need to write empty buckets into the chunk. Normally native histograms are compacted when we emit them , so this is very rare and compact make sure that there are no multiple continuous empty buckets with gaps between them.

The code that I've added in prometheus#14513 did not take into account that you can bypass compact and write histograms with many empty buckets, with gaps between them. These are still valid, so the code has to account for them.

Main fix in the expandIntSpansAndBuckets and expandFloatSpansAndBuckets function. I've also refactored them for clarity. Consequently needed to fix insert and adjustForInserts to also allow gaps between inserts.

I've added some new test cases (data driven test would be nice here, too many cases). And removed the deprecated old function that didn't deal with empty buckets at all.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>
Co-authored-by: Björn Rabenstein <beorn@grafana.com>
Signed-off-by: Andrew Hall <andrew.hall@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants