Skip to content

Conversation

beorn7
Copy link
Member

@beorn7 beorn7 commented May 7, 2025

As it turns out, if we combine Kahan summation and incremental mean calculation properly, it works quite well and we do not need to switch between simple mean calculation and incremental calculation based on overflow.

This simplifies the code quite a bit.

@beorn7 beorn7 requested a review from roidelapluie as a code owner May 7, 2025 14:01
@beorn7
Copy link
Member Author

beorn7 commented May 7, 2025

Note: This is on top of the beorn7/promql branch to simplify review. I'll rebase once #16566 is merged.

@beorn7
Copy link
Member Author

beorn7 commented May 7, 2025

@crush-on-anechka please have a look.

@machine424
Copy link
Member

machine424 commented May 19, 2025

Thanks for this, I'll look at this.

@machine424 machine424 self-assigned this May 19, 2025
@bboreham
Copy link
Member

Having discussed this when the previous code was added, can you comment whether this argument still holds?

@machine424
Copy link
Member

Apparently @crush-on-anechka (#15687 (comment)) has some extra test cases that we may want to agree on and add before this change.

@beorn7
Copy link
Member Author

beorn7 commented May 29, 2025

Apparently @crush-on-anechka (#15687 (comment)) has some extra test cases that we may want to agree on and add before this change.

I still have to analyze those extra cases. Hopefully I get to it next week.

@beorn7
Copy link
Member Author

beorn7 commented May 29, 2025

Having discussed this when the previous code was added, can you comment whether this argument still holds?

Back then, I assumed incremental mean calculation is just less accurate than non-incremental. Now it has turned out that we just applied Kahan summation in a non-optimal (or arguably wrong) way. When I created #14413, I did not question the existing implementation. To better understand the line of thinking, I'll give a summary of the history:

  1. The initial implementation used simple mean calculation, but no Kahan summation yet.
  2. One day, somebody ran into an overflow of float64 and had the idea that we should use incremental mean calculation. That was clearly better, so we implemented it.
  3. Some time later, somebody ran into float accuracy problems and thought we should better use Kahan summation. Somebody added Kahan summation to both sum and avg (and their _over_time variants). Nothing broke, everyone was happy. However, at this time, the Kahan summation wasn't done in the best possible way (or arguably in a wrong way), which wasn't caught.
  4. Again some time later, somebody noted a scenario where sum worked just fine, but avg didn't. Which was weird because avg is just doing sum first and then divide by a relatively small integer. When I debugged the problem, I blamed the incremental mean calculation as being less accurate, back then assuming that Kahan summation and incremental mean calculation were combined in an appropriate way, and it would be just inherent to the problem that the incremental mean calculation is less accurate, so that the optimal way is to do simple mean calculation if there is no overflow and switch to incremental mean calculation once we run into an overflow. This resulted in promql: more Kahan summation (avg) and less incremental mean calculation (avg, avg_over_time) #14413, which made everyone happy because now the "obvious" case where sum worked but avg not was fixed, while the cases where sum would overflow but avg not (thanks to incremental mean calculation) continued to work as expected.
  5. Again some time later, @crush-on-anechka started to work on Kahan summation for histograms. They raised interesting points, and in the course of the discussion, I started some more fundamental research, which lead me to the revelation that our existing combination of Kahan summation and incremental mean calculation could be improved. Which I did.
  6. With the improved combined Kahan and incremental mean calculation, I could not see any practical differences in accuracy anymore between simple and incremental mean calculation. I don't know the details of the numerical accuracy here well enough to tell if there is maybe still an accuracy advantage of one over the other (it could also be the other way around, of course), but my current thinking is that if there is any, it is probably not very relevant in practice. So I decided to better simplify the code by a good deal (accepting a theoretical increase in compute cost, which again isn't really relevant in practice – remember that we always did incremental calculation historically).

Copy link
Member

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whilst I am happy with the code, I think it would be improved by a little explanation in comments. Perhaps linking to a reference.

@machine424 machine424 removed their assignment Jun 4, 2025
@machine424
Copy link
Member

unassigned myself so it doesn't look like I'm blocking this, I can always take a look post-merge.

Base automatically changed from beorn7/promql to main June 5, 2025 14:45
@beorn7
Copy link
Member Author

beorn7 commented Jun 5, 2025

I have now incorporated @crush-on-anechka's test cases from #15687 (in this comment).

With my proposed fixed, the two most interesting ones are metric10 and metric11:

  • For metric10, the incremental mean calculation (as introduced in this PR) return a pretty precise result, while the simple mean calculation (prior to this PR) is quite a bit off.
  • metric11 throws off Kahan summation completely. (Broadly simplified, if you handle large numbers of significantly different orders of magnitudes, they will wash out the small values from the Kahan summation term.) Both simple and incremental mean calculation are far off. Naively, one might say the incremental calculation is way worse (-1.881783551706252e+203 instead of -44.848083237000004), but the plain zero calculated by simple mean calculation would be for worse on a logarithmic scale.

The metric11 case is handled just fine with proper statistics tools (e.g. the Python3 statistics package). However, it would require processing the values in a certain order, which is hard to implement, given how the PromQL engine works. I would propose to not invest that effort until we see practical relevance of those cases. (Even the cases that we used to justify Kahan summation are already fairly contrived.)

As it turns out, if we combine Kahan summation and incremental mean
calculation properly, it works quite well and we do not need to switch
between simple mean calculation and incremental calculation based on
overflow.

This simplifies the code quite a bit.

Signed-off-by: beorn7 <beorn@grafana.com>
@beorn7
Copy link
Member Author

beorn7 commented Jun 5, 2025

I have now also added a note as suggested by @bboreham .

@beorn7
Copy link
Member Author

beorn7 commented Jun 5, 2025

Another insight (reflected in the note added): For avg_over_time, the promql engine gives us all the relevant values in a slice, so we could even play the tricks that statistics libraries do for a more accurate mean calculation (and also sum_over_time of course). But it would be more expensive, and it would break consistency with the behavior of the aggregation operators, where it would be much harder to control the order of processing samples.

Copy link
Member

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you.

These tests were initially created by @crush-on-anechka. I modified
them slightly.

Signed-off-by: beorn7 <beorn@grafana.com>
@beorn7 beorn7 merged commit 8fc1750 into main Jun 6, 2025
45 checks passed
@beorn7 beorn7 deleted the beorn7/promql2 branch June 6, 2025 22:16
@bboreham
Copy link
Member

bboreham commented Jun 9, 2025

Using go version go1.24.2 darwin/arm64 I get test failures from the merged repo at commit 8fc1750:

--- FAIL: TestEvaluations (1.48s)
    --- FAIL: TestEvaluations/testdata/aggregators.test (0.14s)
        --- FAIL: TestEvaluations/testdata/aggregators.test/line_596/avg(data{test="bigzero"}) (0.00s)
            test.go:1321: 
                        Error Trace:    /Users/bryan/src/github.com/prometheus/prometheus/promql/promqltest/test.go:1321
                        Error:          Received unexpected error:
                                        error in eval avg(data{test="bigzero"}) (line 596): expected 0 for {} but got 2.4948003869184e+291
                        Test:           TestEvaluations/testdata/aggregators.test/line_596/avg(data{test="bigzero"})
    --- FAIL: TestEvaluations/testdata/functions.test (0.47s)
        --- FAIL: TestEvaluations/testdata/functions.test/line_1013/avg_over_time(metric[2m]) (0.00s)
            test.go:1321: 
                        Error Trace:    /Users/bryan/src/github.com/prometheus/prometheus/promql/promqltest/test.go:1321
                        Error:          Received unexpected error:
                                        error in eval avg_over_time(metric[2m]) (line 1013): expected 0.5 for {} but got -1.3877787807814457e+83
                        Test:           TestEvaluations/testdata/functions.test/line_1013/avg_over_time(metric[2m])
        --- FAIL: TestEvaluations/testdata/functions.test/line_1085/avg_over_time(metric11[1m]) (0.00s)
            test.go:1321: 
                        Error Trace:    /Users/bryan/src/github.com/prometheus/prometheus/promql/promqltest/test.go:1321
                        Error:          Received unexpected error:
                                        error in eval avg_over_time(metric11[1m]) (line 1085): expected -1.881783551706252e+203 for {} but got 2.303079268822384e+202
                        Test:           TestEvaluations/testdata/functions.test/line_1085/avg_over_time(metric11[1m])
FAIL
FAIL    github.com/prometheus/prometheus/promql 5.077s

I don't get these at 6d10fce.

@beorn7
Copy link
Member Author

beorn7 commented Jun 10, 2025

Thanks, I'll look into it.

@beorn7
Copy link
Member Author

beorn7 commented Jun 10, 2025

Note for my investigation: Test pass with go1.24.1 linux/amd64. Will update to go1.24.2 and try again. (But I guess this could totally be arch dependent.)

@beorn7
Copy link
Member Author

beorn7 commented Jun 10, 2025

Tests pass with go1.24.4 linux/amd64, too.

@beorn7
Copy link
Member Author

beorn7 commented Jun 10, 2025

So the last failure is not very surprising because that's the test that goes wrong, and I just put in the wrong result to satisfy the test, but it's not surprising that different architectures get different wrong results.

The other two failures are tough, though. The 1st one is overflowing float64 if you just sum, so you have to use incremental mean calculation, and it is weird that the previous "wronger" version of the incremental mean calculation got the right result on all platforms. The 2nd one is relatively easy and should "just work". I'm very curious where things go wrong here, but I don't have access to a darwin/arm64 system to debug in detail.

I'll file a separate bug so that we can track debugging over there.

@beorn7
Copy link
Member Author

beorn7 commented Jun 10, 2025

#16714 is the bug.

charleskorn added a commit to charleskorn/prometheus that referenced this pull request Jun 24, 2025
…omql2"

This reverts commit 8fc1750, reversing
changes made to 6d10fce.

# Conflicts:
#	promql/functions.go
#	promql/promqltest/testdata/functions.test
charleskorn added a commit to charleskorn/prometheus that referenced this pull request Jun 24, 2025
…omql2"

This reverts commit 8fc1750, reversing
changes made to 6d10fce.

# Conflicts:
#	promql/functions.go
#	promql/promqltest/testdata/functions.test
beorn7 added a commit that referenced this pull request Jun 24, 2025
The test in question actually worked fine even before #16569. The
finding reported in the comment has turned out to be caused by
something else.

Signed-off-by: beorn7 <beorn@grafana.com>
beorn7 added a commit that referenced this pull request Jun 24, 2025
This commit brings back direct mean calculation (for `avg` and
`avg_over_time`) but isn't an outright revert of #16569. It keeps the
improved incremental mean calculation and features generally a bit
cleaner code than before.

Also, this commit...

- ...updates the lengthy comment explaining the whole situation and
  trade-offs.

- ...divides the running sum and the Kahan compensation term
  separately (in direct mean calculation) to avoid the (unlikely)
  possibility that sum and Kahan compensation together ovorflow
  float64.

- ...uncomments the tests that should now work again on darwin/arm64.

- ...uncomments the test that should now reliably yield the
  (inaccurate) value 0 on all hardware platforms. Also, the test
  description has been updated accordingly.

Signed-off-by: beorn7 <beorn@grafana.com>
beorn7 added a commit that referenced this pull request Jun 25, 2025
This commit brings back direct mean calculation (for `avg` and
`avg_over_time`) but isn't an outright revert of #16569. It keeps the
improved incremental mean calculation and features generally a bit
cleaner code than before.

Also, this commit...

- ...updates the lengthy comment explaining the whole situation and
  trade-offs.

- ...divides the running sum and the Kahan compensation term
  separately (in direct mean calculation) to avoid the (unlikely)
  possibility that sum and Kahan compensation together ovorflow
  float64.

- ...uncomments the tests that should now work again on darwin/arm64.

- ...uncomments the test that should now reliably yield the
  (inaccurate) value 0 on all hardware platforms. Also, the test
  description has been updated accordingly.

- ...adds avg_over_time tests for zero and one sample in the range.

Signed-off-by: beorn7 <beorn@grafana.com>
beorn7 added a commit that referenced this pull request Jun 26, 2025
The test in question actually worked fine even before #16569. The
finding reported in the comment has turned out to be caused by
something else.

Signed-off-by: beorn7 <beorn@grafana.com>
beorn7 added a commit that referenced this pull request Jun 26, 2025
This commit brings back direct mean calculation (for `avg` and
`avg_over_time`) but isn't an outright revert of #16569. It keeps the
improved incremental mean calculation and features generally a bit
cleaner code than before.

Also, this commit...

- ...updates the lengthy comment explaining the whole situation and
  trade-offs.

- ...divides the running sum and the Kahan compensation term
  separately (in direct mean calculation) to avoid the (unlikely)
  possibility that sum and Kahan compensation together ovorflow
  float64.

- ...uncomments the tests that should now work again on darwin/arm64.

- ...uncomments the test that should now reliably yield the
  (inaccurate) value 0 on all hardware platforms. Also, the test
  description has been updated accordingly.

- ...adds avg_over_time tests for zero and one sample in the range.

Signed-off-by: beorn7 <beorn@grafana.com>
beorn7 added a commit that referenced this pull request Jun 27, 2025
The test in question actually worked fine even before #16569. The
finding reported in the comment has turned out to be caused by
something else.

Signed-off-by: beorn7 <beorn@grafana.com>
beorn7 added a commit that referenced this pull request Jun 27, 2025
This commit brings back direct mean calculation (for `avg` and
`avg_over_time`) but isn't an outright revert of #16569. It keeps the
improved incremental mean calculation and features generally a bit
cleaner code than before.

Also, this commit...

- ...updates the lengthy comment explaining the whole situation and
  trade-offs.

- ...divides the running sum and the Kahan compensation term
  separately (in direct mean calculation) to avoid the (unlikely)
  possibility that sum and Kahan compensation together ovorflow
  float64.

- ...uncomments the tests that should now work again on darwin/arm64.

- ...uncomments the test that should now reliably yield the
  (inaccurate) value 0 on all hardware platforms. Also, the test
  description has been updated accordingly.

- ...adds avg_over_time tests for zero and one sample in the range.

Signed-off-by: beorn7 <beorn@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants