-
Notifications
You must be signed in to change notification settings - Fork 9.7k
promql: Simplify avg aggregation and avg_over_time #16569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Note: This is on top of the beorn7/promql branch to simplify review. I'll rebase once #16566 is merged. |
@crush-on-anechka please have a look. |
Thanks for this, I'll look at this. |
Having discussed this when the previous code was added, can you comment whether this argument still holds? |
Apparently @crush-on-anechka (#15687 (comment)) has some extra test cases that we may want to agree on and add before this change. |
I still have to analyze those extra cases. Hopefully I get to it next week. |
Back then, I assumed incremental mean calculation is just less accurate than non-incremental. Now it has turned out that we just applied Kahan summation in a non-optimal (or arguably wrong) way. When I created #14413, I did not question the existing implementation. To better understand the line of thinking, I'll give a summary of the history:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whilst I am happy with the code, I think it would be improved by a little explanation in comments. Perhaps linking to a reference.
unassigned myself so it doesn't look like I'm blocking this, I can always take a look post-merge. |
I have now incorporated @crush-on-anechka's test cases from #15687 (in this comment). With my proposed fixed, the two most interesting ones are
The |
As it turns out, if we combine Kahan summation and incremental mean calculation properly, it works quite well and we do not need to switch between simple mean calculation and incremental calculation based on overflow. This simplifies the code quite a bit. Signed-off-by: beorn7 <beorn@grafana.com>
I have now also added a note as suggested by @bboreham . |
Another insight (reflected in the note added): For |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thank you.
These tests were initially created by @crush-on-anechka. I modified them slightly. Signed-off-by: beorn7 <beorn@grafana.com>
Using
I don't get these at 6d10fce. |
Thanks, I'll look into it. |
Note for my investigation: Test pass with |
Tests pass with |
So the last failure is not very surprising because that's the test that goes wrong, and I just put in the wrong result to satisfy the test, but it's not surprising that different architectures get different wrong results. The other two failures are tough, though. The 1st one is overflowing float64 if you just sum, so you have to use incremental mean calculation, and it is weird that the previous "wronger" version of the incremental mean calculation got the right result on all platforms. The 2nd one is relatively easy and should "just work". I'm very curious where things go wrong here, but I don't have access to a darwin/arm64 system to debug in detail. I'll file a separate bug so that we can track debugging over there. |
#16714 is the bug. |
The test in question actually worked fine even before #16569. The finding reported in the comment has turned out to be caused by something else. Signed-off-by: beorn7 <beorn@grafana.com>
This commit brings back direct mean calculation (for `avg` and `avg_over_time`) but isn't an outright revert of #16569. It keeps the improved incremental mean calculation and features generally a bit cleaner code than before. Also, this commit... - ...updates the lengthy comment explaining the whole situation and trade-offs. - ...divides the running sum and the Kahan compensation term separately (in direct mean calculation) to avoid the (unlikely) possibility that sum and Kahan compensation together ovorflow float64. - ...uncomments the tests that should now work again on darwin/arm64. - ...uncomments the test that should now reliably yield the (inaccurate) value 0 on all hardware platforms. Also, the test description has been updated accordingly. Signed-off-by: beorn7 <beorn@grafana.com>
This commit brings back direct mean calculation (for `avg` and `avg_over_time`) but isn't an outright revert of #16569. It keeps the improved incremental mean calculation and features generally a bit cleaner code than before. Also, this commit... - ...updates the lengthy comment explaining the whole situation and trade-offs. - ...divides the running sum and the Kahan compensation term separately (in direct mean calculation) to avoid the (unlikely) possibility that sum and Kahan compensation together ovorflow float64. - ...uncomments the tests that should now work again on darwin/arm64. - ...uncomments the test that should now reliably yield the (inaccurate) value 0 on all hardware platforms. Also, the test description has been updated accordingly. - ...adds avg_over_time tests for zero and one sample in the range. Signed-off-by: beorn7 <beorn@grafana.com>
The test in question actually worked fine even before #16569. The finding reported in the comment has turned out to be caused by something else. Signed-off-by: beorn7 <beorn@grafana.com>
This commit brings back direct mean calculation (for `avg` and `avg_over_time`) but isn't an outright revert of #16569. It keeps the improved incremental mean calculation and features generally a bit cleaner code than before. Also, this commit... - ...updates the lengthy comment explaining the whole situation and trade-offs. - ...divides the running sum and the Kahan compensation term separately (in direct mean calculation) to avoid the (unlikely) possibility that sum and Kahan compensation together ovorflow float64. - ...uncomments the tests that should now work again on darwin/arm64. - ...uncomments the test that should now reliably yield the (inaccurate) value 0 on all hardware platforms. Also, the test description has been updated accordingly. - ...adds avg_over_time tests for zero and one sample in the range. Signed-off-by: beorn7 <beorn@grafana.com>
The test in question actually worked fine even before #16569. The finding reported in the comment has turned out to be caused by something else. Signed-off-by: beorn7 <beorn@grafana.com>
This commit brings back direct mean calculation (for `avg` and `avg_over_time`) but isn't an outright revert of #16569. It keeps the improved incremental mean calculation and features generally a bit cleaner code than before. Also, this commit... - ...updates the lengthy comment explaining the whole situation and trade-offs. - ...divides the running sum and the Kahan compensation term separately (in direct mean calculation) to avoid the (unlikely) possibility that sum and Kahan compensation together ovorflow float64. - ...uncomments the tests that should now work again on darwin/arm64. - ...uncomments the test that should now reliably yield the (inaccurate) value 0 on all hardware platforms. Also, the test description has been updated accordingly. - ...adds avg_over_time tests for zero and one sample in the range. Signed-off-by: beorn7 <beorn@grafana.com>
As it turns out, if we combine Kahan summation and incremental mean calculation properly, it works quite well and we do not need to switch between simple mean calculation and incremental calculation based on overflow.
This simplifies the code quite a bit.