-
Notifications
You must be signed in to change notification settings - Fork 597
[bugfix] Correct instant query calculation #5252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bugfix] Correct instant query calculation #5252
Conversation
eca0ced
to
44d6dd7
Compare
pkg/traceql/engine_metrics.go
Outdated
if step == 0 { | ||
return 0 | ||
} | ||
if isInstant(start, end, step) { | ||
return start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a copy-paste mistake here, it should return end. Now I wonder, how it works well with that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It worked because with end == start, the number of intervals is 1 due to https://github.com/ruslan-mikhailov/tempo/blob/main/pkg/traceql/engine_metrics.go#L59
If return end
as it should be, then the number of samples will be 2. See https://github.com/ruslan-mikhailov/tempo/blob/main/tempodb/tempodb_metrics_test.go#L62
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fix properly, I force instant queries to always use only one interval
8aec5ae
to
b0b6c40
Compare
cb72e98
to
2778368
Compare
return start - start%step | ||
} | ||
|
||
// End time is rounded up to next step | ||
func alignEnd(end, step uint64) uint64 { | ||
func alignEnd(start, end, step uint64) uint64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We chatted about this offline - this is the root cause of too many intervals both for instant and query_range. Because we round start down, and end up, it always creates 1 more step. This is modeling behavior of prometheus where start and end timestamps are inclusive.
At some point we want to fix this for query_range too. I.e. 1h with 1m steps should return 60 data points not 61. Is it worth fixing in this PR, or are you thinking to address this in a separate PR? I think we could revisit and simplify some of these changes after doing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we should decide on the a behaviour that makes sense of us, even if it's slightly different from prometheus.
we don't promise to be prometheus format compliant, we are prometheus inspired and similar so it's okay to diverge where it makes sense for us to diverge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to fix it with this workaround so far to not be blocked by discussions as it fixes at least 3 issues. Then, when decision is made, we can refactor the code.
+ additional fix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with the changes as-is because the fixes are good, but I want to mention another possible direction:
The original vision I had for instant queries was to use a different range aggregator instead of StepAggregator here: https://github.com/grafana/tempo/blob/main/pkg/traceql/ast_metrics.go#L156 StepAggregator does the work to sort values into intervals in its Observe function. If we made a new InstantAggregator
it would always have and return 1 interval, and it wouldn't need to do time calculations at all. I think this is the ideal fix, because it is simpler but also more efficient. Similar to how we swap between Grouping or Ungrouped aggregators based on presence of any by() clause.
Not sure why this wasn't done originally when introducing instant queries. I think I just didn't realize the number of edge cases being introduced without it.
If you are interested in exploring this direction a bit, I think it would be worthwhile. It should be less churn overall.
+ linter fixes |
8d53cc0
to
a4d7f9b
Compare
when query both generator and backend
a4d7f9b
to
bcf4706
Compare
+ rebase from latest main in order to resolve conflicts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
* [bugfix] TraceQL Metrics: correct interval number calculation * [bugfix] Force to return one interval for instant queries * linter fix: apply De Morgan's law * e2e: refactoring * e2e: refactoring * e2e: function to call instant query * e2e: instant query * e2e: remove redundant param * e2e: add traces with high cardinality attributes * e2e: test instant bottomk/topk * Add tests for topk and bottomk Instant queries * Changelog * test refactoring * Basic test to check corner cases for instant query * [bugfix] Correct instant query calculation when query both generator and backend --------- Co-authored-by: Suraj Nath <9503187+electron0zero@users.noreply.github.com>
* Bump github.com/go-viper/mapstructure/v2 from 2.2.1 to 2.3.0 (#5333) Bumps [github.com/go-viper/mapstructure/v2](https://github.com/go-viper/mapstructure) from 2.2.1 to 2.3.0. - [Release notes](https://github.com/go-viper/mapstructure/releases) - [Changelog](https://github.com/go-viper/mapstructure/blob/main/CHANGELOG.md) - [Commits](go-viper/mapstructure@v2.2.1...v2.3.0) --- updated-dependencies: - dependency-name: github.com/go-viper/mapstructure/v2 dependency-version: 2.3.0 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update go tools to 1.24.4 (#5322) * cherry pick * Update changelog * cherry pick * Bugfix: Correctly assign backend shard numbers depending on ingester shards (#5438) * Correctly assign backend shard numbers depending on ingester shards Signed-off-by: Joe Elliott <number101010@gmail.com> * changelog Signed-off-by: Joe Elliott <number101010@gmail.com> * test cleanup Signed-off-by: Joe Elliott <number101010@gmail.com> * a single space raised up as an offering to the lint gods Signed-off-by: Joe Elliott <number101010@gmail.com> --------- Signed-off-by: Joe Elliott <number101010@gmail.com> * cherry picking * Propagate tracing context in distributor for HTTP requests (#5312) * Propagate OTel context in distributor * Changelog * chlog v2 * Only propagate sampled traces * Refactor # Conflicts: # CHANGELOG.md * [bugfix] Correct instant query calculation (#5252) * [bugfix] TraceQL Metrics: correct interval number calculation * [bugfix] Force to return one interval for instant queries * linter fix: apply De Morgan's law * e2e: refactoring * e2e: refactoring * e2e: function to call instant query * e2e: instant query * e2e: remove redundant param * e2e: add traces with high cardinality attributes * e2e: test instant bottomk/topk * Add tests for topk and bottomk Instant queries * Changelog * test refactoring * Basic test to check corner cases for instant query * [bugfix] Correct instant query calculation when query both generator and backend --------- Co-authored-by: Suraj Nath <9503187+electron0zero@users.noreply.github.com> * Add nil check to `partitionAssignmentVar` (#5198) * Nil check partitionAssignmentVar * Changelog # Conflicts: # CHANGELOG.md * True up changelog * True up linter * Apply suggestions from code review Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com> * Update CHANGELOG.md Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com> * Update CHANGELOG.md Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com> * add release notes --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Joe Elliott <number101010@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: A. Stoewer <adrian@stoewer.me> Co-authored-by: J Pham <94262131+ie-pham@users.noreply.github.com> Co-authored-by: Joe Elliott <number101010@gmail.com> Co-authored-by: Ruslan Mikhailov <195758209+ruslan-mikhailov@users.noreply.github.com> Co-authored-by: Mario <mariorvinas@gmail.com> Co-authored-by: Suraj Nath <9503187+electron0zero@users.noreply.github.com> Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>
What this PR does:
Instant query should have only one interval, while it has 2 or more because of alignment.
For example, if we have start value
from req:
it will be aligned:
End param from req:
and aligned:
In total, it gives 2 intervals: 13 - 14, 14-15 with step=1h. Then it gets only the first value here
More details here: #5221 (comment)
The fix is not ideal and uses the same approach of identifying instant queries as before:
Possibly, the better solution would be to pass "instant bool" into QueryRangeRequest, but that requires changes in a bunch of function signatures. Let me know what you think
Which issue(s) this PR fixes:
Fixes #5164 and #5221
Please note that for #5164 there is at least one more bug: querying in range that forces to query both generator and backend gives inaccurate results. It will be fixed in a separate PR.
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]