Skip to content

[bugfix] Correct instant query calculation #5252

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jun 17, 2025

Conversation

ruslan-mikhailov
Copy link
Contributor

@ruslan-mikhailov ruslan-mikhailov commented Jun 10, 2025

What this PR does:

Instant query should have only one interval, while it has 2 or more because of alignment.
For example, if we have start value
from req:

1749037020 - Wed Jun  4 13:37:00 CEST 2025

it will be aligned:

1749034800 - Wed Jun  4 13:00:00 CEST 2025

End param from req:

1749040620 - Wed Jun  4 14:37:00 CEST 2025

and aligned:

1749042000 - Wed Jun  4 15:00:00 CEST 2025

In total, it gives 2 intervals: 13 - 14, 14-15 with step=1h. Then it gets only the first value here

More details here: #5221 (comment)

The fix is not ideal and uses the same approach of identifying instant queries as before:

func IsInstant(req tempopb.QueryRangeRequest) bool {
	return req.End-req.Start == req.Step
}

Possibly, the better solution would be to pass "instant bool" into QueryRangeRequest, but that requires changes in a bunch of function signatures. Let me know what you think

Which issue(s) this PR fixes:
Fixes #5164 and #5221
Please note that for #5164 there is at least one more bug: querying in range that forces to query both generator and backend gives inaccurate results. It will be fixed in a separate PR.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@ruslan-mikhailov ruslan-mikhailov force-pushed the bugfix/instant-query branch 5 times, most recently from eca0ced to 44d6dd7 Compare June 10, 2025 13:54
@ruslan-mikhailov ruslan-mikhailov marked this pull request as ready for review June 10, 2025 15:05
if step == 0 {
return 0
}
if isInstant(start, end, step) {
return start
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a copy-paste mistake here, it should return end. Now I wonder, how it works well with that

Copy link
Contributor Author

@ruslan-mikhailov ruslan-mikhailov Jun 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It worked because with end == start, the number of intervals is 1 due to https://github.com/ruslan-mikhailov/tempo/blob/main/pkg/traceql/engine_metrics.go#L59
If return end as it should be, then the number of samples will be 2. See https://github.com/ruslan-mikhailov/tempo/blob/main/tempodb/tempodb_metrics_test.go#L62

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix properly, I force instant queries to always use only one interval

@ruslan-mikhailov ruslan-mikhailov marked this pull request as draft June 10, 2025 15:17
@ruslan-mikhailov ruslan-mikhailov force-pushed the bugfix/instant-query branch 2 times, most recently from 8aec5ae to b0b6c40 Compare June 10, 2025 15:50
@ruslan-mikhailov ruslan-mikhailov force-pushed the bugfix/instant-query branch 3 times, most recently from cb72e98 to 2778368 Compare June 11, 2025 09:12
@ruslan-mikhailov ruslan-mikhailov marked this pull request as ready for review June 11, 2025 09:46
return start - start%step
}

// End time is rounded up to next step
func alignEnd(end, step uint64) uint64 {
func alignEnd(start, end, step uint64) uint64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We chatted about this offline - this is the root cause of too many intervals both for instant and query_range. Because we round start down, and end up, it always creates 1 more step. This is modeling behavior of prometheus where start and end timestamps are inclusive.

At some point we want to fix this for query_range too. I.e. 1h with 1m steps should return 60 data points not 61. Is it worth fixing in this PR, or are you thinking to address this in a separate PR? I think we could revisit and simplify some of these changes after doing that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we should decide on the a behaviour that makes sense of us, even if it's slightly different from prometheus.

we don't promise to be prometheus format compliant, we are prometheus inspired and similar so it's okay to diverge where it makes sense for us to diverge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is to fix it with this workaround so far to not be blocked by discussions as it fixes at least 3 issues. Then, when decision is made, we can refactor the code.

@ruslan-mikhailov
Copy link
Contributor Author

+ additional fix

Copy link
Contributor

@mdisibio mdisibio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with the changes as-is because the fixes are good, but I want to mention another possible direction:

The original vision I had for instant queries was to use a different range aggregator instead of StepAggregator here: https://github.com/grafana/tempo/blob/main/pkg/traceql/ast_metrics.go#L156 StepAggregator does the work to sort values into intervals in its Observe function. If we made a new InstantAggregator it would always have and return 1 interval, and it wouldn't need to do time calculations at all. I think this is the ideal fix, because it is simpler but also more efficient. Similar to how we swap between Grouping or Ungrouped aggregators based on presence of any by() clause.

Not sure why this wasn't done originally when introducing instant queries. I think I just didn't realize the number of edge cases being introduced without it.

If you are interested in exploring this direction a bit, I think it would be worthwhile. It should be less churn overall.

@ruslan-mikhailov
Copy link
Contributor Author

+ linter fixes

@ruslan-mikhailov
Copy link
Contributor Author

+ rebase from latest main in order to resolve conflicts

Copy link
Member

@electron0zero electron0zero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm :shipit:

@ruslan-mikhailov ruslan-mikhailov merged commit 7776e70 into grafana:main Jun 17, 2025
21 checks passed
@ruslan-mikhailov ruslan-mikhailov deleted the bugfix/instant-query branch June 17, 2025 12:09
mattdurham pushed a commit to mattdurham/tempo that referenced this pull request Aug 6, 2025
* [bugfix] TraceQL Metrics: correct interval number calculation

* [bugfix] Force to return one interval for instant queries

* linter fix: apply De Morgan's law

* e2e: refactoring

* e2e: refactoring

* e2e: function to call instant query

* e2e: instant query

* e2e: remove redundant param

* e2e: add traces with high cardinality attributes

* e2e: test instant bottomk/topk

* Add tests for topk and bottomk Instant queries

* Changelog

* test refactoring

* Basic test to check corner cases for instant query

* [bugfix] Correct instant query calculation
when query both generator and backend

---------

Co-authored-by: Suraj Nath <9503187+electron0zero@users.noreply.github.com>
mattdurham added a commit that referenced this pull request Aug 6, 2025
* Bump github.com/go-viper/mapstructure/v2 from 2.2.1 to 2.3.0 (#5333)

Bumps [github.com/go-viper/mapstructure/v2](https://github.com/go-viper/mapstructure) from 2.2.1 to 2.3.0.
- [Release notes](https://github.com/go-viper/mapstructure/releases)
- [Changelog](https://github.com/go-viper/mapstructure/blob/main/CHANGELOG.md)
- [Commits](go-viper/mapstructure@v2.2.1...v2.3.0)

---
updated-dependencies:
- dependency-name: github.com/go-viper/mapstructure/v2
  dependency-version: 2.3.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update go tools to 1.24.4 (#5322)

* cherry pick

* Update changelog

* cherry pick

* Bugfix: Correctly assign backend shard numbers depending on ingester shards (#5438)

* Correctly assign backend shard numbers depending on ingester shards

Signed-off-by: Joe Elliott <number101010@gmail.com>

* changelog

Signed-off-by: Joe Elliott <number101010@gmail.com>

* test cleanup

Signed-off-by: Joe Elliott <number101010@gmail.com>

* a single space raised up as an offering to the lint gods

Signed-off-by: Joe Elliott <number101010@gmail.com>

---------

Signed-off-by: Joe Elliott <number101010@gmail.com>

* cherry picking

* Propagate tracing context in distributor for HTTP requests (#5312)

* Propagate OTel context in distributor

* Changelog

* chlog v2

* Only propagate sampled traces

* Refactor
# Conflicts:
#	CHANGELOG.md

* [bugfix] Correct instant query calculation  (#5252)

* [bugfix] TraceQL Metrics: correct interval number calculation

* [bugfix] Force to return one interval for instant queries

* linter fix: apply De Morgan's law

* e2e: refactoring

* e2e: refactoring

* e2e: function to call instant query

* e2e: instant query

* e2e: remove redundant param

* e2e: add traces with high cardinality attributes

* e2e: test instant bottomk/topk

* Add tests for topk and bottomk Instant queries

* Changelog

* test refactoring

* Basic test to check corner cases for instant query

* [bugfix] Correct instant query calculation
when query both generator and backend

---------

Co-authored-by: Suraj Nath <9503187+electron0zero@users.noreply.github.com>

* Add nil check to `partitionAssignmentVar` (#5198)

* Nil check partitionAssignmentVar

* Changelog
# Conflicts:
#	CHANGELOG.md

* True up changelog

* True up linter

* Apply suggestions from code review

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>

* Update CHANGELOG.md

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>

* Update CHANGELOG.md

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>

* add release notes

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Joe Elliott <number101010@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: A. Stoewer <adrian@stoewer.me>
Co-authored-by: J Pham <94262131+ie-pham@users.noreply.github.com>
Co-authored-by: Joe Elliott <number101010@gmail.com>
Co-authored-by: Ruslan Mikhailov <195758209+ruslan-mikhailov@users.noreply.github.com>
Co-authored-by: Mario <mariorvinas@gmail.com>
Co-authored-by: Suraj Nath <9503187+electron0zero@users.noreply.github.com>
Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TraceQL metrics count_over_time yields different results for different time boundaries (and instant vs. range queries)
3 participants