Skip to content

Exemplars UX improvements #5158

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ruslan-mikhailov
Copy link
Contributor

@ruslan-mikhailov ruslan-mikhailov commented May 23, 2025

What this PR does:

  1. Distribute exemplars over time.
    a. Change the approach to calculating requested exemplars from frontend to querier. Instead of equal share, calculate it based on block time range. The biggest block will receive more exemplars than a small block
    b. Change sampling algorithm by fixing number of sampling buckets

Before the change:
image

After the change:
image

Results:
Before:
Pasted image 20250523145320

After:
Pasted image 20250523144347

  1. Hard limit returned exemplars from query-frontend during final aggregation.

Which issue(s) this PR fixes:
Fixes #4856
Fixes #4632

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@ruslan-mikhailov ruslan-mikhailov force-pushed the exemplars-improvements branch 7 times, most recently from 74e9718 to 147cca9 Compare May 23, 2025 14:33
@ruslan-mikhailov ruslan-mikhailov marked this pull request as ready for review May 23, 2025 14:48
@@ -228,8 +263,8 @@ func (s *queryRangeSharder) buildBackendRequests(ctx context.Context, tenantID s
queryHash := hashForQueryRangeRequest(&searchReq)
colsToJSON := api.NewDedicatedColumnsToJSON()

exemplarsPerBlock := s.exemplarsPerShard(uint32(len(metas)), searchReq.Exemplars)
for _, m := range metas {
exemplarsPerBlock := s.exemplarsPerShard(metas, searchReq.Exemplars)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also scale the number of exemplars for generatorRequest based on the time range? I think it sends over the whole limit, which was also a reason exemplars seemed to clump in recent data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! With bucket fix, it still should distribute fairly, but by cutting the exemplars parameter, we can reduce the load to generators.

I pushed exemplar splitting between generator and querier in a separate commit. Could you please take a look?

@knylander-grafana
Copy link
Contributor

knylander-grafana commented May 23, 2025

@ruslan-mikhailov Do we need to update the docs so people understand the changes you've made in this PR? Maybe this doc? https://grafana.com/docs/tempo/latest/traceql/metrics-queries/#exemplars

Also, have you seen this PR? #5129

Copy link
Contributor

@mapno mapno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Determine the number of requested exemplars from a block
based on its time range, not equally
@ruslan-mikhailov ruslan-mikhailov force-pushed the exemplars-improvements branch from f7d231c to 4b72208 Compare June 2, 2025 08:37
@ruslan-mikhailov
Copy link
Contributor Author

+ rebase from latest main

@ruslan-mikhailov
Copy link
Contributor Author

+ linter fixes

@ruslan-mikhailov ruslan-mikhailov force-pushed the exemplars-improvements branch from 833a0e8 to 6b9bafe Compare June 2, 2025 08:46
@ruslan-mikhailov ruslan-mikhailov merged commit a75b0bb into grafana:main Jun 2, 2025
20 checks passed
@ruslan-mikhailov ruslan-mikhailov deleted the exemplars-improvements branch June 2, 2025 09:52
ruslan-mikhailov added a commit to ruslan-mikhailov/tempo that referenced this pull request Jun 2, 2025
* TraceQL Metrics: more fair exemplars distribution

Determine the number of requested exemplars from a block
based on its time range, not equally

* TraceQL Metrics: distribute exemplars over time

UX improvement

* TraceQL Metrics: Parametrise exemplars limit

* TraceQL Metrics: pass exemplars limit to combiners

* TraceQL Metrics: hard limit exemplars

* TraceQL Metrics: restore default behaviour

* TraceQL Metrics: e2e test for number of returned exemplars

* TraceQL Metrics: state bug in quantile_over_time

* Changelog

* Stabilise flaky test

* nolint for integer overflow

* minor linter fixes

* TraceQL Metrics: share exemplars between generator and backend

This will reduce load on generators

* Add comment with link to bug

* minor linter fixes
ruslan-mikhailov added a commit that referenced this pull request Jun 2, 2025
* TraceQL Metrics: more fair exemplars distribution

Determine the number of requested exemplars from a block
based on its time range, not equally

* TraceQL Metrics: distribute exemplars over time

UX improvement

* TraceQL Metrics: Parametrise exemplars limit

* TraceQL Metrics: pass exemplars limit to combiners

* TraceQL Metrics: hard limit exemplars

* TraceQL Metrics: restore default behaviour

* TraceQL Metrics: e2e test for number of returned exemplars

* TraceQL Metrics: state bug in quantile_over_time

* Changelog

* Stabilise flaky test

* nolint for integer overflow

* minor linter fixes

* TraceQL Metrics: share exemplars between generator and backend

This will reduce load on generators

* Add comment with link to bug

* minor linter fixes
knylander-grafana pushed a commit to knylander-grafana/tempo-doc-work that referenced this pull request Jun 2, 2025
* TraceQL Metrics: more fair exemplars distribution

Determine the number of requested exemplars from a block
based on its time range, not equally

* TraceQL Metrics: distribute exemplars over time

UX improvement

* TraceQL Metrics: Parametrise exemplars limit

* TraceQL Metrics: pass exemplars limit to combiners

* TraceQL Metrics: hard limit exemplars

* TraceQL Metrics: restore default behaviour

* TraceQL Metrics: e2e test for number of returned exemplars

* TraceQL Metrics: state bug in quantile_over_time

* Changelog

* Stabilise flaky test

* nolint for integer overflow

* minor linter fixes

* TraceQL Metrics: share exemplars between generator and backend

This will reduce load on generators

* Add comment with link to bug

* minor linter fixes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TraceQL Metrics exemplar distribution is skewed Query exemplars field not respected
4 participants