Fix traceql exemplar distribution #5129

samuelarogbonlo

What this PR does:
Adds a new package, exemplardist, to fix the issue where TraceQL metrics exemplars cluster on one side of the visualization instead of being spread evenly across the time range. The package implements a bucketing algorithm that distributes exemplars evenly across the time range while keeping them representative of the underlying data.

Which issue(s) this PR fixes:
Fixes #4856

Checklist

  • Tests updated - Added comprehensive tests for the exemplar distribution algorithm
  • Documentation added - Added README.md in the package with explanation and usage examples
  • CHANGELOG.md updated - Not updated since this is a standalone utility package

Additional Notes

This PR only adds the exemplardist package without modifying the core Tempo code. Direct integration into the core codebase was attempted but caused test failures due to the complex interactions with existing code.

The package provides a clean API that can be used at various integration points:

  1. In API handlers that process TraceQL metrics results
  2. In middleware that processes responses
  3. In the frontend rendering code

This solution uses a bucketing algorithm that:

  1. Divides the time range into equal buckets (number of buckets = max exemplars)
  2. Assigns exemplars to the appropriate bucket based on timestamp
  3. Selects one exemplar from each bucket
  4. Fills any empty buckets with exemplars from dense areas to maintain even distribution
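
The four steps above can be sketched roughly as follows. This is a minimal illustration written for this description, not the code in the exemplardist package: the Exemplar type, the Distribute function, and the millisecond timestamps are all hypothetical names and assumptions, and "selects one exemplar" is simplified to "takes the first one seen in each bucket".

```go
package main

import (
	"fmt"
	"sort"
)

// Exemplar is a hypothetical, simplified stand-in for a TraceQL metrics exemplar.
type Exemplar struct {
	TimestampMs int64
	Value       float64
}

// Distribute returns at most maxExemplars exemplars spread across [startMs, endMs).
// It is an illustrative sketch of the bucketing steps, not the package's real API.
func Distribute(in []Exemplar, startMs, endMs int64, maxExemplars int) []Exemplar {
	if maxExemplars <= 0 || endMs <= startMs || len(in) == 0 {
		return nil
	}

	// Step 1: one equal-width bucket per requested exemplar.
	buckets := make([][]Exemplar, maxExemplars)
	width := float64(endMs-startMs) / float64(maxExemplars)

	// Step 2: assign each in-range exemplar to a bucket by timestamp.
	for _, e := range in {
		if e.TimestampMs < startMs || e.TimestampMs >= endMs {
			continue
		}
		idx := int(float64(e.TimestampMs-startMs) / width)
		if idx >= maxExemplars {
			idx = maxExemplars - 1 // guard against floating-point rounding
		}
		buckets[idx] = append(buckets[idx], e)
	}

	// Step 3: take one exemplar from every non-empty bucket and remember the rest.
	out := make([]Exemplar, 0, maxExemplars)
	var spare []Exemplar
	for _, b := range buckets {
		if len(b) == 0 {
			continue
		}
		out = append(out, b[0])
		spare = append(spare, b[1:]...)
	}

	// Step 4: make up for empty buckets with leftovers. Leftovers only exist in
	// buckets that held more than one exemplar, i.e. the denser areas.
	for _, e := range spare {
		if len(out) >= maxExemplars {
			break
		}
		out = append(out, e)
	}

	sort.Slice(out, func(i, j int) bool { return out[i].TimestampMs < out[j].TimestampMs })
	return out
}

func main() {
	// Right-skewed input: four exemplars near the end of the range, one near the start.
	in := []Exemplar{{900, 1}, {910, 2}, {920, 3}, {930, 4}, {100, 5}}
	for _, e := range Distribute(in, 0, 1000, 4) {
		fmt.Println(e.TimestampMs, e.Value)
	}
}
```

Note that step 4 keeps the exemplar count at the requested maximum but cannot create exemplars where the data has none, so a heavily skewed input still yields a skewed (if less skewed) result.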

The tests cover uniform, left-skewed, right-skewed, and clustered input distributions, and the algorithm improves the distribution quality in each of these cases.

Contributor

@knylander-grafana knylander-grafana left a comment

Thank you for adding a readme to your package!

@knylander-grafana knylander-grafana mentioned this pull request May 23, 2025
3 tasks
@ruslan-mikhailov
Contributor

Thank you for your contribution and for taking the time to put this well-explained proposal together!

To simplify a bit, TraceQL Metrics are calculated independently in the queriers for each block, and then merged together in the query-frontend. One of the challenges with exemplar distribution over time is ensuring a fair distribution at both of these stages.

While we do have bucket sampling, it doesn’t work particularly well when it comes to distribution fairness. To address the issue, we needed to change the bucketing algorithm and the way the frontend calculates how many exemplars to request, so that each block gets its fair share based on the block's time range.

I’ve implemented this approach here: #5158. It avoids additional sampling steps and has linear complexity.
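
To make the "fair share" idea concrete, here is a rough sketch of one way a per-block exemplar budget could be derived from time-range overlap. It is an assumption for illustration only and is not taken from #5158: the fairShare function, its parameters, and the choice to round up are all hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// fairShare is a hypothetical helper: it returns how many of `total` exemplars
// a block covering [blockStart, blockEnd) would be asked for if the budget is
// split in proportion to the block's overlap with the query range
// [queryStart, queryEnd). All times share one unit (for example, milliseconds).
func fairShare(total int, queryStart, queryEnd, blockStart, blockEnd int64) int {
	if total <= 0 || queryEnd <= queryStart {
		return 0
	}

	// Clamp the block to the query range.
	lo, hi := blockStart, blockEnd
	if lo < queryStart {
		lo = queryStart
	}
	if hi > queryEnd {
		hi = queryEnd
	}
	if hi <= lo {
		return 0 // no overlap, so no exemplars are requested
	}

	// Proportional share, rounded up so any overlapping block can contribute
	// at least one exemplar. Rounding up is a design assumption of this sketch.
	share := float64(total) * float64(hi-lo) / float64(queryEnd-queryStart)
	return int(math.Ceil(share))
}

func main() {
	// A block covering a quarter of the query range gets a quarter of the budget.
	fmt.Println(fairShare(100, 0, 1000, 250, 500))
}
```

Computing a share like this is a constant-time operation per block, so it adds no extra sampling pass over the exemplars themselves.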

Feel free to take a look and leave any feedback or questions on the PR. Thanks again for your interest and ideas!

@ruslan-mikhailov
Contributor

The issue is fixed in #5158

Development

Successfully merging this pull request may close these issues.

TraceQL Metrics exemplar distribution is skewed