Fix distributor rebatch bug #5186

Merged
merged 3 commits into from
May 29, 2025

Conversation

mdisibio
Contributor

@mdisibio mdisibio commented May 29, 2025

What this PR does:
Distributors rebatch incoming writes by trace ID to ensure that all spans for a given trace end up on the same ingesters over time. The rebatching uses a 32-bit hash, which can lead to collisions. When two traces in the same incoming write request have the same hash (collide), their spans get intermixed, i.e. bad data.
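To give a rough sense of how likely such collisions are, here is a minimal sketch of the standard birthday approximation. It is not code from this PR: `collisionProbability` is a hypothetical helper, and it assumes the hash is uniformly distributed over its output space.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability is a hypothetical helper (not part of Tempo).
// It estimates the chance that at least two of n distinct trace IDs share
// a hash value, assuming a uniformly distributed hash with the given
// number of output bits. It uses the birthday approximation
// p ~= 1 - exp(-n(n-1) / (2 * 2^bits)).
func collisionProbability(n int, bits int) float64 {
	pairs := float64(n) * float64(n-1) / 2
	space := math.Ldexp(1, bits) // 2^bits
	// -Expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
	return -math.Expm1(-pairs / space)
}

func main() {
	for _, n := range []int{1000, 100000} {
		fmt.Printf("n=%d 32-bit=%.3g 64-bit=%.3g\n",
			n, collisionProbability(n, 32), collisionProbability(n, 64))
	}
}
```

Under this approximation, the 32-bit probability reaches about one half at roughly 77,000 distinct trace IDs, while the 64-bit probability at the same count is many orders of magnitude smaller.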

The core issue is that the distributor logic was conflating hashing for the ring with hashing for dedupe. These don't need to be the same hash, nor should they be. The ring requires 32-bit hashes, and collisions there don't matter: a collision just means two traces go to the same ingesters (normal and unavoidable).

But collisions for dedupe must be avoided, and we do this by switching that part to a 64-bit hash. This is the same approach the trace combiner uses for spans. There are test cases for known trace ID collisions under the old method, and a test that checks the collision rate of the new method.
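The separation described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `span` and `batch` types and the `ringToken`, `dedupeKey`, and `rebatch` functions are hypothetical, and the choice of FNV from the Go standard library for both hashes is an assumption made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// span is a hypothetical stand-in for an incoming span.
type span struct {
	TraceID []byte
	Name    string
}

// batch holds all spans of one trace plus the ring token used for sharding.
type batch struct {
	RingToken uint32
	Spans     []span
}

// ringToken returns a 32-bit token for ring sharding. Collisions here are
// harmless: two traces simply land on the same ingesters.
// (Assumption: FNV-1 over tenant ID and trace ID.)
func ringToken(userID string, traceID []byte) uint32 {
	h := fnv.New32()
	_, _ = h.Write([]byte(userID))
	_, _ = h.Write(traceID)
	return h.Sum32()
}

// dedupeKey returns a 64-bit key used only to group spans by trace.
// (Assumption: FNV-1a over the trace ID.)
func dedupeKey(traceID []byte) uint64 {
	h := fnv.New64a()
	_, _ = h.Write(traceID)
	return h.Sum64()
}

// rebatch groups spans by the 64-bit dedupe key and records the 32-bit
// ring token once per group, so the two hashes are never conflated.
func rebatch(userID string, spans []span) []batch {
	index := make(map[uint64]int) // dedupe key -> position in out
	var out []batch
	for _, s := range spans {
		k := dedupeKey(s.TraceID)
		i, ok := index[k]
		if !ok {
			i = len(out)
			index[k] = i
			out = append(out, batch{RingToken: ringToken(userID, s.TraceID)})
		}
		out[i].Spans = append(out[i].Spans, s)
	}
	return out
}

func main() {
	spans := []span{
		{TraceID: []byte{0x01}, Name: "a"},
		{TraceID: []byte{0x02}, Name: "b"},
		{TraceID: []byte{0x01}, Name: "c"},
	}
	for _, b := range rebatch("tenant", spans) {
		fmt.Printf("token=%d spans=%d\n", b.RingToken, len(b.Spans))
	}
}
```

In this sketch, two batches may end up with the same `RingToken` without any spans being mixed, because grouping depends only on the 64-bit key.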

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@@ -472,7 +471,7 @@ func (d *Distributor) PushTraces(ctx context.Context, traces ptrace.Traces) (*te

maxAttributeBytes := d.getMaxAttributeBytes(userID)

- keys, rebatchedTraces, truncatedAttributeCount, err := requestsByTraceID(batches, userID, spanCount, maxAttributeBytes)
+ ringTokens, rebatchedTraces, truncatedAttributeCount, err := requestsByTraceID(batches, userID, spanCount, maxAttributeBytes)
Collaborator

+1 to the rename

@mdisibio mdisibio marked this pull request as ready for review May 29, 2025 22:30
@mdisibio mdisibio merged commit b9a611e into grafana:main May 29, 2025
20 checks passed
knylander-grafana pushed a commit to knylander-grafana/tempo-doc-work that referenced this pull request Jun 2, 2025
* Fix distributor rebatch bug, by not using a 32-bit hash for deduping, only for ring sharding (as required)

* lint

* changelog
@mdisibio mdisibio added type/bug Something isn't working backport release-v2.8 labels Jun 3, 2025
mdisibio added a commit to mdisibio/tempo that referenced this pull request Jun 3, 2025
* Fix distributor rebatch bug, by not using a 32-bit hash for deduping, only for ring sharding (as required)

* lint

* changelog
mdisibio added a commit that referenced this pull request Jun 3, 2025
* Fix distributor rebatch bug (#5186)

* Fix distributor rebatch bug, by not using a 32-bit hash for deduping, only for ring sharding (as required)

* lint

* changelog

* changelog
carles-grafana pushed a commit to carles-grafana/tempo that referenced this pull request Jun 4, 2025
* Fix distributor rebatch bug, by not using a 32-bit hash for deduping, only for ring sharding (as required)

* lint

* changelog
Labels
backport release-v2.8 type/bug Something isn't working
2 participants