[Rhythm] Improve metrics generator + Kafka performance and stability #4721
Merged
Conversation
Commits (truncated titles):
- …he number of traces
- …option to uniqify strings
- …cy within a tenant
- … local blocks processor is stopped
Awesome
mapno reviewed Feb 20, 2025
Thank you for updating the manifest.
javiermolinar approved these changes Feb 21, 2025
mapno approved these changes Feb 21, 2025
What this PR does:
This is a variety of stability and performance improvements for the new architecture in which the generators read from a Kafka queue. At higher volumes or when lagging, OOMs and degraded performance occurred.
Main changes:
Back pressure
The previous version had no protection mechanism against a large backlog of data. It would populate live traces and fill up WAL blocks as fast as possible, causing frequent OOMs. Filling up live traces is a straightforward source of memory pressure, but the generators filling up with WAL blocks is also a problem: WAL blocks have an increased memory requirement because they maintain the IDMap for trace lookups and resorting, while complete blocks on disk require very little memory. After an OOM, having to replay all WAL blocks further compounds the problem.
Therefore the generators now apply back pressure and slow down ingesting data from the queue when things get bogged down. There are 2 cases: too much live-trace data, controlled by the new max_live_traces_bytes setting (default is 50% of max block size, i.e. 250MB), and too many outstanding WAL blocks. This can be monitored via the new back_pressure_seconds_total metric.
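A minimal sketch of the back-pressure idea, assuming illustrative names (the processor struct, its fields, and the 100 ms wait interval) rather than the actual Tempo code:

```go
package generator

import (
	"context"
	"time"
)

// processor is a hypothetical stand-in for the local-blocks processor.
type processor struct {
	maxLiveTracesBytes   uint64
	maxWALBlocks         int
	liveTracesBytes      func() uint64 // current bytes held in live traces
	outstandingWALBlocks func() int    // WAL blocks not yet converted to complete blocks
	backPressureSeconds  float64       // would be a Prometheus counter in the real code
}

// waitForCapacity blocks further ingestion from the queue while either limit
// is exceeded, accumulating the time spent waiting so it can be exported as a
// back_pressure_seconds_total style metric.
func (p *processor) waitForCapacity(ctx context.Context) {
	for p.liveTracesBytes() > p.maxLiveTracesBytes || p.outstandingWALBlocks() > p.maxWALBlocks {
		start := time.Now()
		select {
		case <-ctx.Done():
			return
		case <-time.After(100 * time.Millisecond):
		}
		p.backPressureSeconds += time.Since(start).Seconds()
	}
}
```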
Concurrency on reads
Previously a single goroutine read from the Kafka queue and deserialized the proto. This was a bottleneck, and that step is now parallelized (using an internal channel of kgo records). I saw benefits of this in a heavier cluster all the way up to 16 goroutines, so that is the default; returns started diminishing after 10 or 12. This might sound like a lot of goroutines, but a good comparison is the number of concurrent gRPC calls from distributors in the previous architecture, which would likely be dozens or hundreds. The effectiveness of this can be monitored via the new enqueue_time_seconds_total metric.
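A rough sketch of the fan-out, assuming a channel of kgo records and a placeholder decode function standing in for the real proto unmarshal + enqueue step:

```go
package generator

import (
	"sync"

	"github.com/twmb/franz-go/pkg/kgo"
)

// consume fans Kafka records out to a fixed number of workers so that proto
// deserialization is no longer a single-goroutine bottleneck. decode is a
// stand-in for the real unmarshal-and-enqueue step.
func consume(records <-chan *kgo.Record, workers int, decode func([]byte) error) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for rec := range records {
				_ = decode(rec.Value) // each worker deserializes independently
			}
		}()
	}
	wg.Wait()
}
```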
Concurrency on completing WAL blocks
Converting a WAL block to a complete block is necessary and was a bottleneck for larger-volume tenants because it was handled by a single goroutine running the completeLoop. Now this is handled by a priority queue and multiple goroutines. Taking a page from the ingesters, this is a shared priority queue across all tenants, which gives very good control over the total amount of work being done, but also lets a single high-volume tenant scale up when needed. Concurrency is configurable (default 4). The effectiveness of this can be monitored via the new complete_queue_length metric.
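A simplified sketch of the shared worker pool; it uses a plain channel where the real change uses a priority queue, and the job fields are illustrative:

```go
package generator

import "context"

// completeJob describes one WAL block to convert into a complete block.
// The fields are illustrative.
type completeJob struct {
	tenant  string
	blockID string
}

// runCompleteWorkers starts a fixed number of workers that drain a single
// queue shared by all tenants, so the total completion work is bounded while
// one busy tenant can still use every worker.
func runCompleteWorkers(ctx context.Context, jobs <-chan completeJob, concurrency int, complete func(completeJob) error) {
	for i := 0; i < concurrency; i++ {
		go func() {
			for {
				select {
				case <-ctx.Done():
					return
				case j := <-jobs:
					_ = complete(j)
				}
			}
		}()
	}
}
```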
This part is tricky because of the way the generator processors can be started and stopped and the overall code structure. The shared priority queue is reference counted so that the first local-blocks processor starts it and the last one to shut down stops it. Happy to take suggestions or other ideas here; the main thing we need is the concurrency for this step in the pipeline.
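A sketch of the reference-counting approach described above, with the actual worker start/stop left as hypothetical stubs:

```go
package generator

import "sync"

// sharedQueue is started by the first local-blocks processor that needs it
// and stopped by the last one to shut down, tracked with a reference count.
type sharedQueue struct {
	mtx  sync.Mutex
	refs int
}

func (q *sharedQueue) acquire() {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	q.refs++
	if q.refs == 1 {
		q.startWorkers() // hypothetical: spin up the completion workers
	}
}

func (q *sharedQueue) release() {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	q.refs--
	if q.refs == 0 {
		q.stopWorkers() // hypothetical: stop the workers once nothing needs them
	}
}

func (q *sharedQueue) startWorkers() {}
func (q *sharedQueue) stopWorkers()  {}
```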
Other changes:
Caching of calls to strutil.SanitizeLabelName in span-metrics, for a reduction in CPU and memory. Toying with a super minimal approach to caching here; interested to hear feedback.
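A sketch of what such a minimal cache could look like, wrapping strutil.SanitizeLabelName from Prometheus; the names and the lack of a size bound are assumptions, not the PR's actual implementation:

```go
package spanmetrics

import (
	"sync"

	"github.com/prometheus/prometheus/util/strutil"
)

// A deliberately minimal memoization of SanitizeLabelName: label names repeat
// heavily in span-metrics, so caching avoids repeated work and allocations.
// A real version would need a size bound or periodic reset, omitted here.
var (
	sanitizeMtx   sync.RWMutex
	sanitizeCache = map[string]string{}
)

func sanitizeLabelNameCached(name string) string {
	sanitizeMtx.RLock()
	s, ok := sanitizeCache[name]
	sanitizeMtx.RUnlock()
	if ok {
		return s
	}
	s = strutil.SanitizeLabelName(name)
	sanitizeMtx.Lock()
	sanitizeCache[name] = s
	sanitizeMtx.Unlock()
	return s
}
```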
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]