
[Rhythm] Improve metrics generator + Kafka performance and stability #4721


Merged: 23 commits, Feb 21, 2025

Conversation

@mdisibio (Contributor) commented Feb 18, 2025

What this PR does:
This PR contains a variety of stability and performance improvements for the new architecture in which the generators read from a Kafka queue. At higher volumes or when lagging, OOMs and degraded performance occurred.

Main changes:

  1. Back pressure
    The previous version had no protection against a large backlog of data: it would populate live traces and fill up WAL blocks as fast as possible, causing frequent OOMs. Why filling up live traces is a problem is straightforward, but the generators accumulating WAL blocks is also a problem. WAL blocks have higher memory requirements because they maintain an IDMap for trace lookups and re-sorting, whereas complete blocks on disk require very little memory. After an OOM, having to replay all WAL blocks further compounds the problem.

    Therefore the generator now applies back pressure to slow down ingestion from the queue when it gets bogged down. There are two cases: too much live-trace data, controlled by the new max_live_traces_bytes setting (default is 50% of the max block size, i.e. 250 MB), and too many outstanding WAL blocks. Back pressure can be monitored via the new back_pressure_seconds_total metric. A minimal sketch of the mechanism follows this list.

  2. Concurrency on reads
    Previously a single goroutine read from the Kafka queue and deserialized the proto. This was a bottleneck, and that step is now parallelized (using an internal channel of kgo records). I saw benefits of this in a heavier cluster all the way up to 16 goroutines, so that is the default; returns started diminishing after 10 or 12. This might sound like a lot of goroutines, but a good comparison is the number of concurrent gRPC calls from distributors in the previous architecture, which would likely be dozens or hundreds. The effectiveness of this can be monitored via the new enqueue_time_seconds_total metric. A sketch of the worker pool follows this list.

  3. Concurrency on completing WAL blocks
    Converting a WAL block to a complete block is necessary, and it was a bottleneck for larger-volume tenants because it was handled by a single goroutine running the completeLoop. This is now handled by a priority queue and multiple goroutines. Taking a page from the ingesters, the priority queue is shared across all tenants, which gives very good control over the total amount of work being done while still allowing a single high-volume tenant to scale up when needed. Concurrency is configurable (default 4). The effectiveness of this can be monitored via the new complete_queue_length metric.

    This part is tricky because of the way the generator processors can be started/stopped and the overall code structure. The shared priority queue is reference counted so that the first local-blocks processor starts it and the last one to shut down stops it; a sketch of the reference counting follows this list. Happy to take suggestions or other ideas here; the main thing we need is the concurrency for this step in the pipeline.
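
Below is a minimal sketch of the back-pressure check described in item 1. The function, config fields, polling interval, and metric registration are illustrative assumptions rather than the exact code in this PR; the real processor tracks live-traces size and outstanding WAL blocks itself.

```go
package generator

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative limits; in the real processor these come from the tenant/config.
type backPressureConfig struct {
	MaxLiveTracesBytes      uint64 // e.g. 50% of the max block size
	MaxOutstandingWALBlocks int
}

var backPressureSeconds = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "back_pressure_seconds_total",
	Help: "Seconds spent waiting because live traces or WAL blocks exceeded limits.",
})

// waitForCapacity blocks before the next read from the queue while either
// limit is exceeded, and records the time spent waiting.
func waitForCapacity(ctx context.Context, cfg backPressureConfig,
	liveTracesBytes func() uint64, outstandingWALBlocks func() int) {

	for liveTracesBytes() > cfg.MaxLiveTracesBytes ||
		outstandingWALBlocks() > cfg.MaxOutstandingWALBlocks {

		start := time.Now()
		select {
		case <-ctx.Done():
			return
		case <-time.After(100 * time.Millisecond):
		}
		backPressureSeconds.Add(time.Since(start).Seconds())
	}
}
```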
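
And a sketch of the parallel read/deserialize step from item 2, assuming records are fanned out over an internal channel to a fixed-size worker pool (default 16). The channel, worker count, and handler are placeholders, not the actual code paths in this PR.

```go
package generator

import (
	"sync"

	"github.com/twmb/franz-go/pkg/kgo"
)

// startDecoders fans records from the Kafka consumer loop out to a fixed
// number of workers that deserialize the proto payload and hand it to the
// live-traces path, instead of doing that work on a single goroutine.
func startDecoders(records <-chan *kgo.Record, workers int, handle func([]byte)) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for rec := range records {
				// Proto unmarshalling happens here, in parallel.
				handle(rec.Value)
			}
		}()
	}
	return wg
}
```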
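
Finally, a sketch of the reference counting from item 3: the first local-blocks processor to acquire the shared queue starts the completion workers, and the last one to release it stops them. Types and method names here are hypothetical.

```go
package localblocks

import "sync"

// sharedQueue is an illustrative reference-counted wrapper around the
// cross-tenant completion queue.
type sharedQueue struct {
	mtx  sync.Mutex
	refs int
	stop chan struct{}
}

func (q *sharedQueue) acquire(workers int, run func(stop <-chan struct{})) {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	q.refs++
	if q.refs == 1 {
		q.stop = make(chan struct{})
		for i := 0; i < workers; i++ {
			go run(q.stop) // workers pop WAL-block completion jobs off the queue
		}
	}
}

func (q *sharedQueue) release() {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	q.refs--
	if q.refs == 0 {
		close(q.stop)
	}
}
```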

Other changes:

  • Add missing steps to validate and default configs
  • Make partition_lag_seconds a common metric in pkg/ingest
  • Honor max_traces_per_user in localblocks like we do in the ingesters, but only for the non-flushing instance of local-blocks
  • To handle processing of backlogged data, don't adjust the ingestion time slack, but only for the non-flushing instance of local-blocks
  • Preallocate IDMap buffers when replaying WAL blocks
  • Cache the results of strutil.SanitizeLabelName in span-metrics, for a reduction in CPU and memory. Toying with a super minimal approach to caching here, interested to hear feedback; a hedged sketch of one such cache is below.
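
For that last bullet, here is one minimal way such a cache could look; the actual approach in this PR may differ. It assumes the set of distinct attribute keys is small enough that an unbounded map is acceptable.

```go
package spanmetrics

import (
	"sync"

	"github.com/prometheus/prometheus/util/strutil"
)

// labelNameCache memoizes sanitized names so repeated spans with the same
// attribute keys skip the per-character work in strutil.SanitizeLabelName.
var labelNameCache sync.Map // raw label name -> sanitized label name

func sanitizeLabelNameCached(name string) string {
	if v, ok := labelNameCache.Load(name); ok {
		return v.(string)
	}
	sanitized := strutil.SanitizeLabelName(name)
	labelNameCache.Store(name, sanitized)
	return sanitized
}
```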

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@javiermolinar (Contributor) left a comment:

Awesome

@knylander-grafana (Contributor) left a comment:

Thank you for updating the manifest.

@mdisibio mdisibio merged commit d8bf8fe into grafana:main Feb 21, 2025
14 checks passed