
[Rhythm] Improve metrics generator + Kafka performance and stability #4721


Merged: 23 commits, Feb 21, 2025

Conversation

@mdisibio (Contributor) commented Feb 18, 2025

What this PR does:
This PR contains a variety of stability and performance improvements for the new architecture in which the generators read from a Kafka queue. At higher volumes or when lagging, OOMs and degraded performance occurred.

Main changes:

  1. Back pressure
    The previous version had no protection against a large backlog of data: it would populate live traces and fill up WAL blocks as fast as possible, causing frequent OOMs. Why filling up live traces is a problem is straightforward, but the generators accumulating WAL blocks is also a problem. WAL blocks have higher memory requirements because they maintain an IDMap for trace lookups and re-sorting, whereas complete blocks on disk require very little memory. After an OOM, having to replay all WAL blocks further compounds the problem.

    Therefore the generator now applies back pressure to slow down ingestion from the queue when it gets bogged down. There are two cases: too much live-trace data, controlled by the new max_live_traces_bytes setting (default is 50% of the max block size, i.e. 250 MB), and too many outstanding WAL blocks. Back pressure can be monitored via the new back_pressure_seconds_total metric. A minimal sketch of the mechanism follows this list.

  2. Concurrency on reads
    Previously a single goroutine read from the Kafka queue and deserialized the proto. This was a bottleneck, and that step is now parallelized (using an internal channel of kgo records). I saw benefits of this in a heavier cluster all the way up to 16 goroutines, so that is the default; returns started diminishing after 10 or 12. This might sound like a lot of goroutines, but a good comparison is the number of concurrent gRPC calls from distributors in the previous architecture, which would likely be dozens or hundreds. The effectiveness of this can be monitored via the new enqueue_time_seconds_total metric. A sketch of the worker pool follows this list.

  3. Concurrency on completing WAL blocks
    Converting a WAL block to a complete block is necessary, and it was a bottleneck for larger-volume tenants because it was handled by a single goroutine running the completeLoop. This is now handled by a priority queue and multiple goroutines. Taking a page from the ingesters, the priority queue is shared across all tenants, which gives very good control over the total amount of work being done while still allowing a single high-volume tenant to scale up when needed. Concurrency is configurable (default 4). The effectiveness of this can be monitored via the new complete_queue_length metric.

    This part is tricky because of the way the generator processors can be started/stopped and the overall code structure. The shared priority queue is reference counted so that the first local-blocks processor starts it and the last one to shut down stops it; a sketch of the reference counting follows this list. Happy to take suggestions or other ideas here; the main thing we need is the concurrency for this step in the pipeline.
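
Below is a minimal sketch of the back-pressure check described in item 1. The function, config fields, polling interval, and metric registration are illustrative assumptions rather than the exact code in this PR; the real processor tracks live-traces size and outstanding WAL blocks itself.

```go
package generator

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative limits; in the real processor these come from the tenant/config.
type backPressureConfig struct {
	MaxLiveTracesBytes      uint64 // e.g. 50% of the max block size
	MaxOutstandingWALBlocks int
}

var backPressureSeconds = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "back_pressure_seconds_total",
	Help: "Seconds spent waiting because live traces or WAL blocks exceeded limits.",
})

// waitForCapacity blocks before the next read from the queue while either
// limit is exceeded, and records the time spent waiting.
func waitForCapacity(ctx context.Context, cfg backPressureConfig,
	liveTracesBytes func() uint64, outstandingWALBlocks func() int) {

	for liveTracesBytes() > cfg.MaxLiveTracesBytes ||
		outstandingWALBlocks() > cfg.MaxOutstandingWALBlocks {

		start := time.Now()
		select {
		case <-ctx.Done():
			return
		case <-time.After(100 * time.Millisecond):
		}
		backPressureSeconds.Add(time.Since(start).Seconds())
	}
}
```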
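
And a sketch of the parallel read/deserialize step from item 2, assuming records are fanned out over an internal channel to a fixed-size worker pool (default 16). The channel, worker count, and handler are placeholders, not the actual code paths in this PR.

```go
package generator

import (
	"sync"

	"github.com/twmb/franz-go/pkg/kgo"
)

// startDecoders fans records from the Kafka consumer loop out to a fixed
// number of workers that deserialize the proto payload and hand it to the
// live-traces path, instead of doing that work on a single goroutine.
func startDecoders(records <-chan *kgo.Record, workers int, handle func([]byte)) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for rec := range records {
				// Proto unmarshalling happens here, in parallel.
				handle(rec.Value)
			}
		}()
	}
	return wg
}
```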
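
Finally, a sketch of the reference counting from item 3: the first local-blocks processor to acquire the shared queue starts the completion workers, and the last one to release it stops them. Types and method names here are hypothetical.

```go
package localblocks

import "sync"

// sharedQueue is an illustrative reference-counted wrapper around the
// cross-tenant completion queue.
type sharedQueue struct {
	mtx  sync.Mutex
	refs int
	stop chan struct{}
}

func (q *sharedQueue) acquire(workers int, run func(stop <-chan struct{})) {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	q.refs++
	if q.refs == 1 {
		q.stop = make(chan struct{})
		for i := 0; i < workers; i++ {
			go run(q.stop) // workers pop WAL-block completion jobs off the queue
		}
	}
}

func (q *sharedQueue) release() {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	q.refs--
	if q.refs == 0 {
		close(q.stop)
	}
}
```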

Other changes:

  • Add missing steps to validate and default configs
  • Make partition_lag_seconds a common metric in pkg/ingest
  • Honor max_traces_per_user in localblocks like we do in the ingesters, but only for the non-flushing instance of local-blocks
  • To handle processing of backlogged data, don't adjust the ingestion time slack, but only for the non-flushing instance of local-blocks
  • Preallocate IDMap buffers when replaying WAL blocks
  • Cache the results of strutil.SanitizeLabelName in span-metrics, for a reduction in CPU and memory. Toying with a super minimal approach to caching here, interested to hear feedback; a hedged sketch of one such cache is below.
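
For that last bullet, here is one minimal way such a cache could look; the actual approach in this PR may differ. It assumes the set of distinct attribute keys is small enough that an unbounded map is acceptable.

```go
package spanmetrics

import (
	"sync"

	"github.com/prometheus/prometheus/util/strutil"
)

// labelNameCache memoizes sanitized names so repeated spans with the same
// attribute keys skip the per-character work in strutil.SanitizeLabelName.
var labelNameCache sync.Map // raw label name -> sanitized label name

func sanitizeLabelNameCached(name string) string {
	if v, ok := labelNameCache.Load(name); ok {
		return v.(string)
	}
	sanitized := strutil.SanitizeLabelName(name)
	labelNameCache.Store(name, sanitized)
	return sanitized
}
```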

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@javiermolinar (Contributor) left a comment:

Awesome

@knylander-grafana (Contributor) left a comment:

Thank you for updating the manifest.

@mdisibio mdisibio merged commit d8bf8fe into grafana:main Feb 21, 2025
14 checks passed