
@mapno (Contributor) commented Jun 27, 2025

What this PR does:

The bufferer is a Kafka-based ingestion service that consumes trace data from Kafka topics instead of receiving it via gRPC like the traditional ingester does. It reads serialized traces from Kafka and processes them through the same data pipeline as the ingester (live traces → head blocks → WAL → complete blocks).

                      ┌─────────────────────────────┐
                      │        Bufferer             │
                      │                             │
                      │                             │
                      │ • Manages instances map     │
                      │ • Shared WAL coordination   │
                      │ • Flush queue management    │
                      └─────────────┬───────────────┘
                                    │
                      ┌─────────────┴───────────────┐
                      │                             │
                      ▼                             ▼
          ┌─────────────────────┐       ┌─────────────────────┐
          │  PartitionReader    │       │   instances map     │
          │                     │       │   [tenantID]        │
          │ • Kafka consumer    │       │                     │
          │ • Watermark commits │       │ ┌─────────────────┐ │
           │ • Routes to consume │       │ │   instance A    │ │
          │   function          │       │ │   (tenant A)    │ │
          └─────────────────────┘       │ │                 │ │
                                        │ │ • Live traces   │ │
                                        │ │ • Head block    │ │
                                        │ │ • WAL blocks    │ │
                                        │ │ • Complete      │ │
                                        │ │   blocks        │ │
                                        │ └─────────────────┘ │
                                        │                     │
                                        │ ┌─────────────────┐ │
                                        │ │   instance B    │ │
                                        │ │   (tenant B)    │ │
                                        │ │                 │ │
                                        │ │ • Live traces   │ │
                                        │ │ • Head block    │ │
                                        │ │ • WAL blocks    │ │
                                        │ │ • Complete      │ │
                                        │ │   blocks        │ │
                                        │ └─────────────────┘ │
                                        └─────────────────────┘
                                                   │
                                                   ▼
                                        ┌─────────────────────┐
                                        │     Shared WAL      │
                                        │                     │
                                        │ • Single WAL for    │
                                        │   all instances     │
                                        │ • Coordinated       │
                                        │   block cutting     │
                                        └─────────────────────┘
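
Roughly, the consume path looks like the sketch below. This is illustrative only and uses hypothetical names (record, newInstance, pushBytes); it is not the exact code in this PR, just the shape of the flow: the PartitionReader hands batches of Kafka records to a consume callback, which routes each serialized trace to its tenant's instance.

```go
package bufferer

import "context"

// Illustrative types only; the real definitions in this PR differ.
type record struct {
	tenantID string
	traceID  []byte
	value    []byte // serialized trace bytes from Kafka
}

type wal struct{}

type instance struct{ tenantID string }

func newInstance(tenantID string, w *wal) *instance { return &instance{tenantID: tenantID} }

// pushBytes appends a serialized trace to the instance's live traces.
// Periodic cuts later move data through live traces -> head block -> WAL
// blocks -> complete blocks, mirroring the ingester pipeline. Stubbed here.
func (i *instance) pushBytes(ctx context.Context, traceID, trace []byte) error { return nil }

type Bufferer struct {
	instances map[string]*instance // one instance per tenant
	wal       *wal                 // single WAL shared by all instances
}

// consume is the callback the PartitionReader invokes with each batch of
// Kafka records; it routes every serialized trace to its tenant's instance,
// creating the instance lazily on the first record for that tenant.
func (b *Bufferer) consume(ctx context.Context, records []record) error {
	for _, rec := range records {
		inst, ok := b.instances[rec.tenantID]
		if !ok {
			inst = newInstance(rec.tenantID, b.wal)
			b.instances[rec.tenantID] = inst
		}
		if err := inst.pushBytes(ctx, rec.traceID, rec.value); err != nil {
			return err
		}
	}
	return nil
}
```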

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@mapno force-pushed the rhythm-bufferer branch from 3511f27 to 5ec0217 on July 7, 2025 16:10
@mapno force-pushed the rhythm-bufferer branch from 8f2c0bb to dc5a73e on July 8, 2025 13:26
@mapno force-pushed the rhythm-bufferer branch from dc5a73e to 33fb555 on July 8, 2025 13:41

@mattdurham (Contributor) commented:

In general, does it make sense to move responsibility for timing and actions from the Bufferer to the individual instances? This would make the instances more self-contained and potentially avoid a good chunk of the locking. It would be harder to do global cuts, but we could do tenant-specific cuts instead? Or trigger them via a channel with a timer in the Bufferer.
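
Something like this rough sketch, with hypothetical names and building on the illustrative per-tenant instance type from the description above; it is not code from this PR:

```go
package bufferer

import (
	"context"
	"time"
)

// instance is the same illustrative per-tenant type as in the earlier sketch.
type instance struct{ tenantID string }

// cutIdleTraces would move live traces into the head block; all=true forces
// everything to be cut regardless of idle time. Stubbed here.
func (i *instance) cutIdleTraces(all bool) {}

// runCutLoop sketches the suggestion: each instance owns its own timer, so
// cuts are tenant-specific and the Bufferer holds less centrally locked state.
// A coordinated "global" cut is still possible -- the Bufferer signals
// forceCut for every instance from a single timer.
func (i *instance) runCutLoop(ctx context.Context, period time.Duration, forceCut <-chan struct{}) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			i.cutIdleTraces(false) // this tenant's own schedule
		case <-forceCut:
			i.cutIdleTraces(true) // Bufferer-triggered cut across all tenants
		case <-ctx.Done():
			return
		}
	}
}
```

Per-instance loops trade a timer per tenant for dropping the shared ticker and the lock contention around it.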

@mattdurham (Contributor) commented:

From our chat, would calling the Bufferer "ReadCache" make more sense?

@mattdurham (Contributor) commented:

Can we document the failure modes and what happens in each? Do we need 100% data fidelity, or is 99.9% acceptable, given that the data will be correctly handled by long-term storage regardless?

@mattdurham (Contributor) commented:

Assuming failures do happen, is it better to return incorrect data or no data?

@mapno mentioned this pull request Jul 21, 2025

@mapno (Contributor, Author) commented Jul 21, 2025

Development has been moved to a dev branch. See #5430

@mapno closed this Jul 21, 2025