Redpanda Consumer Mass Disconnect Reproduction #79

Harsh9485 · 2025-06-10T05:06:35Z

Add Detection Rule: Redpanda Consumer Mass Disconnect → Coordinator Failure

Overview

This PR adds a CRE detection rule for a critical Redpanda failure in which mass consumer disconnections overwhelm the group coordinator and bring message processing to a complete halt.

Rule ID: CRE-2025-0091
Severity: 10/10 (Critical)
Category: distributed-messaging-connectivity

Failure Scenario Reproduced

Issue: When 100+ consumers are forcibly disconnected simultaneously, Redpanda's consumer group coordinator becomes unresponsive.

Impact:

❌ Complete halt of message processing
❌ New consumers cannot join groups (MemberIdRequiredError)
❌ Existing consumers stuck with NodeNotReadyError
❌ Data pipeline downtime until manual intervention
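The detection idea behind this scenario can be sketched in Python. This is only an illustration, not the CRE rule itself and not its syntax: the log line layout (`<ISO timestamp> <message>`), the disconnect message pattern, the function name `detect_mass_disconnect`, and the 10-second and 60-second windows are all assumptions made for the example. Only the 100-consumer threshold and the two error names come from the description above.

```python
import re
from collections import deque
from datetime import datetime, timedelta

# Assumed (hypothetical) log layout: "<ISO-8601 timestamp> <message>".
DISCONNECT = re.compile(r"consumer .* disconnected", re.IGNORECASE)
# Error names taken from the impact list above.
COORDINATOR_ERRORS = ("MemberIdRequiredError", "NodeNotReadyError")


def detect_mass_disconnect(
    lines,
    threshold=100,                  # "100+ consumers" from the scenario
    window=timedelta(seconds=10),   # assumed meaning of "simultaneously"
    follow=timedelta(seconds=60),   # assumed max delay before coordinator errors
):
    """Return the timestamp of the first coordinator error that follows a
    burst of at least `threshold` disconnects inside `window`, else None."""
    recent = deque()   # timestamps of disconnects still inside the window
    burst_at = None    # time of the most recent disconnect that completed a burst

    for line in lines:
        stamp, _, message = line.partition(" ")
        ts = datetime.fromisoformat(stamp)

        if DISCONNECT.search(message):
            recent.append(ts)
            # Drop disconnects that have slid out of the window.
            while recent and ts - recent[0] > window:
                recent.popleft()
            if len(recent) >= threshold:
                burst_at = ts
        elif burst_at is not None and any(e in message for e in COORDINATOR_ERRORS):
            if ts - burst_at <= follow:
                return ts
    return None
```

Requiring both signals in order (a disconnect burst, then coordinator errors) keeps the sketch from firing on routine consumer churn or on an isolated `NodeNotReadyError`.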

🔗 References

/solves #69
/claim #69
closes #69

Reproduction Repo: Redpanda (private repo shared with @tonymeehan and @Lyndon-prequel)
CRE Playground: Playground

Demo Video

Screen.Recording.2025-06-10.034013.1.1.mp4

Screen.Recording.2025-06-10.094058.mp4

Impact Score: 10/10 - Complete message processing halt
Mitigation Score: 7/10 - Requires restart + graceful consumer management

Harsh9485 · 2025-06-10T05:22:47Z

Hi @Lyndon-prequel, @tonymeehan

I just wanted to clarify that the failure I submitted in my PR was not my original plan. Initially, I was trying to simulate a high-severity failure described in this GitHub issue: redpanda-data/redpanda#3643. If I had succeeded, it would have been the perfect candidate for a serious production-level failure.

That failure involves a Redpanda broker running on an ARM64 AWS instance. After several hours of normal operation, both the producer and the broker see a sudden surge of errors, and memory usage grows by roughly 1 GB every 5 hours. The producer uses rust-rdkafka to send around 200 messages/second across 14 topics, with idempotency and SASL authentication enabled.

Despite trying to reproduce this in multiple languages (Rust, Python), I was not able to successfully simulate the issue, even after many attempts and long-running tests.

Because of this, I had to switch to a simpler but still valid failure scenario, which delayed my PR submission.

Thanks for understanding!

Harsh9485 · 2025-06-11T15:16:48Z

@tonymeehan Can I have your eye on this 😄?

tonymeehan · 2025-06-13T14:05:31Z

This rule looks good to merge! Please rename the CRE id (folder and rule ID) to CRE-2025-0091 and we'll be good to merge.

Harsh9485 · 2025-06-13T16:31:45Z

Done, @tonymeehan!

Harsh9485 · 2025-06-15T16:06:33Z

thanks @Lyndon-prequel

Redpanda Consumer Mass Disconnect Reproduction

4fdd0ef

algora-pbc bot added the 🙋 Bounty claim label Jun 10, 2025

algora-pbc bot mentioned this pull request Jun 10, 2025

[New Rule] Redpanda: Reproduce A High-Severity Failure & Write a Detection Rule #69

Closed

update file name

bc154d2

Lyndon-prequel merged commit 2911462 into prequel-dev:main Jun 15, 2025
2 checks passed

Harsh9485 mentioned this pull request Jun 16, 2025

Istio Ambient Troubleshooting Rules #81

Closed

Harsh9485 mentioned this pull request Sep 1, 2025

Supabase (self-hosted): Reproduce High-Severity Failures from the Troubleshooting Guide & Write a CRE Rule [Submit by September 3 11:59 pm ET] #131

Open
