@Harsh9485 Harsh9485 commented Jun 10, 2025

Add Detection Rule: Redpanda Consumer Mass Disconnect → Coordinator Failure

Overview

This PR adds a CRE detection rule for a critical Redpanda failure in which mass consumer disconnections overwhelm the group coordinator and bring message processing to a complete halt.

Rule ID: CRE-2025-0091
Severity: 10/10 (Critical)
Category: distributed-messaging-connectivity

Failure Scenario Reproduced

Issue: When 100+ consumers are forcibly disconnected simultaneously, Redpanda's consumer group coordinator becomes unresponsive.

Impact:

  • ❌ Complete halt of message processing
  • ❌ New consumers cannot join groups (MemberIdRequiredError)
  • ❌ Existing consumers stuck with NodeNotReadyError
  • ❌ Data pipeline downtime until manual intervention
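
The symptom pattern above can be sketched as a simple burst detector. This is only an illustration of the idea, not the CRE rule added in this PR: the two error names come from the list above, while the log format (an ISO-8601 timestamp at the start of each line), the function name `detect_coordinator_failure`, and the default threshold and window are assumptions chosen for the example.

```python
import re
from collections import deque
from datetime import datetime, timedelta

# Error names taken from the impact list; everything else here is hypothetical.
ERROR_PATTERN = re.compile(r"MemberIdRequiredError|NodeNotReadyError")


def detect_coordinator_failure(lines, threshold=100, window=timedelta(seconds=30)):
    """Return the timestamp at which `threshold` matching errors fall inside
    `window`, or None if that never happens.

    Assumes each line starts with an ISO-8601 timestamp followed by a space,
    and that lines arrive in chronological order.
    """
    recent = deque()  # timestamps of matching errors still inside the window
    for line in lines:
        if not ERROR_PATTERN.search(line):
            continue
        ts = datetime.fromisoformat(line.split(" ", 1)[0])
        recent.append(ts)
        # Drop errors that are older than the sliding window.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            return ts
    return None


if __name__ == "__main__":
    start = datetime(2025, 6, 10, 3, 40, 0)
    # Synthetic burst: 120 consumers failing 100 ms apart.
    sample = [
        f"{(start + timedelta(milliseconds=100 * i)).isoformat()} consumer-{i} NodeNotReadyError"
        for i in range(120)
    ]
    print(detect_coordinator_failure(sample))
```

A sliding window keeps isolated errors during normal rebalances from triggering the detector, while a burst on the scale of the 100+ simultaneous disconnects described here crosses the threshold quickly.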

🔗 References

/solves #69
/claim #69
closes #69

Demo Video

Screen.Recording.2025-06-10.034013.1.1.mp4
Screen.Recording.2025-06-10.094058.mp4

Impact Score: 10/10 - Complete message processing halt
Mitigation Score: 7/10 - Requires restart + graceful consumer management

Harsh9485 commented Jun 10, 2025

Hi @Lyndon-prequel, @tonymeehan

I just wanted to clarify that the failure I submitted in my PR was not my original plan. Initially, I was trying to simulate a high-severity failure described in this GitHub issue: redpanda-data/redpanda#3643. If I had succeeded, it would have been the perfect candidate for a serious production-level failure.

This failure involves a Redpanda broker running on an ARM64 AWS instance. After several hours of normal operation, both the producer and the broker experience a sudden surge of errors while memory usage climbs (~1 GB every 5 hours). The producer uses rust-rdkafka to send around 200 messages/second to 14 topics with idempotency and SASL authentication enabled.
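
As a rough back-of-the-envelope check on those figures, the sketch below works out the per-topic rate and the memory growth per produced message. It is an illustration rather than a measurement from the original issue: it assumes "~1GB" means 10**9 bytes, that traffic is spread evenly across the 14 topics, and that the growth is spread evenly over produced messages. The function name `workload_estimates` is made up for the example.

```python
# Figures quoted in the comment; the interpretation of them is an assumption.
MESSAGES_PER_SECOND = 200
TOPIC_COUNT = 14
GROWTH_BYTES = 1_000_000_000       # assumption: "~1GB" read as 10**9 bytes
GROWTH_PERIOD_SECONDS = 5 * 3600   # 5 hours


def workload_estimates():
    """Derive simple rates from the reported workload numbers."""
    growth_per_second = GROWTH_BYTES / GROWTH_PERIOD_SECONDS
    return {
        # assumption: messages are distributed evenly across topics
        "messages_per_topic_per_second": MESSAGES_PER_SECOND / TOPIC_COUNT,
        "messages_per_growth_period": MESSAGES_PER_SECOND * GROWTH_PERIOD_SECONDS,
        "growth_bytes_per_second": growth_per_second,
        # assumption: growth is attributed evenly to every produced message
        "growth_bytes_per_message": growth_per_second / MESSAGES_PER_SECOND,
    }


if __name__ == "__main__":
    for name, value in workload_estimates().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the growth works out to a few hundred bytes per message, which suggests a reproduction needs to run for hours at the full message rate before the growth becomes visible.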

Despite trying to reproduce this in multiple languages (Rust, Python), I was not able to successfully simulate the issue, even after many attempts and long-running tests.

Because of this, I had to switch to a simpler but still valid failure scenario, which delayed my PR submission.

Thanks for understanding!

@Harsh9485
Contributor Author

@tonymeehan Can I have your eye on this 😄?

@tonymeehan
Contributor

This rule looks good to merge! Please rename the CRE id (folder and rule ID) to CRE-2025-0091 and we'll be good to merge.

@Harsh9485
Contributor Author

Done, @tonymeehan!

@Lyndon-prequel Lyndon-prequel merged commit 2911462 into prequel-dev:main Jun 15, 2025
2 checks passed
@Harsh9485
Contributor Author

thanks @Lyndon-prequel

Development

Successfully merging this pull request may close these issues.

[New Rule] Redpanda: Reproduce A High-Severity Failure & Write a Detection Rule
3 participants