@Saturn225 Saturn225 commented Jun 3, 2025

Fixes #44
/claim #44

Reproducible Test Setup

Uses a Docker Compose–based Patroni/etcd/PostgreSQL cluster (Patroni v2.x, Postgres 16, etcd 3.x). The repository includes a docker-compose.yaml and helper scripts to spin up three etcd nodes and three Patroni-managed Postgres instances. Each scenario (disk saturation, etcd quorum loss, failover misfire, missing replication slots, replication lag) can be triggered.

Problem Detection

This rule detects critical, cluster-breaking events that appear in Patroni/etcd/Postgres logs, namely:

  • Disk Saturation (dd “No space left on device” + missing replication‐slot FATALs)
  • Etcd Quorum Loss (DNS resolution failures, MaxRetryErrors, EtcdConnectionFailed)
  • Failover Misfire (etcd connectivity errors preventing Patroni leader election)
  • Patroni Failover Events (missing replication‐slot FATALs during failover)
  • Replication Lag (paused replica, WAL buildup, resumed, final LSN sync)
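For reviewers who want to sanity-check log samples outside the CRE tooling, here is a minimal Python sketch of the matching idea. It is not the merged rule. The substrings come from the list above; the category names, the `FATAL.*replication slot` regex, and the sample lines are assumptions made for illustration, and the real rule's patterns may differ.

```python
import re
from collections import Counter

# Hypothetical signature table. The substrings are the ones named in the
# list above; the exact patterns used by the actual rule are not shown here.
SIGNATURES = {
    "disk_saturation": [r"No space left on device"],
    "etcd_quorum_loss": [r"MaxRetryError", r"EtcdConnectionFailed"],
    # Assumed shape of the missing replication-slot FATAL line.
    "missing_replication_slot": [r"FATAL.*replication slot"],
}

COMPILED = {
    name: [re.compile(pattern) for pattern in patterns]
    for name, patterns in SIGNATURES.items()
}


def classify(line):
    """Return the sorted failure categories whose patterns match this line."""
    return sorted(
        name
        for name, patterns in COMPILED.items()
        if any(p.search(line) for p in patterns)
    )


def scan(lines):
    """Count how many lines fall into each failure category."""
    counts = Counter()
    for line in lines:
        counts.update(classify(line))
    return counts


if __name__ == "__main__":
    # Made-up sample lines, only loosely modelled on real log output.
    sample = [
        "dd: error writing 'fill.bin': No space left on device",
        "patroni.dcs.etcd.EtcdConnectionFailed: No more machines in the cluster",
        'FATAL:  replication slot "pg2" does not exist',
        "LOG:  checkpoint complete",
    ]
    print(dict(scan(sample)))
```

Plain substring or regex matching per line is enough to see which scenario a log excerpt belongs to; it does not model ordering or time windows between events, which a production rule would likely need.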

Why This Matters

  1. WAL Streaming Breakage → Replicas cannot receive WAL; the cluster may lose sync entirely.
  2. Disk Full → Primary cannot write new WAL, which likely stalls all writes and can corrupt replication slots.
  3. Etcd Quorum Loss → Patroni cannot elect/promote a leader; the cluster may become read-only or risk split-brain.
  4. Failover Misfire → Automated failover will not occur, risking prolonged downtime or data loss.

Together, these failure modes represent a high-severity outage in any HA Postgres cluster. Early detection reduces mean time to recovery and prevents cascading downstream failures.

Reproducible test setup (maintainers invited): GitHub Repo Link
Live CRE link: CRE Playground Link

@Saturn225 Saturn225 requested a review from tonymeehan June 3, 2025 20:27
@tonymeehan tonymeehan merged commit 0424517 into prequel-dev:main Jun 4, 2025
2 checks passed
@Saturn225 Saturn225 deleted the postgres branch June 5, 2025 16:32
Linked issue: [New Rule] Postgres (self-hosted): Reproduce A High-Severity Failure & Write a Detection Rule