Add Postgres (self-hosted) critical HA upstream failure detection rule #51
Fixes #44
/claim #44
Reproducible Test Setup
The test setup uses a Docker Compose-based Patroni/etcd/PostgreSQL cluster (Patroni v2.x, Postgres 16, etcd 3.x). The repository includes a docker-compose.yaml and helper scripts to spin up three etcd nodes and three Patroni-managed Postgres instances. Each scenario (disk saturation, etcd quorum loss, failover misfire, missing replication slots, replication lag) can be triggered.
Problem Detection
This rule detects critical, cluster-breaking events that appear in Patroni/etcd/Postgres logs, corresponding to the failure scenarios listed in the test setup above.
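As a rough illustration of this kind of log-based detection, the Python sketch below scans log lines for a few failure signatures. It is not the rule submitted in this PR: the pattern strings, scenario keys, `ASSUMED_PATTERNS`, `detect_failures`, and the sample lines are all assumptions chosen for illustration, and only three of the scenarios are covered.

```python
import re
from typing import Dict, Iterable, Pattern, Set

# Assumed signatures, NOT copied from the rule in this PR.
# Keys are hypothetical scenario names; values are illustrative regexes.
ASSUMED_PATTERNS: Dict[str, Pattern[str]] = {
    "disk_saturation": re.compile(r"No space left on device"),
    "etcd_quorum_loss": re.compile(r"etcdserver: (no leader|request timed out)"),
    "missing_replication_slot": re.compile(r'replication slot "[^"]+" does not exist'),
}


def detect_failures(
    lines: Iterable[str],
    patterns: Dict[str, Pattern[str]] = ASSUMED_PATTERNS,
) -> Set[str]:
    """Return the names of all scenarios whose pattern matches at least one line."""
    hits: Set[str] = set()
    for line in lines:
        for name, pattern in patterns.items():
            if pattern.search(line):
                hits.add(name)
    return hits


# Synthetic sample lines written for this sketch, not captured from a real cluster.
SAMPLE_LINES = [
    'FATAL: could not write to file "pg_wal/xlogtemp.1": No space left on device',
    "etcdserver: no leader",
    "INFO: no action. I am the leader with the lock",
]

if __name__ == "__main__":
    print(sorted(detect_failures(SAMPLE_LINES)))
```

Passing the patterns as a parameter keeps the matching logic separate from the signatures, so the placeholder regexes can be swapped for the ones the actual rule uses.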
Why This Matters
Together, these failure modes represent a high-severity outage in any HA Postgres cluster. Early detection reduces mean time to recovery and prevents cascading downstream failures.
Reproducible test setup (maintainers invited): GitHub Repo Link
Live CRE link: CRE Playground Link