Add Postgres(self-hosted) critical HA upstream failure detection rule #51

Saturn225 · 2025-06-03T02:41:15Z

Fixes #44
/claim #44

Reproducible Test Setup

The setup uses a Docker Compose-based Patroni/etcd/PostgreSQL cluster (Patroni v2.x, Postgres 16, etcd 3.x). The repository includes a docker-compose.yaml and helper scripts that spin up three etcd nodes and three Patroni-managed Postgres instances. Each scenario (disk saturation, etcd quorum loss, failover misfire, missing replication slots, replication lag) can be triggered.

Problem Detection

This rule detects critical, cluster-breaking events that appear in Patroni/etcd/Postgres logs, namely:

Disk Saturation (dd “No space left on device” + missing replication‐slot FATALs)
Etcd Quorum Loss (DNS resolution failures, MaxRetryErrors, EtcdConnectionFailed)
Failover Misfire (etcd connectivity errors preventing Patroni leader election)
Patroni Failover Events (missing replication‐slot FATALs during failover)
Replication Lag (paused replica, WAL buildup, resumed, final LSN sync)
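
To illustrate the kind of log matching described above, here is a minimal Python sketch that tags log lines with a failure category. It is not the CRE rule from this PR: the category names, the regular expressions, and the sample lines are assumptions made for illustration, built only from the strings named in the list ("No space left on device", missing replication-slot FATALs, MaxRetryError, EtcdConnectionFailed).

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical signature table: category name -> compiled pattern.
# The patterns are assumptions derived from the strings named in the PR
# description; they are not copied from the actual rule file.
SIGNATURES = {
    "disk_saturation": re.compile(r"No space left on device"),
    "missing_replication_slot": re.compile(
        r'FATAL.*replication slot "[^"]+" does not exist'
    ),
    "etcd_connectivity": re.compile(r"MaxRetryError|EtcdConnectionFailed"),
}


def classify(line: str) -> list[str]:
    """Return every category whose pattern matches the given log line."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(line)]


def summarize(lines: Iterable[str]) -> Counter:
    """Count how many lines matched each category."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(classify(line))
    return counts


# Synthetic sample lines written for this sketch (not taken from test.log).
SAMPLE = [
    "dd: error writing '/tmp/fill': No space left on device",
    'FATAL:  could not start WAL streaming: ERROR:  replication slot "pg2" does not exist',
    "urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='etcd1', port=2379)",
    "LOG:  checkpoint complete",
]

if __name__ == "__main__":
    for category, count in sorted(summarize(SAMPLE).items()):
        print(f"{category}: {count}")
```

A real rule would likely also correlate these matches within a time window and across sources (Patroni, etcd, Postgres) rather than counting isolated lines.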

Why This Matters

WAL Streaming Breakage → Replicas cannot receive WAL; cluster may lose sync entirely.
Disk Full → The primary cannot write new WAL, which likely stalls all writes and may corrupt replication slots.
Etcd Quorum Loss → Patroni cannot elect or promote a leader; the cluster may become read-only or risk split-brain.
Failover Misfire → Automated failover will not occur, risking prolonged downtime or data loss.

Together, these failure modes represent a high-severity outage in any HA Postgres cluster. Early detection reduces mean time to recovery and prevents cascading downstream failures.

Reproducible test setup (maintainers invited): GitHub Repo Link
Live CRE link: CRE Playground Link

rules/rules/cre-2025-0077/postgres-self-hosted.yaml

Saturn225 added 8 commits June 3, 2025 08:09

add cre-2024-0077

b25336b

add test.log

21343d0

fix format issues

06e68b5

update categories

287329e

update tags for consistency with existing ones

ad5d95a

Update categories.yaml

49275f1

update tags

5cd3a00

Merge branch 'main' into postgres

2400db3

algora-pbc bot added the 🙋 Bounty claim label Jun 3, 2025

algora-pbc bot mentioned this pull request Jun 3, 2025

[New Rule] Postgres (self-hosted): Reproduce A High-Severity Failure & Write a Detection Rule #44

Closed

tonymeehan reviewed Jun 3, 2025

rules/rules/cre-2025-0077/postgres-self-hosted.yaml (outdated, resolved)

Update postgres-self-hosted.yaml

0ff9764

Saturn225 requested a review from tonymeehan June 3, 2025 20:27

Merge branch 'main' into postgres

3203a29

tonymeehan approved these changes Jun 4, 2025

tonymeehan merged commit 0424517 into prequel-dev:main Jun 4, 2025
2 checks passed

Saturn225 deleted the postgres branch June 5, 2025 16:32
