@lawrencejones lawrencejones commented Apr 30, 2019

Add a collection of Prometheus metrics to the keeper. The metrics aim
to expose errors in the keeper sync loop, providing enough visibility
to detect when the sync is failing (and some insight into why).


This commit can be tested in the stolon-pgbouncer setup gocardless/stolon-pgbouncer#29

Screenshot 2019-04-30 at 16 58 23

@lawrencejones lawrencejones force-pushed the lawrence-add-keeper-metrics branch from 054cd12 to 2e5f4f1 Compare April 30, 2019 11:11
@lawrencejones lawrencejones changed the title [WIP] keeper Prometheus metrics keeper Prometheus metrics Apr 30, 2019
@lawrencejones lawrencejones force-pushed the lawrence-add-keeper-metrics branch from 2e5f4f1 to cc9a680 Compare April 30, 2019 15:45
@@ -75,6 +76,20 @@ func AddCommonFlags(cmd *cobra.Command, cfg *CommonConfig) {
}
}

var (
clusterIdentifier = prometheus.NewGaugeVec(
Author

This is the metric we (GoCardless) have been using to join various time series together. When every component reports it in a consistent fashion (see here for the stolon-pgbouncer definition), it becomes possible to join series from entirely different processes and infrastructure on the cluster_name and store_prefix labels.
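To make the join concrete, here is a minimal sketch in plain Go rather than PromQL. It groups identifier series from different components by their cluster_name and store_prefix labels. The `Series` type, the component names, and the label values are all hypothetical stand-ins, not stolon or client_golang types.

```go
package main

import (
	"fmt"
	"sort"
)

// Series is a hypothetical stand-in for one identifier series as scraped
// from a component. It is not a client_golang type.
type Series struct {
	Component string
	Labels    map[string]string
}

// clusterKey builds the join key from the two labels named in the comment
// above: cluster_name and store_prefix.
func clusterKey(s Series) string {
	return s.Labels["cluster_name"] + "|" + s.Labels["store_prefix"]
}

// groupByCluster collects the components that report the same
// (cluster_name, store_prefix) pair, which is what a label join relies on.
func groupByCluster(all []Series) map[string][]string {
	groups := make(map[string][]string)
	for _, s := range all {
		k := clusterKey(s)
		groups[k] = append(groups[k], s.Component)
	}
	for k := range groups {
		sort.Strings(groups[k])
	}
	return groups
}

func main() {
	// Label values here are made up for illustration.
	main1 := map[string]string{"cluster_name": "main", "store_prefix": "stolon/cluster"}
	other := map[string]string{"cluster_name": "other", "store_prefix": "stolon/cluster"}
	series := []Series{
		{Component: "keeper", Labels: main1},
		{Component: "pgbouncer", Labels: main1},
		{Component: "keeper", Labels: other},
	}
	for key, components := range groupByCluster(series) {
		fmt.Println(key, components)
	}
}
```

The join only works if every component uses exactly the same label names and values, which is why consistent reporting matters.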

@@ -1391,6 +1482,10 @@ func (p *PostgresKeeper) postgresKeeperSM(pctx context.Context) {
targetRole := db.Spec.Role
log.Debugw("target role", "targetRole", string(targetRole))

// Set metrics to power alerts about mismatched roles
Author

These metrics are particularly useful for powering an up-to-date view of keeper state. In a dashboard it looks like this:

Screenshot 2019-04-30 at 17 01 41

@@ -1770,6 +1877,7 @@ func (p *PostgresKeeper) generateHBA(cd *cluster.ClusterData, db *cluster.DB, on
func sigHandler(sigs chan os.Signal, cancel context.CancelFunc) {
	s := <-sigs
	log.Debugw("got signal", "signal", s)
	shutdownSeconds.SetToCurrentTime()
Author

Useful for detecting a pending shutdown.

@@ -42,6 +42,13 @@ const (
RoleStandby Role = "standby"
)

// Roles enumerates all possible Role values
Author

This list needs maintaining to ensure we can set all possible label values to 0 on stolon_keeper_{local,target}_role before setting our active metric to 1. This becomes clearer when you see the output of a single instance's metrics:

image
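A minimal sketch of that reset-then-set pattern, written without the Prometheus client so it runs on its own. The `RoleUndefined` and `RoleMaster` constants, the `role` label name, and the `roleGaugeValues` helper are assumptions for illustration; only `RoleStandby` and the `Roles` list appear in the diff above.

```go
package main

import "fmt"

// Role mirrors the type in the diff. Only RoleStandby is visible there;
// the other two constants are assumed for this example.
type Role string

const (
	RoleUndefined Role = "undefined"
	RoleMaster    Role = "master"
	RoleStandby   Role = "standby"
)

// Roles enumerates all possible Role values.
var Roles = []Role{RoleUndefined, RoleMaster, RoleStandby}

// roleGaugeValues returns the value to report for every role label:
// 0 for each known role, then 1 for the active one.
func roleGaugeValues(active Role) map[Role]float64 {
	values := make(map[Role]float64, len(Roles))
	for _, r := range Roles {
		values[r] = 0
	}
	values[active] = 1
	return values
}

func main() {
	values := roleGaugeValues(RoleMaster)
	for _, r := range Roles {
		// The "role" label name is an assumption.
		fmt.Printf("stolon_keeper_local_role{role=%q} %g\n", r, values[r])
	}
}
```

With client_golang, each pair would be applied through something like `GaugeVec.WithLabelValues(string(role)).Set(value)`. Without the zeroing step, a keeper that changes role would keep reporting 1 for its previous role as well.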

@lawrencejones lawrencejones force-pushed the lawrence-add-keeper-metrics branch from cc9a680 to b937766 Compare April 30, 2019 16:07
@lawrencejones lawrencejones changed the title keeper Prometheus metrics [WIP] keeper Prometheus metrics Apr 30, 2019
@lawrencejones lawrencejones changed the title [WIP] keeper Prometheus metrics keeper Prometheus metrics Apr 30, 2019
lawrencejones added a commit to gocardless/stolon-pgbouncer that referenced this pull request Apr 30, 2019
[^1]: gocardless/stolon#1

This commit is associated with an open PR [^1] to stolon that adds
Prometheus metrics to the keeper. The changes here include adding a
keeper dashboard that can visualise the keeper statuses and a couple of
essential alerts for keeper health.

We update the playground environment so developers can explore these
metrics and make use of the dashboard. This includes a new docker image,
which has been pushed to Docker hub.
@rnaveiras rnaveiras self-requested a review May 2, 2019 13:45
@rnaveiras rnaveiras left a comment

🥇 Amazing work; let's get this in.

@rnaveiras rnaveiras merged commit 4ab1b31 into master May 2, 2019
@rnaveiras rnaveiras deleted the lawrence-add-keeper-metrics branch May 2, 2019 14:17
@lawrencejones lawrencejones restored the lawrence-add-keeper-metrics branch May 2, 2019 15:32
lawrencejones added a commit that referenced this pull request May 9, 2019