keeper Prometheus metrics #639


Merged
1 commit merged into sorintlab:master on May 14, 2019

Conversation

lawrencejones (Contributor) commented on May 2, 2019

Add a collection of Prometheus metrics to the keeper. The metrics are
aimed to expose errors in the keeper sync loop, providing enough
visibility to detect when the sync is failing (and some insight into
why).


Hey stolon maintainers! I want to make this as easy as possible to review, so I'm going to try and write everything you need up-front here. If you have any questions just drop a comment and I'll do my best to answer them.

We're currently moving our Postgres cluster to stolon, and in doing so we've adapted some tooling we already had to enable zero-downtime planned failover. The tool that enables this is stolon-pgbouncer, whose readme explains its purpose and aims.

stolon-pgbouncer has native Prometheus metrics and bundles some Prometheus alerts and dashboards with it. If you clone this repo and run docker-compose up, then navigate to http://localhost:3000 (log in as admin/admin, skip password), you'll see two dashboards: one for the stolon-pgbouncer services and another for the stolon-keepers.

The master branch of stolon-pgbouncer will boot with a compiled stolon-keeper binary from this PR. This enables us to scrape and display the stolon-keeper metrics in a dashboard we bundle with stolon-pgbouncer (the PR that introduces this is here), which looks like this:

[screenshot: stolon-keeper dashboard bundled with stolon-pgbouncer]

It's meant to be a one-stop-shop for keeper state, while the alerts we've defined on these metrics aim to capture whenever keepers are misbehaving or Postgres is pending a restart.

We'd love to get these metrics upstreamed to benefit all stolon users, as well as obviously making our lives easier by avoiding maintaining a fork!

// cluster that any stolon component is associated with. Users can then join between
// various metric series for the same cluster without making assumptions about service
// discovery labels.
clusterIdentifier = prometheus.NewGaugeVec(
lawrencejones (author):

This is the metric we (GoCardless) have been using to join various time series together. By having all components report this in a consistent fashion (see here for the stolon-pgbouncer definition), it becomes possible to join series from totally different processes/infrastructures on the cluster_name and store_prefix labels.
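
For readers skimming the thread, here is a minimal sketch of the gauge being discussed, reconstructed from the diff context above. The setClusterIdentifier helper is hypothetical, and (as agreed further down) the store_prefix label was later dropped.

package cmd

import "github.com/prometheus/client_golang/prometheus"

// clusterIdentifier is a constant gauge (always 1) whose labels identify the
// cluster that any stolon component is associated with. Other series can then
// be joined against it without making assumptions about service discovery labels.
var clusterIdentifier = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "stolon_cluster_identifier",
		Help: "Set to 1, is labelled with store_prefix and cluster_name",
	},
	[]string{"store_prefix", "cluster_name"},
)

func init() {
	prometheus.MustRegister(clusterIdentifier)
}

// setClusterIdentifier is a hypothetical helper showing how a component would
// report its identity once at startup.
func setClusterIdentifier(storePrefix, clusterName string) {
	clusterIdentifier.WithLabelValues(storePrefix, clusterName).Set(1)
}

Because stolon-pgbouncer exports the same gauge, series from both components can be related by joining on the shared cluster_name label.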

sgotti (Member):

Ideally only the clustername should be needed (the prefix isn't used when running stolon inside k8s). Obviously people could set up multiple instances using the same cluster name and prefix but different store instances. But I think people that use multiple stolon instances should also use different cluster names if they want to distinguish them inside prometheus.

Another solution would be to add a "clusteruid" option, but I think this is redundant and should be covered by the "clustername".

lawrencejones (author):

But I think people that use multiple stolon instances should also use different cluster names if they want to distinguish them inside prometheus.

I trust your intuition on this a lot more than my own! Will nix the store prefix label then.

@@ -1391,6 +1480,10 @@ func (p *PostgresKeeper) postgresKeeperSM(pctx context.Context) {
targetRole := db.Spec.Role
log.Debugw("target role", "targetRole", string(targetRole))

// Set metrics to power alerts about mismatched roles
lawrencejones (author):

These metrics are particularly useful to power an up-to-date view of keeper state. It looks like this in a dashboard:

[screenshot: dashboard panel showing keeper local/target roles]

@@ -1770,6 +1875,7 @@ func (p *PostgresKeeper) generateHBA(cd *cluster.ClusterData, db *cluster.DB, on
func sigHandler(sigs chan os.Signal, cancel context.CancelFunc) {
s := <-sigs
log.Debugw("got signal", "signal", s)
shutdownSeconds.SetToCurrentTime()
lawrencejones (author):

Useful for detecting a pending shutdown.
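
As an illustration of the pattern, here is a minimal sketch of a shutdown-timestamp gauge wired into a signal handler; the metric name and the watchForShutdown helper are assumptions, not the PR's exact code.

package cmd

import (
	"os"
	"os/signal"
	"syscall"

	"github.com/prometheus/client_golang/prometheus"
)

// shutdownSeconds records the unix timestamp at which a shutdown signal was
// received; while it is zero, no shutdown is pending.
var shutdownSeconds = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "stolon_keeper_shutdown_seconds", // metric name assumed for illustration
	Help: "Unix timestamp at which a shutdown signal was received",
})

func init() {
	prometheus.MustRegister(shutdownSeconds)
}

// watchForShutdown blocks until a termination signal arrives, stamps the
// gauge, and then cancels the keeper's context.
func watchForShutdown(cancel func()) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	<-sigs
	shutdownSeconds.SetToCurrentTime() // alerts can key off a non-zero value
	cancel()
}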

@@ -42,6 +42,13 @@ const (
RoleStandby Role = "standby"
)

// Roles enumerates all possible Role values
lawrencejones (author):

This list needs maintaining to ensure we can set all possible label values to 0 on stolon_keeper_{local,target}_role before setting our active metric to 1. This becomes clearer when you see the output of a single instance's metrics:

[screenshot: Prometheus metrics output from a single keeper instance]
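
To make the zero-then-set pattern concrete, here is a rough sketch. The Role enumeration and the variable/helper names are illustrative rather than stolon's exact definitions; the metric name follows the stolon_keeper_local_role series mentioned above.

package cmd

import "github.com/prometheus/client_golang/prometheus"

// Role mirrors the keeper's role type; this enumeration is illustrative and
// must track whatever values the cluster package actually defines.
type Role string

const (
	RoleMaster  Role = "master"
	RoleStandby Role = "standby"
)

// Roles enumerates all possible Role values.
var Roles = []Role{RoleMaster, RoleStandby}

var localRole = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "stolon_keeper_local_role",
		Help: "Set to 1 for the role the local Postgres instance currently reports",
	},
	[]string{"role"},
)

func init() {
	prometheus.MustRegister(localRole)
}

// setLocalRole zeroes every known role label before marking the active one,
// so a role change never leaves two series both reporting 1.
func setLocalRole(active Role) {
	for _, role := range Roles {
		localRole.WithLabelValues(string(role)).Set(0)
	}
	localRole.WithLabelValues(string(active)).Set(1)
}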

sgotti (Member) commented on May 2, 2019

@lawrencejones Thanks a lot for the PR and the detailed description. I'll review it in the next few days.

lawrencejones (author):

Hey @sgotti, I just realised the semaphoreci tests have failed here. Is that normal or should I debug that?

sgotti (Member) left a comment:

@lawrencejones Overall LGTM! Some review comments inline.

Hey @sgotti, I just realised the semaphoreci tests have failed here. Is that normal or should I debug that?

Don't worry, our tests are heavily parallelized and quite I/O hungry, and it looks like lately the Semaphore machines are less powerful, causing them to sporadically time out.

prometheus.MustRegister(sleepInterval)
prometheus.MustRegister(shutdownSeconds)
}

sgotti (Member):

The keeper.go file is just quite big. Can you please move this block inside cmd/keeper/cmd/prometheus.go?

lawrencejones (author):

Absolutely!

@lawrencejones lawrencejones force-pushed the lawrence-add-keeper-metrics branch from b937766 to 433d1a9 Compare May 8, 2019 12:16
@@ -0,0 +1,98 @@
// Copyright 2017 Sorint.lab
lawrencejones (author):

I went with metrics.go instead of prometheus.go as the latter gave me the feeling of Prometheus integration, like setting up the metrics endpoint rather than defining our metrics.

Happy to go with either though!
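
For orientation, a minimal sketch of what such a dedicated metrics file might look like; the collector shown and its Name/Help strings are assumptions, not the PR's exact code.

// cmd/keeper/cmd/metrics.go (sketch of the file layout being discussed)
package cmd

import "github.com/prometheus/client_golang/prometheus"

// sleepInterval is one of the keeper collectors; its Name and Help here are
// placeholders for illustration.
var sleepInterval = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "stolon_keeper_sleep_interval_seconds",
	Help: "Configured sleep interval between keeper sync loops",
})

// init keeps metric registration out of the already large keeper.go; every
// collector defined in this file is registered here, mirroring the
// MustRegister calls quoted in the diff above.
func init() {
	prometheus.MustRegister(sleepInterval)
}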

sgotti (Member):

👍

AMyltsev commented on May 8, 2019

I'm not sure this is the right place to ask, but I'd like to ask @lawrencejones about metrics for the proxy and sentinels. Do you have plans to implement them?

lawrencejones (author):

@AMyltsev we're definitely interested in doing this, but wanted to start with the keepers first as the most impactful area to add visibility. We had keepers failing to start or getting stuck in an initialisation step that we'd otherwise not hear about, and that situation wasn't acceptable for us!

If you're interested in tackling the proxy/sentinel metrics yourself as a follow-up to this PR, I'd happily help review. I think this PR can be merged independently of the other components though, and it will benefit the follow-up work by setting an example of how you might approach them.

cmd/common.go Outdated
clusterIdentifier = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "stolon_cluster_identifier",
Help: "Set to 1, is labelled with store_prefix and cluster_name",
sgotti (Member):

"Set to 1, is labelled with the cluster_name",

lawrencejones (author):

🤦‍♂

@@ -0,0 +1,98 @@
// Copyright 2017 Sorint.lab
sgotti (Member):

👍

@lawrencejones lawrencejones force-pushed the lawrence-add-keeper-metrics branch from 433d1a9 to dc47d5a Compare May 9, 2019 16:46
Add a collection of Prometheus metrics to the keeper. The metrics are
aimed to expose errors in the keeper sync loop, providing enough
visibility to detect when the sync is failing (and some insight into
why).
@lawrencejones lawrencejones force-pushed the lawrence-add-keeper-metrics branch from dc47d5a to 71d1095 Compare May 10, 2019 13:50
lawrencejones (author):

I think all changes are applied now @sgotti, and I've freshly rebased against master. Thanks for all your help!

lawrencejones (author):

Hey @sgotti, I think everything is fixed here. I don't mean to bother you, but if we're good, merging this would make it much easier for me to manage follow-up PRs for sentinel (and other) metrics.

Is there anything else you'd like here?

sgotti (Member) commented on May 14, 2019

@lawrencejones LGTM! Merging.

@sgotti sgotti merged commit a27bcae into sorintlab:master May 14, 2019
@lawrencejones lawrencejones deleted the lawrence-add-keeper-metrics branch May 14, 2019 16:26
@sgotti sgotti added this to the v0.14.0 milestone Jun 6, 2019