keeper Prometheus metrics #639


Merged
1 commit merged into sorintlab:master on May 14, 2019

Conversation

lawrencejones (Contributor) commented on May 2, 2019

Add a collection of Prometheus metrics to the keeper. The metrics are
aimed to expose errors in the keeper sync loop, providing enough
visibility to detect when the sync is failing (and some insight into
why).


Hey stolon maintainers! I want to make this as easy as possible to review, so I'm going to try and write everything you need up-front here. If you have any questions just drop a comment and I'll do my best to answer them.

We're currently moving our Postgres cluster to stolon, and in doing so we've adapted some tooling we already had to enable zero-downtime planned failover. The tool that enables this is stolon-pgbouncer, whose readme explains its purpose and aims.

stolon-pgbouncer has native Prometheus metrics and bundles some Prometheus alerts and dashboards with it. If you clone this repo and run docker-compose up, then navigate to http://localhost:3000 (log in as admin/admin, skip password), you'll see two dashboards: one for the stolon-pgbouncer services and another for the stolon-keepers.

The master branch of stolon-pgbouncer will boot with a compiled stolon-keeper binary from this PR. This enables us to scrape and display the stolon-keeper metrics in a dashboard we bundle with stolon-pgbouncer (the PR that introduces this is here), which looks like this:

[screenshot: stolon-keeper dashboard bundled with stolon-pgbouncer]

It's meant to be a one-stop-shop for keeper state, while the alerts we've defined on these metrics aim to capture whenever keepers are misbehaving or Postgres is pending a restart.

We'd love to get these metrics upstreamed to benefit all stolon users, as well as obviously making our lives easier by avoiding maintaining a fork!

// cluster that any stolon component is associated with. Users can then join between
// various metric series for the same cluster without making assumptions about service
// discovery labels.
clusterIdentifier = prometheus.NewGaugeVec(
lawrencejones (author):

This is the metric we (GoCardless) have been using to join various time series together. By having all components report this in a consistent fashion (see here for the stolon-pgbouncer definition), it becomes possible to join series from totally different processes/infrastructures on the cluster_name and store_prefix labels.
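
For readers skimming the thread, here is a minimal sketch of the gauge being discussed, reconstructed from the diff context above. The setClusterIdentifier helper is hypothetical, and (as agreed further down) the store_prefix label was later dropped.

package cmd

import "github.com/prometheus/client_golang/prometheus"

// clusterIdentifier is a constant gauge (always 1) whose labels identify the
// cluster that any stolon component is associated with. Other series can then
// be joined against it without making assumptions about service discovery labels.
var clusterIdentifier = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "stolon_cluster_identifier",
		Help: "Set to 1, is labelled with store_prefix and cluster_name",
	},
	[]string{"store_prefix", "cluster_name"},
)

func init() {
	prometheus.MustRegister(clusterIdentifier)
}

// setClusterIdentifier is a hypothetical helper showing how a component would
// report its identity once at startup.
func setClusterIdentifier(storePrefix, clusterName string) {
	clusterIdentifier.WithLabelValues(storePrefix, clusterName).Set(1)
}

Because stolon-pgbouncer exports the same gauge, series from both components can be related by joining on the shared cluster_name label.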

sgotti (Member):

Ideally only the clustername should be needed (the prefix isn't used when running stolon inside k8s). Obviously people could set up multiple instances using the same cluster name and prefix but different store instances. But I think people that use multiple stolon instances should also use different cluster names if they want to distinguish them inside prometheus.

Another solution would be to add a "clusteruid" option, but I think this is redundant and should be covered by the "clustername".

lawrencejones (author):

But I think people that use multiple stolon instances should also use different cluster names if they want to distinguish them inside prometheus.

I trust your intuition on this a lot more than my own! Will nix the store prefix label then.

@@ -1391,6 +1480,10 @@ func (p *PostgresKeeper) postgresKeeperSM(pctx context.Context) {
targetRole := db.Spec.Role
log.Debugw("target role", "targetRole", string(targetRole))

// Set metrics to power alerts about mismatched roles
lawrencejones (author):

These metrics are particularly useful to power an up-to-date view of keeper state. It looks like this in a dashboard:

[screenshot: dashboard panel showing keeper local/target roles]

@@ -1770,6 +1875,7 @@ func (p *PostgresKeeper) generateHBA(cd *cluster.ClusterData, db *cluster.DB, on
func sigHandler(sigs chan os.Signal, cancel context.CancelFunc) {
s := <-sigs
log.Debugw("got signal", "signal", s)
shutdownSeconds.SetToCurrentTime()
lawrencejones (author):

Useful for detecting a pending shutdown.
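
As an illustration of the pattern, here is a minimal sketch of a shutdown-timestamp gauge wired into a signal handler; the metric name and the watchForShutdown helper are assumptions, not the PR's exact code.

package cmd

import (
	"os"
	"os/signal"
	"syscall"

	"github.com/prometheus/client_golang/prometheus"
)

// shutdownSeconds records the unix timestamp at which a shutdown signal was
// received; while it is zero, no shutdown is pending.
var shutdownSeconds = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "stolon_keeper_shutdown_seconds", // metric name assumed for illustration
	Help: "Unix timestamp at which a shutdown signal was received",
})

func init() {
	prometheus.MustRegister(shutdownSeconds)
}

// watchForShutdown blocks until a termination signal arrives, stamps the
// gauge, and then cancels the keeper's context.
func watchForShutdown(cancel func()) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	<-sigs
	shutdownSeconds.SetToCurrentTime() // alerts can key off a non-zero value
	cancel()
}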

@@ -42,6 +42,13 @@ const (
RoleStandby Role = "standby"
)

// Roles enumerates all possible Role values
lawrencejones (author):

This list needs maintaining to ensure we can set all possible label values to 0 on stolon_keeper_{local,target}_role before setting our active metric to 1. This becomes clearer when you see the output of a single instance's metrics:

[screenshot: Prometheus metrics output from a single keeper instance]
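
To make the zero-then-set pattern concrete, here is a rough sketch. The Role enumeration and the variable/helper names are illustrative rather than stolon's exact definitions; the metric name follows the stolon_keeper_local_role series mentioned above.

package cmd

import "github.com/prometheus/client_golang/prometheus"

// Role mirrors the keeper's role type; this enumeration is illustrative and
// must track whatever values the cluster package actually defines.
type Role string

const (
	RoleMaster  Role = "master"
	RoleStandby Role = "standby"
)

// Roles enumerates all possible Role values.
var Roles = []Role{RoleMaster, RoleStandby}

var localRole = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "stolon_keeper_local_role",
		Help: "Set to 1 for the role the local Postgres instance currently reports",
	},
	[]string{"role"},
)

func init() {
	prometheus.MustRegister(localRole)
}

// setLocalRole zeroes every known role label before marking the active one,
// so a role change never leaves two series both reporting 1.
func setLocalRole(active Role) {
	for _, role := range Roles {
		localRole.WithLabelValues(string(role)).Set(0)
	}
	localRole.WithLabelValues(string(active)).Set(1)
}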

sgotti (Member) commented on May 2, 2019

@lawrencejones Thanks a lot for the PR and the detailed description. I'll review it in the next few days.

lawrencejones (author):

Hey @sgotti, I just realised the semaphoreci tests have failed here. Is that normal or should I debug that?

sgotti (Member) left a comment:

@lawrencejones Overall LGTM! Some review comments inline.

Hey @sgotti, I just realised the semaphoreci tests have failed here. Is that normal or should I debug that?

Don't worry, our tests are heavily parallelized and quite I/O hungry, and it looks like lately the Semaphore machines are less powerful, causing them to sporadically time out.

prometheus.MustRegister(sleepInterval)
prometheus.MustRegister(shutdownSeconds)
}

sgotti (Member):

The keeper.go file is just quite big. Can you please move this block inside cmd/keeper/cmd/prometheus.go?

lawrencejones (author):

Absolutely!

@lawrencejones lawrencejones force-pushed the lawrence-add-keeper-metrics branch from b937766 to 433d1a9 Compare May 8, 2019 12:16
@@ -0,0 +1,98 @@
// Copyright 2017 Sorint.lab
lawrencejones (author):

I went with metrics.go instead of prometheus.go as the latter gave me the feeling of Prometheus integration, like setting up the metrics endpoint rather than defining our metrics.

Happy to go with either though!
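
For orientation, a minimal sketch of what such a dedicated metrics file might look like; the collector shown and its Name/Help strings are assumptions, not the PR's exact code.

// cmd/keeper/cmd/metrics.go (sketch of the file layout being discussed)
package cmd

import "github.com/prometheus/client_golang/prometheus"

// sleepInterval is one of the keeper collectors; its Name and Help here are
// placeholders for illustration.
var sleepInterval = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "stolon_keeper_sleep_interval_seconds",
	Help: "Configured sleep interval between keeper sync loops",
})

// init keeps metric registration out of the already large keeper.go; every
// collector defined in this file is registered here, mirroring the
// MustRegister calls quoted in the diff above.
func init() {
	prometheus.MustRegister(sleepInterval)
}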

sgotti (Member):

👍

AMyltsev commented on May 8, 2019

I'm not sure this is the right place to ask, but I'd like to ask @lawrencejones about metrics for the proxy and sentinels. Do you have plans to implement them?

lawrencejones (author):

@AMyltsev we're definitely interested in doing this, but wanted to start with the keepers first as the most impactful area to add visibility. We had keepers failing to start or getting stuck in an initialisation step that we'd otherwise not hear about, and that situation wasn't acceptable for us!

If you're interested in tackling the proxy/sentinel metrics yourself as a follow-up to this PR, I'd happily help review. I think this PR can be merged independently of the other components though, and it will benefit the follow-up work by setting an example of how you might approach them.

cmd/common.go Outdated
clusterIdentifier = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "stolon_cluster_identifier",
Help: "Set to 1, is labelled with store_prefix and cluster_name",
sgotti (Member):

"Set to 1, is labelled with the cluster_name",

lawrencejones (author):

🤦‍♂

@@ -0,0 +1,98 @@
// Copyright 2017 Sorint.lab
sgotti (Member):

👍

@lawrencejones lawrencejones force-pushed the lawrence-add-keeper-metrics branch from 433d1a9 to dc47d5a Compare May 9, 2019 16:46
Add a collection of Prometheus metrics to the keeper. The metrics are
aimed to expose errors in the keeper sync loop, providing enough
visibility to detect when the sync is failing (and some insight into
why).
@lawrencejones lawrencejones force-pushed the lawrence-add-keeper-metrics branch from dc47d5a to 71d1095 Compare May 10, 2019 13:50
lawrencejones (author):

I think all changes are applied now @sgotti, and I've freshly rebased against master. Thanks for all your help!

lawrencejones (author):

Hey @sgotti, I think everything is fixed here. I don't mean to bother you, but if we're good, merging this would make it much easier for me to manage follow-up PRs for sentinel (and other) metrics.

Is there anything else you'd like here?

sgotti (Member) commented on May 14, 2019

@lawrencejones LGTM! Merging.

@sgotti sgotti merged commit a27bcae into sorintlab:master May 14, 2019
@lawrencejones lawrencejones deleted the lawrence-add-keeper-metrics branch May 14, 2019 16:26
@sgotti sgotti added this to the v0.14.0 milestone Jun 6, 2019