keeper Prometheus metrics #639
Conversation
// cluster that any stolon component is associated with. Users can then join between
// various metric series for the same cluster without making assumptions about service
// discovery labels.
clusterIdentifier = prometheus.NewGaugeVec(
This is the metric we (GoCardless) have been using to join various time series together. By having all components report it in a consistent fashion (see here for the stolon-pgbouncer definition), it becomes possible to join series from totally different processes/infrastructures on the cluster_name and store_prefix labels.
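To make the join pattern concrete, here is a minimal sketch, assuming the gauge is set exactly once at keeper startup; the helper and its arguments are hypothetical, while the metric name, Help text and labels follow the snippets shown in this review:

```go
package cmd

import "github.com/prometheus/client_golang/prometheus"

// clusterIdentifier is a constant-value gauge whose only job is to carry the
// identifying labels, so that other series can be joined against it.
var clusterIdentifier = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "stolon_cluster_identifier",
		Help: "Set to 1, is labelled with store_prefix and cluster_name",
	},
	[]string{"cluster_name", "store_prefix"}, // label order is an assumption
)

// exposeClusterIdentifier would run once the keeper has parsed its flags
// (hypothetical helper, not the PR's actual call site).
func exposeClusterIdentifier(clusterName, storePrefix string) {
	clusterIdentifier.WithLabelValues(clusterName, storePrefix).Set(1)
}
```

Every series exported by the same process carries the same instance labels added at scrape time, so joining on cluster_name (the label the review below settles on) is enough to group metrics from different components.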
Ideally only the cluster name should be needed (the prefix isn't used when running stolon inside k8s). Obviously people could set up multiple instances using the same cluster name and prefix with different store instances. But I think people who use multiple stolon instances should also use different cluster names if they want to distinguish them inside Prometheus.
Another solution would be to add a "clusteruid" option, but I think this is redundant and should be covered by "clustername".
> But I think people who use multiple stolon instances should also use different cluster names if they want to distinguish them inside Prometheus.
I trust your intuition on this a lot more than my own! Will nix the store prefix label then.
@@ -1391,6 +1480,10 @@ func (p *PostgresKeeper) postgresKeeperSM(pctx context.Context) {
	targetRole := db.Spec.Role
	log.Debugw("target role", "targetRole", string(targetRole))

	// Set metrics to power alerts about mismatched roles
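As an illustration of what such metrics could look like, a hedged sketch; the gauge name and helper below are hypothetical rather than what this PR actually defines:

```go
package cmd

import "github.com/prometheus/client_golang/prometheus"

// roleMismatch is a hypothetical metric name, used here for illustration only.
var roleMismatch = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "stolon_keeper_role_mismatch",
	Help: "Set to 1 when the local Postgres role differs from the target role in the cluster spec",
})

// setRoleMetrics compares the observed role against the target role from the
// cluster spec and exposes the result as a 0/1 gauge, which is easy to alert on.
func setRoleMetrics(localRole, targetRole string) {
	if localRole != targetRole {
		roleMismatch.Set(1)
	} else {
		roleMismatch.Set(0)
	}
}
```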
@@ -1770,6 +1875,7 @@ func (p *PostgresKeeper) generateHBA(cd *cluster.ClusterData, db *cluster.DB, on
func sigHandler(sigs chan os.Signal, cancel context.CancelFunc) {
	s := <-sigs
	log.Debugw("got signal", "signal", s)
	shutdownSeconds.SetToCurrentTime()
Useful for detecting a pending shutdown.
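The diff above only shows the call site; a sketch of how the gauge itself might be defined, with an assumed metric name:

```go
package cmd

import "github.com/prometheus/client_golang/prometheus"

// shutdownSeconds stays at zero until the keeper receives a shutdown signal;
// SetToCurrentTime() in sigHandler then stamps it with the Unix time of the
// signal, so any non-zero value means a shutdown is pending.
var shutdownSeconds = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "stolon_keeper_shutdown_seconds", // assumed name
	Help: "Unix timestamp at which the keeper was asked to shut down, 0 if no shutdown is pending",
})
```

An alert could then fire if the value stays non-zero for longer than a normal restart should take.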
@@ -42,6 +42,13 @@ const (
	RoleStandby Role = "standby"
)

// Roles enumerates all possible Role values
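Presumably the new variable looks something like the sketch below; the package name, the string value of RoleMaster, and the exact membership of the slice are assumptions based on the const block above:

```go
package cluster

// Role mirrors the type already defined in this file; it is repeated here only
// so the sketch is self-contained.
type Role string

const (
	RoleMaster  Role = "master" // assumed value
	RoleStandby Role = "standby"
)

// Roles enumerates all possible Role values, e.g. so per-role metric series can
// be pre-initialised at startup. The PR's actual slice may contain more entries.
var Roles = []Role{RoleMaster, RoleStandby}
```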
@lawrencejones Thanks a lot for the PR and the detailed description. I'll review it in the next few days.
Hey @sgotti, I just realised the semaphoreci tests have failed here. Is that normal or should I debug that?
@lawrencejones Overall LGTM! Some review comments inline.
> Hey @sgotti, I just realised the semaphoreci tests have failed here. Is that normal or should I debug that?
Don't worry, our tests are heavily parallelized and quite I/O hungry, and it looks like lately the Semaphore machines are less powerful, causing them to sporadically time out.
cmd/keeper/cmd/keeper.go
	prometheus.MustRegister(sleepInterval)
	prometheus.MustRegister(shutdownSeconds)
}
The keeper.go file is already quite big. Can you please move this block inside cmd/keeper/cmd/prometheus.go?
Absolutely!
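Something along these lines, as a rough sketch of the dedicated file; only one representative metric is shown and its name is assumed:

```go
// cmd/keeper/cmd/metrics.go (sketch)
package cmd

import "github.com/prometheus/client_golang/prometheus"

// sleepInterval is one of the keeper metrics moved out of keeper.go; the metric
// name here is illustrative.
var sleepInterval = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "stolon_keeper_sleep_interval_seconds",
	Help: "Configured sleep interval between keeper sync loops",
})

// init registers everything defined in this file, so keeper.go needs no changes
// beyond deleting the moved block.
func init() {
	prometheus.MustRegister(sleepInterval)
	// ...plus the rest of the keeper metrics (shutdownSeconds, clusterIdentifier, ...)
}
```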
Force-pushed from b937766 to 433d1a9
@@ -0,0 +1,98 @@
// Copyright 2017 Sorint.lab
I went with metrics.go instead of prometheus.go, as the latter gave me the feeling of Prometheus integration, like setting up the metrics endpoint, rather than defining our metrics. Happy to go with either though!
👍
I am not sure if I'm writing in the right place, but I would like to ask @lawrencejones about metrics for the proxy and sentinels. Do you have plans to implement them?
@AMyltsev we're definitely interested in doing this, but wanted to start with the keepers first as the most impactful area to add visibility. We had keepers failing to start or getting stuck in an initialisation step that we'd otherwise not hear about, and that situation wasn't acceptable for us! If you're interested in tackling the proxy/sentinel metrics yourself as a follow-up to this PR, then I'd happily help review. I think this PR can be merged independently of the other components, though, and it will benefit the following work by setting an example of how you might approach them.
cmd/common.go
clusterIdentifier = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "stolon_cluster_identifier",
		Help: "Set to 1, is labelled with store_prefix and cluster_name",
"Set to 1, is labelled with the cluster_name",
🤦‍♂️
@@ -0,0 +1,98 @@
// Copyright 2017 Sorint.lab
👍
Force-pushed from 433d1a9 to dc47d5a
Add a collection of Prometheus metrics to the keeper. The metrics aim to expose errors in the keeper sync loop, providing enough visibility to detect when the sync is failing (and some insight into why).
Force-pushed from dc47d5a to 71d1095
I think all changes are applied now @sgotti, and I've freshly rebased against master. Thanks for all your help!
Hey @sgotti, I think everything is fixed here. I don't mean to bother you, but if we're good, merging this would make it much easier for me to manage follow-up PRs for sentinel/etc. metrics. Is there anything else you'd like here?
@lawrencejones LGTM! Merging.
Add a collection of Prometheus metrics to the keeper. The metrics aim to expose errors in the keeper sync loop, providing enough visibility to detect when the sync is failing (and some insight into why).
Hey stolon maintainers! I want to make this as easy as possible to review, so I'm going to try and write everything you need up-front here. If you have any questions just drop a comment and I'll do my best to answer them.
We're currently moving our Postgres cluster to use stolon, and in doing so we've adapted some tooling we already had to enable zero-downtime planned failover. The tool that allows this is called stolon-pgbouncer, which explains its purpose and aims in its readme.
stolon-pgbouncer has native Prometheus metrics and ships with some Prometheus alerts and dashboards. If you clone this repo and run docker-compose up, then navigate to http://localhost:3000 (login as admin/admin, skip password), you'll see two dashboards: one for the stolon-pgbouncer services and another for the stolon-keepers.

The master branch of stolon-pgbouncer will boot with a compiled stolon-keeper binary from this PR. This enables us to scrape and display the stolon-keeper metrics in a dashboard we bundle with stolon-pgbouncer (PR that introduces this is here), which looks like this:
It's meant to be a one-stop-shop for keeper state, while the alerts we've defined on these metrics aim to capture whenever keepers are misbehaving or Postgres is pending a restart.
We'd love to get these metrics upstreamed to benefit all stolon users, as well as obviously making our lives easier by avoiding maintaining a fork!