scraping/alerting: Fix issues with duplicate metric registration #13316
Conversation
@@ -282,9 +303,10 @@ func (m *Manager) registerProviders(cfgs discovery.Configs, setName string) int
 			}
 		}
 		typ := cfg.Name()
+		reg := prometheus.WrapRegistererWith(prometheus.Labels{"sd_type": m.name, "job_name": setName}, m.registerer)
I'm not sure if those label names are the most appropriate?
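For context, `prometheus.WrapRegistererWith` from client_golang returns a `Registerer` that attaches the given constant labels to everything registered through it, which is how the line above stamps `sd_type` and `job_name` onto every SD metric for that job. A minimal, self-contained sketch (the counter name is made up for illustration):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// Everything registered through the wrapped registerer carries the
	// extra constant labels.
	wrapped := prometheus.WrapRegistererWith(
		prometheus.Labels{"sd_type": "scrape", "job_name": "node"},
		reg,
	)

	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_sd_refresh_failures_total", // made-up metric name
		Help: "Illustrative counter.",
	})
	wrapped.MustRegister(c)
	c.Inc()

	// Gathering shows the injected labels on the resulting series.
	mfs, _ := reg.Gather()
	for _, mf := range mfs {
		for _, m := range mf.GetMetric() {
			for _, lp := range m.GetLabel() {
				fmt.Printf("%s=%q ", lp.GetName(), lp.GetValue())
			}
			fmt.Println()
		}
	}
}
```

Wrapping a registerer like this once per scrape job is also why the cardinality concern raised below applies: every SD metric is duplicated once per `job_name` value.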
Thanks for catching this. The most appropriate label name would probably be just […]. However, I'm not sure if it is desired to partition all the SD metrics by job. Maybe it's even an improvement, but maybe it's not useful at all. I would like to hear more opinions about this. @roidelapluie @juliusv (and anyone who feels qualified) WDYT?
This is not an acceptable solution as it will create a lot of cardinality.
The goal of #13023 was to allow users who import Prometheus SD as a library to pass their own registry. If Prometheus is to do this while still only having one metric series for every SD instance, then I am not sure how to do it cleanly. Do you want to revert #13023 while we discuss a solution? Or would you rather we find a solution straight away? One solution could be to have an […]
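For reference, the pattern #13023 was aiming for is, in general terms, a constructor that registers its metrics on a caller-supplied `prometheus.Registerer` instead of the global default registry. A rough sketch of that pattern with hypothetical type and metric names (not the actual Prometheus SD API):

```go
package discoveryexample

import "github.com/prometheus/client_golang/prometheus"

// Discoverer is a hypothetical SD implementation whose metrics live on
// whatever registry the caller provides.
type Discoverer struct {
	refreshFailures prometheus.Counter
}

// NewDiscoverer registers this instance's metrics with the given registerer.
// If two instances register the same metric name on the same registerer
// (e.g. two scrape jobs using the same SD mechanism), registration fails,
// which is exactly the duplicate-registration problem discussed here.
func NewDiscoverer(reg prometheus.Registerer) (*Discoverer, error) {
	d := &Discoverer{
		refreshFailures: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "example_sd_refresh_failures_total", // hypothetical name
			Help: "Number of failed refreshes.",
		}),
	}
	if err := reg.Register(d.refreshFailures); err != nil {
		return nil, err
	}
	return d, nil
}
```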
@roidelapluie thanks for your response. @ptodev I guess that clarifies things: if the cardinality increase is a problem, that comes on top of the problem of changing the labels of long-established metrics. I would say we should try not to have separate metrics per scrape job; in other words, whatever changes we make in the code behind the scenes, the metrics shouldn't change.

So far, we acted under the assumption that there will be one SD instance per kind of SD, not per job, but that's wrong, see above. So the conclusion is that we truly have to share metrics between SD instances. And for sharing, we either have to manage them outside of the SD instance (as you have described in your last comment), or we go back and reconsider using the […].

A problem with the latter approach is that you cannot really deregister metrics anymore (but de-registration is probably a problem with shared metrics anyway, and maybe we should just not de-register at all). The former approach might be a bit cleaner in general, but then we are back to managing the whole lifecycle of the metrics outside of the SD instances, which is particularly annoying if you use the SD code as a library and have to remember to always initialize the correct set of metrics for each SD type.
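As a sketch of the first option above (managing shared metrics outside the SD instances), the metric set could be created and registered once per SD type and then handed to every instance; the names below are assumptions for illustration, not the eventual Prometheus implementation:

```go
package discoveryexample

import "github.com/prometheus/client_golang/prometheus"

// SDMetrics is a hypothetical shared metric set. It is created and
// registered exactly once, no matter how many scrape jobs end up using
// the SD mechanism it belongs to.
type SDMetrics struct {
	RefreshFailures *prometheus.CounterVec
}

func NewSDMetrics(reg prometheus.Registerer) (*SDMetrics, error) {
	m := &SDMetrics{
		RefreshFailures: prometheus.NewCounterVec(prometheus.CounterOpts{
			Name: "example_sd_refresh_failures_total", // hypothetical name
			Help: "Number of failed refreshes.",
		}, []string{"mechanism"}),
	}
	if err := reg.Register(m.RefreshFailures); err != nil {
		return nil, err
	}
	return m, nil
}

// Each SD instance receives the shared set instead of registering its own,
// so instantiating the same SD for several jobs cannot register twice.
type SharedMetricsDiscoverer struct {
	metrics *SDMetrics
}

func NewSharedMetricsDiscoverer(m *SDMetrics) *SharedMetricsDiscoverer {
	return &SharedMetricsDiscoverer{metrics: m}
}
```

This also illustrates the lifecycle cost mentioned above: whoever owns the `SDMetrics` has to remember to create the right set for each SD type before constructing any discoverers.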
Thank you for the feedback! Let's move the discussion to #13375, which is a different attempt at fixing this issue.
This PR will make it so that each service discovery (SD) mechanism reports its own debug metric series. Currently, on the latest stable Prometheus release, if a Prometheus process runs more than one instance of the same SD mechanism, it is not possible to distinguish between the metrics of the different instances.
Fixes #13312
Unfortunately, #13023 introduced a bug whereby Prometheus would crash if two scrape jobs used the same service discovery mechanism.
When I was working on #13023, I was under the assumption that Prometheus can't instantiate more than one instance of the same SD. I therefore missed the most obvious test case: the one where more than one scrape job uses a particular SD. I also didn't test Alertmanager SDs. This led to the bug described in #13312.
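For background on why a duplicate registration crashes the process: in client_golang, registering two collectors with identical descriptors on the same registry returns an `AlreadyRegisteredError`, and `MustRegister` turns that error into a panic. A minimal sketch (the metric name is made up):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	newCounter := func() prometheus.Counter {
		return prometheus.NewCounter(prometheus.CounterOpts{
			Name: "example_sd_refresh_failures_total", // made-up name
			Help: "Illustrative counter.",
		})
	}

	// The first SD instance registers its metric without problems.
	if err := reg.Register(newCounter()); err != nil {
		fmt.Println("unexpected:", err)
	}

	// A second instance of the same SD (e.g. for another scrape job)
	// registers a metric with the same name and labels and fails.
	if err := reg.Register(newCounter()); err != nil {
		fmt.Println(err) // AlreadyRegisteredError
	}

	// reg.MustRegister(newCounter()) would panic with the same error,
	// which is what the crash looks like in practice.
}
```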
I tested this bugfix locally using the config file below.
Prometheus config
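A minimal sketch of a config that exercises this case, with hypothetical job names and file paths (two scrape jobs that both use file-based SD, so the same SD mechanism is instantiated twice); this is not the exact file used for the test above:

```yaml
scrape_configs:
  - job_name: "file-sd-one"
    file_sd_configs:
      - files:
          - "targets-one.yaml" # hypothetical path
  - job_name: "file-sd-two"
    file_sd_configs:
      - files:
          - "targets-two.yaml" # hypothetical path
```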
I tested both with `--enable-feature=new-service-discovery-manager` and without. I also forced several config reloads using the `pkill -SIGHUP -i prometheus` command, to make sure Prometheus doesn't crash and that no errors are logged. After the change, I can see the expected metrics on `http://localhost:9090/metrics`.

cc @bwplotka @beorn7