
Conversation

@ptodev (Contributor) commented Dec 20, 2023

This PR makes each service discovery (SD) instance report its own debug metric series. Currently, on the latest stable Prometheus release, if a Prometheus process runs more than one instance of the same SD mechanism, it is not possible to distinguish between the metrics of the different instances.

Fixes #13312

Unfortunately, #13023 introduced a bug whereby if two scrape jobs use the same service discovery, Prometheus would crash.

While working on #13023, I assumed that Prometheus can't instantiate more than one instance of the same SD. I therefore missed the most obvious test case - the one where more than one scrape job uses a particular SD. I also didn't test Alertmanager SDs. This led to the bug described in #13312.
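For context, here is a minimal standalone sketch (not the actual Prometheus code) of why a second instance of the same SD crashes the process once both instances register identical collectors on one registry:

package main

import "github.com/prometheus/client_golang/prometheus"

func newFileSDCounter() prometheus.Counter {
	// Both SD instances build a collector with an identical descriptor.
	return prometheus.NewCounter(prometheus.CounterOpts{
		Name: "prometheus_sd_file_read_errors_total",
		Help: "The number of File-SD read errors.",
	})
}

func main() {
	reg := prometheus.NewRegistry()

	reg.MustRegister(newFileSDCounter()) // first SD instance: fine
	reg.MustRegister(newFileSDCounter()) // second SD instance: panics with
	// "duplicate metrics collector registration attempted"
}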

I tested this bugfix locally using the config file below.

Prometheus config
global:
  scrape_interval:     16s 

scrape_configs:
  - job_name: 'testing'
    scrape_interval: 6s

    file_sd_configs:
      - files: 
          - "/Users/paulintodev/Desktop/targets.json"
        refresh_interval: 4m
          
  - job_name: 'testing2'
    scrape_interval: 6s

    file_sd_configs:
      - files: 
          - "/Users/paulintodev/Desktop/targets2.json"
        refresh_interval: 4m
          
  - job_name: 'config-0'
    scrape_interval: 6s

    file_sd_configs:
      - files: 
          - "/Users/paulintodev/Desktop/targets3.json"
        refresh_interval: 4m

alerting:
  alertmanagers:
    - file_sd_configs:
      - files: 
          - "/Users/paulintodev/Desktop/targets.json"
        refresh_interval: 4m
    - file_sd_configs:
      - files: 
          - "/Users/paulintodev/Desktop/targets2.json"
        refresh_interval: 4m

I tested both with and without --enable-feature=new-service-discovery-manager.

I also forced several config reloads using pkill -SIGHUP -i prometheus to make sure that Prometheus doesn't crash and that no errors are logged.

After the change I see metrics such as these on http://localhost:9090/metrics:

prometheus_sd_file_read_errors_total{job_name="config-0",sd_type="notify"} 0
prometheus_sd_file_read_errors_total{job_name="config-0",sd_type="scrape"} 0
prometheus_sd_file_read_errors_total{job_name="config-1",sd_type="notify"} 0
prometheus_sd_file_read_errors_total{job_name="testing",sd_type="scrape"} 0
prometheus_sd_file_read_errors_total{job_name="testing2",sd_type="scrape"} 0

cc @bwplotka @beorn7

@@ -282,9 +303,10 @@ func (m *Manager) registerProviders(cfgs discovery.Configs, setName string) int
}
}
typ := cfg.Name()
reg := prometheus.WrapRegistererWith(prometheus.Labels{"sd_type": m.name, "job_name": setName}, m.registerer)
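For readers unfamiliar with client_golang: prometheus.WrapRegistererWith returns a Registerer that attaches the given constant labels to every metric registered through it, which is how each provider gets its own sd_type/job_name series. A minimal standalone sketch, with example label values:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	base := prometheus.NewRegistry()

	// Every metric registered through "wrapped" carries the constant
	// labels sd_type and job_name in addition to its own labels.
	wrapped := prometheus.WrapRegistererWith(
		prometheus.Labels{"sd_type": "scrape", "job_name": "testing"},
		base,
	)

	readErrors := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "prometheus_sd_file_read_errors_total",
		Help: "The number of File-SD read errors.",
	})
	wrapped.MustRegister(readErrors)

	// Gathering from the base registry now yields, e.g.:
	// prometheus_sd_file_read_errors_total{job_name="testing",sd_type="scrape"} 0
	families, _ := base.Gather()
	for _, mf := range families {
		fmt.Println(mf.GetName(), mf.GetMetric())
	}
}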
@ptodev (Contributor, author) commented on this diff:

I'm not sure whether those label names are the most appropriate.

Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
@ptodev force-pushed the fix-duplicate-registration branch from d87419c to 7660e75 on December 21, 2023 at 16:55
@beorn7 (Member) commented Dec 21, 2023

Thanks for catching this. The most appropriate label name would probably be just job (as we use it for the default target labels).

However, I'm not sure if it is desired to partition all the SD metrics by job. Maybe it's even an improvement, but maybe it's not useful at all. I would like to hear more opinions about this. @roidelapluie @juliusv (and anyone that feels qualified) WDYT?

@roidelapluie (Member) commented:

This is not an acceptable solution as it will create a lot of cardinality.

@ptodev (Contributor, author) commented Dec 27, 2023

The goal of #13023 was to allow users who import Prometheus SD as a library to pass in their own registry. I am not sure how to do this cleanly while still keeping a single set of metric series shared by all instances of an SD. Do you want to revert #13023 while we discuss a solution, or would you rather we find a solution straight away?

One solution could be to have an object of type any containing SD-specific metrics inside DiscovererOptions. We would need to remove Registerer from DiscovererOptions and then add new metrics structures such as FileSDMetrics, which would have to be created and managed by the discovery Manager. All of this sounds very convoluted. Maybe there is a better way?
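A very rough sketch of what that could look like; every name below (FileSDMetrics, NewFileSDMetrics, the Metrics field) is hypothetical and not an agreed design:

package discovery

import "github.com/prometheus/client_golang/prometheus"

// FileSDMetrics would be created once by the discovery Manager and shared
// by every file-SD instance, instead of each instance registering its own.
type FileSDMetrics struct {
	ReadErrors prometheus.Counter
}

func NewFileSDMetrics(reg prometheus.Registerer) *FileSDMetrics {
	m := &FileSDMetrics{
		ReadErrors: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "prometheus_sd_file_read_errors_total",
			Help: "The number of File-SD read errors.",
		}),
	}
	reg.MustRegister(m.ReadErrors)
	return m
}

// DiscovererOptions would no longer carry a Registerer; instead it would
// carry the pre-built, SD-specific metrics. The field has to be typed as
// any because every SD mechanism defines its own metrics struct.
type DiscovererOptions struct {
	Metrics any // e.g. *FileSDMetrics for file SD
}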

@beorn7 (Member) commented Dec 28, 2023

@roidelapluie thanks for your response.

@ptodev I guess that clarifies things: the cardinality increase is a problem on top of the problem of changing the labels of long-established metrics. I would say we should try not to have separate metrics per scrape job; in other words, whatever changes we make in the code behind the scenes, the exposed metrics shouldn't change.

So far, we have acted under the assumption that there would be one SD instance per kind of SD, not per job, but that's wrong; see above.

So the conclusion is that we truly have to share metrics between SD instances. For sharing, we either have to manage them outside of the SD instances (as you described in your last comment), or we go back and reconsider using the AlreadyRegisteredError pattern; see the example here: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#example-AlreadyRegisteredError
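For reference, a minimal sketch of that AlreadyRegisteredError pattern, reusing the already-registered collector instead of failing (the helper name and metric are only illustrative):

package main

import "github.com/prometheus/client_golang/prometheus"

// registerOrReuse registers c, but if an identical collector is already
// registered it returns the existing one instead of failing. Several SD
// instances calling this end up sharing a single counter.
func registerOrReuse(reg prometheus.Registerer, c prometheus.Counter) prometheus.Counter {
	if err := reg.Register(c); err != nil {
		if are, ok := err.(prometheus.AlreadyRegisteredError); ok {
			return are.ExistingCollector.(prometheus.Counter)
		}
		panic(err)
	}
	return c
}

func main() {
	reg := prometheus.NewRegistry()
	newCounter := func() prometheus.Counter {
		return prometheus.NewCounter(prometheus.CounterOpts{
			Name: "prometheus_sd_file_read_errors_total",
			Help: "The number of File-SD read errors.",
		})
	}

	first := registerOrReuse(reg, newCounter())
	second := registerOrReuse(reg, newCounter())
	_ = first
	_ = second // first and second are the same collector
}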

A problem with the latter approach is that you cannot really deregister metrics anymore (but de-registration is probably a problem with shared metrics anyway, and maybe we should just not de-register at all).

The former approach might be a bit cleaner in general, but then we are back to managing the whole lifecycle of the metrics outside of the SD instances, which is particularly annoying if you use the SD code as a library and have to remember to always initialize the correct set of metrics for each SD type.

@beorn7 (Member) commented Dec 28, 2023

Regarding reverting #13023: it didn't make it into v2.49, so I would say we don't need to emergency-rollback it, but a fix should still happen soon.

@ptodev (Contributor, author) commented Jan 4, 2024

Thank you for the feedback! Let's move the discussion to #13375, which is a different attempt at fixing this issue.

@ptodev closed this Jan 4, 2024
Successfully merging this pull request may close the following issue: SD: failed to register metric: duplicate metrics collector registration attempted consul.go:349