
Conversation

@ptodev (Contributor) commented Dec 20, 2023

This PR makes each service discovery (SD) instance report its own debug metric series. Currently, on the latest stable Prometheus release, if a Prometheus process runs more than one instance of the same SD mechanism, it is not possible to distinguish between the metrics of the different instances.

Fixes #13312

Unfortunately, #13023 introduced a bug whereby if two scrape jobs use the same service discovery, Prometheus would crash.

While working on #13023, I assumed that Prometheus can't instantiate more than one instance of the same SD. I therefore missed the most obvious test case - the one where more than one scrape job uses a particular SD. I also didn't test Alertmanager SDs. This led to the bug described in #13312.
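For context, here is a minimal standalone sketch (not the actual Prometheus code) of why a second instance of the same SD crashes the process once both instances register identical collectors on one registry:

package main

import "github.com/prometheus/client_golang/prometheus"

func newFileSDCounter() prometheus.Counter {
	// Both SD instances build a collector with an identical descriptor.
	return prometheus.NewCounter(prometheus.CounterOpts{
		Name: "prometheus_sd_file_read_errors_total",
		Help: "The number of File-SD read errors.",
	})
}

func main() {
	reg := prometheus.NewRegistry()

	reg.MustRegister(newFileSDCounter()) // first SD instance: fine
	reg.MustRegister(newFileSDCounter()) // second SD instance: panics with
	// "duplicate metrics collector registration attempted"
}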

I tested this bugfix locally using the config file below.

Prometheus config
global:
  scrape_interval:     16s 

scrape_configs:
  - job_name: 'testing'
    scrape_interval: 6s

    file_sd_configs:
      - files: 
          - "/Users/paulintodev/Desktop/targets.json"
        refresh_interval: 4m
          
  - job_name: 'testing2'
    scrape_interval: 6s

    file_sd_configs:
      - files: 
          - "/Users/paulintodev/Desktop/targets2.json"
        refresh_interval: 4m
          
  - job_name: 'config-0'
    scrape_interval: 6s

    file_sd_configs:
      - files: 
          - "/Users/paulintodev/Desktop/targets3.json"
        refresh_interval: 4m

alerting:
  alertmanagers:
    - file_sd_configs:
      - files: 
          - "/Users/paulintodev/Desktop/targets.json"
        refresh_interval: 4m
    - file_sd_configs:
      - files: 
          - "/Users/paulintodev/Desktop/targets2.json"
        refresh_interval: 4m

I tested both with and without --enable-feature=new-service-discovery-manager.

I also forced several config reloads using pkill -SIGHUP -i prometheus to make sure that Prometheus doesn't crash and that no errors are logged.

After the change I see metrics such as these on http://localhost:9090/metrics:

prometheus_sd_file_read_errors_total{job_name="config-0",sd_type="notify"} 0
prometheus_sd_file_read_errors_total{job_name="config-0",sd_type="scrape"} 0
prometheus_sd_file_read_errors_total{job_name="config-1",sd_type="notify"} 0
prometheus_sd_file_read_errors_total{job_name="testing",sd_type="scrape"} 0
prometheus_sd_file_read_errors_total{job_name="testing2",sd_type="scrape"} 0

cc @bwplotka @beorn7

@@ -282,9 +303,10 @@ func (m *Manager) registerProviders(cfgs discovery.Configs, setName string) int
}
}
typ := cfg.Name()
reg := prometheus.WrapRegistererWith(prometheus.Labels{"sd_type": m.name, "job_name": setName}, m.registerer)
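For readers unfamiliar with client_golang: prometheus.WrapRegistererWith returns a Registerer that attaches the given constant labels to every metric registered through it, which is how each provider gets its own sd_type/job_name series. A minimal standalone sketch, with example label values:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	base := prometheus.NewRegistry()

	// Every metric registered through "wrapped" carries the constant
	// labels sd_type and job_name in addition to its own labels.
	wrapped := prometheus.WrapRegistererWith(
		prometheus.Labels{"sd_type": "scrape", "job_name": "testing"},
		base,
	)

	readErrors := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "prometheus_sd_file_read_errors_total",
		Help: "The number of File-SD read errors.",
	})
	wrapped.MustRegister(readErrors)

	// Gathering from the base registry now yields, e.g.:
	// prometheus_sd_file_read_errors_total{job_name="testing",sd_type="scrape"} 0
	families, _ := base.Gather()
	for _, mf := range families {
		fmt.Println(mf.GetName(), mf.GetMetric())
	}
}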
@ptodev (Contributor, author) commented on this diff:

I'm not sure whether those label names are the most appropriate.

Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
@ptodev force-pushed the fix-duplicate-registration branch from d87419c to 7660e75 on December 21, 2023 at 16:55
@beorn7 (Member) commented Dec 21, 2023

Thanks for catching this. The most appropriate label name would probably be just job (as we use it for the default target labels).

However, I'm not sure if it is desired to partition all the SD metrics by job. Maybe it's even an improvement, but maybe it's not useful at all. I would like to hear more opinions about this. @roidelapluie @juliusv (and anyone that feels qualified) WDYT?

@roidelapluie (Member) commented:

This is not an acceptable solution as it will create a lot of cardinality.

@ptodev (Contributor, author) commented Dec 27, 2023

The goal of #13023 was to allow users who import Prometheus SD as a library to pass in their own registry. I am not sure how to do this cleanly while still keeping a single set of metric series shared by all instances of an SD. Do you want to revert #13023 while we discuss a solution, or would you rather we find a solution straight away?

One solution could be to have an object of type any containing SD-specific metrics inside DiscovererOptions. We would need to remove Registerer from DiscovererOptions and then add new metrics structures such as FileSDMetrics, which would have to be created and managed by the discovery Manager. All of this sounds very convoluted. Maybe there is a better way?
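A very rough sketch of what that could look like; every name below (FileSDMetrics, NewFileSDMetrics, the Metrics field) is hypothetical and not an agreed design:

package discovery

import "github.com/prometheus/client_golang/prometheus"

// FileSDMetrics would be created once by the discovery Manager and shared
// by every file-SD instance, instead of each instance registering its own.
type FileSDMetrics struct {
	ReadErrors prometheus.Counter
}

func NewFileSDMetrics(reg prometheus.Registerer) *FileSDMetrics {
	m := &FileSDMetrics{
		ReadErrors: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "prometheus_sd_file_read_errors_total",
			Help: "The number of File-SD read errors.",
		}),
	}
	reg.MustRegister(m.ReadErrors)
	return m
}

// DiscovererOptions would no longer carry a Registerer; instead it would
// carry the pre-built, SD-specific metrics. The field has to be typed as
// any because every SD mechanism defines its own metrics struct.
type DiscovererOptions struct {
	Metrics any // e.g. *FileSDMetrics for file SD
}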

@beorn7 (Member) commented Dec 28, 2023

@roidelapluie thanks for your response.

@ptodev I guess that clarifies things: the cardinality increase is a problem on top of the problem of changing the labels of long-established metrics. I would say we should try not to have separate metrics per scrape job; in other words, whatever changes we make in the code behind the scenes, the exposed metrics shouldn't change.

So far, we have acted under the assumption that there would be one SD instance per kind of SD, not per job, but that's wrong; see above.

So the conclusion is that we truly have to share metrics between SD instances. For sharing, we either have to manage them outside of the SD instances (as you described in your last comment), or we go back and reconsider using the AlreadyRegisteredError pattern; see the example here: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#example-AlreadyRegisteredError
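For reference, a minimal sketch of that AlreadyRegisteredError pattern, reusing the already-registered collector instead of failing (the helper name and metric are only illustrative):

package main

import "github.com/prometheus/client_golang/prometheus"

// registerOrReuse registers c, but if an identical collector is already
// registered it returns the existing one instead of failing. Several SD
// instances calling this end up sharing a single counter.
func registerOrReuse(reg prometheus.Registerer, c prometheus.Counter) prometheus.Counter {
	if err := reg.Register(c); err != nil {
		if are, ok := err.(prometheus.AlreadyRegisteredError); ok {
			return are.ExistingCollector.(prometheus.Counter)
		}
		panic(err)
	}
	return c
}

func main() {
	reg := prometheus.NewRegistry()
	newCounter := func() prometheus.Counter {
		return prometheus.NewCounter(prometheus.CounterOpts{
			Name: "prometheus_sd_file_read_errors_total",
			Help: "The number of File-SD read errors.",
		})
	}

	first := registerOrReuse(reg, newCounter())
	second := registerOrReuse(reg, newCounter())
	_ = first
	_ = second // first and second are the same collector
}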

A problem with the latter approach is that you cannot really deregister metrics anymore (but de-registration is probably a problem with shared metrics anyway, and maybe we should just not de-register at all).

The former approach might be a bit cleaner in general, but then we are back to managing the whole lifecycle of the metrics outside of the SD instances, which is particularly annoying if you use the SD code as a library and have to remember to always initialize the correct set of metrics for each SD type.

@beorn7 (Member) commented Dec 28, 2023

Regarding reverting #13023: it didn't make it into v2.49, so I would say we don't need to emergency-rollback it, but a fix should still happen soon.

@ptodev (Contributor, author) commented Jan 4, 2024

Thank you for the feedback! Let's move the discussion to #13375, which is a different attempt at fixing this issue.

@ptodev closed this Jan 4, 2024
Successfully merging this pull request may close the following issue: SD: failed to register metric: duplicate metrics collector registration attempted consul.go:349