Pod discovery labels not being updated on endpoints role using kubernetes_sd_config #11305

@gsanchezgavier

What did you do?

We have a test to validate that Prometheus is scraping endpoints in k8s with some specific configuration.

We noticed that, from time to time, the Endpoints were not being scraped because __meta_kubernetes_pod_phase was not updated to the actual phase of the Pod (Running); it stayed in the Pending phase.
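To make the effect of a stale phase label concrete, here is a minimal sketch of how the `action: drop` rule on `__meta_kubernetes_pod_phase` from the configuration file in this report treats a target. It is a model written for illustration, not Prometheus code: `is_dropped` is a hypothetical helper, and `re.fullmatch` stands in for the fact that Prometheus anchors relabel regexes at both ends.

```python
import re

# Same alternation as the drop rule in the kubernetes-job-endpoints job.
DROP_PHASES = re.compile(r"Pending|Succeeded|Failed|Completed")


def is_dropped(labels: dict[str, str]) -> bool:
    """Model a single-source-label `action: drop` rule (hypothetical helper).

    A missing source label is treated as the empty string, and the regex
    must match the whole value, hence fullmatch.
    """
    value = labels.get("__meta_kubernetes_pod_phase", "")
    return DROP_PHASES.fullmatch(value) is not None


stale = {"__meta_kubernetes_pod_phase": "Pending"}  # label never refreshed
fresh = {"__meta_kubernetes_pod_phase": "Running"}  # label after a correct update

print(is_dropped(stale))  # True: the target is dropped and never scraped
print(is_dropped(fresh))  # False: the target is kept
```

As long as the label stays at Pending, the rule keeps dropping the target even though the Pod itself is already Running.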

Since Prometheus has to discover the endpoints while the pod phase is still Pending, I haven't found simple steps to reproduce it.
I created a unit test that represents the case, and it fails for the Endpoints role. I uploaded the test to this PR. In the same PR I modified a Pod role test to check that the update works in that case.
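The sequence the test represents can be sketched with a toy model. Everything here is hypothetical (the class `ToyEndpointsDiscovery`, its methods, and the pod name `web-0`); it is not the unit test from the PR and makes no claim about where the problem lies inside Prometheus. It only shows the order of events (pod seen as Pending, endpoints discovered, pod becomes Running) and the two possible outcomes for the label.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEndpointsDiscovery:
    """Hypothetical stand-in for an endpoints-role discoverer (not Prometheus code)."""

    refresh_on_pod_update: bool
    pods: dict[str, str] = field(default_factory=dict)     # pod name -> current phase
    targets: dict[str, str] = field(default_factory=dict)  # pod name -> phase label on target

    def on_endpoints_event(self, pod_name: str) -> None:
        # Build the target from the pod state known at this moment.
        self.targets[pod_name] = self.pods[pod_name]

    def on_pod_event(self, pod_name: str, phase: str) -> None:
        self.pods[pod_name] = phase
        # Only a discoverer that reacts to pod updates refreshes the label.
        if self.refresh_on_pod_update and pod_name in self.targets:
            self.targets[pod_name] = phase


def run_sequence(refresh: bool) -> str:
    d = ToyEndpointsDiscovery(refresh_on_pod_update=refresh)
    d.on_pod_event("web-0", "Pending")   # pod exists but is not running yet
    d.on_endpoints_event("web-0")        # endpoints discovered while Pending
    d.on_pod_event("web-0", "Running")   # pod transitions afterwards
    return d.targets["web-0"]


print(run_sequence(refresh=True))   # Running (expected behaviour)
print(run_sequence(refresh=False))  # Pending (the stale value reported here)
```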

Prometheus is running in agent mode.

What did you expect to see?

The __meta_kubernetes_pod_phase label should be updated to the current phase of the Pod.

What did you see instead? Under which circumstances?

__meta_kubernetes_pod_phase still had a previous, stale value (Pending).

I launched a separate Prometheus instance and the label was correctly discovered.

System information

Linux 5.10.104-linuxkit aarch64

Prometheus version

prometheus, version 2.37.0 (branch: HEAD, revision: b41e0750abf5cc18d8233161560731de05199330)
  build user:       root@0ebb6827e27f
  build date:       20220714-15:19:21
  go version:       go1.18.4
  platform:         linux/arm64

Prometheus configuration file

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 1m
  external_labels:
    cluster_name: gs-local-e2e
scrape_configs:
- job_name: self-metrics
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  enable_http2: true
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: prometheus_agent_active_series|prometheus_target_interval_length_seconds|prometheus_target_scrape_pool_targets|prometheus_remote_storage_samples_pending|prometheus_remote_storage_samples_in_total|prometheus_remote_storage_samples_retried_total|prometheus_agent_corruptions_total|prometheus_remote_storage_shards|prometheus_sd_kubernetes_events_total|prometheus_agent_checkpoint_creations_failed_total|prometheus_agent_checkpoint_deletions_failed_total|prometheus_remote_storage_samples_dropped_total|prometheus_remote_storage_samples_failed_total|prometheus_sd_kubernetes_http_request_total|prometheus_agent_truncate_duration_seconds_sum|prometheus_build_info|process_resident_memory_bytes|process_virtual_memory_bytes|process_cpu_seconds_total
    replacement: $1
    action: keep
  static_configs:
  - targets:
    - localhost:9090
- job_name: kubernetes-job-pod
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  enable_http2: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_phase]
    separator: ;
    regex: Pending|Succeeded|Failed|Completed
    replacement: $1
    action: drop
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: (.+?)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
    replacement: __param_$1
    action: labelmap
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: (.*)
    target_label: node
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  kubernetes_sd_configs:
  - role: pod
    kubeconfig_file: ""
    follow_redirects: true
    enable_http2: true
- job_name: kubernetes-job-endpoints
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  enable_http2: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_phase]
    separator: ;
    regex: Pending|Succeeded|Failed|Completed
    replacement: $1
    action: drop
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: (.+?)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_annotation_prometheus_io_param_(.+)
    replacement: __param_$1
    action: labelmap
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: (.*)
    target_label: node
    replacement: $1
    action: replace
  kubernetes_sd_configs:
  - role: endpoints
    kubeconfig_file: ""
    follow_redirects: true
    enable_http2: true
remote_write:
- url: https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=newrelic-prometheus-0
  remote_timeout: 30s
  authorization:
    type: Bearer
    credentials: <secret>
  follow_redirects: true
  enable_http2: true
  queue_config:
    capacity: 2500
    max_shards: 200
    min_shards: 1
    max_samples_per_send: 500
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s
  metadata_config:
    send: true
    send_interval: 1m
    max_samples_per_send: 500

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

ts=2022-09-14T10:47:31.915Z caller=main.go:187 level=info msg="Experimental agent mode enabled."
ts=2022-09-14T10:47:31.915Z caller=main.go:172 level=info msg="Experimental expand-external-labels enabled"
ts=2022-09-14T10:47:31.916Z caller=main.go:535 level=info msg="Starting Prometheus Agent" mode=agent version="(version=2.37.0, branch=HEAD, revision=b41e0750abf5cc18d8233161560731de05199330)"
ts=2022-09-14T10:47:31.917Z caller=main.go:540 level=info build_context="(go=go1.18.4, user=root@0ebb6827e27f, date=20220714-15:19:21)"
ts=2022-09-14T10:47:31.917Z caller=main.go:541 level=info host_details="(Linux 5.10.104-linuxkit #1 SMP PREEMPT Thu Mar 17 17:05:54 UTC 2022 aarch64 newrelic-prometheus-4 (none))"
ts=2022-09-14T10:47:31.917Z caller=main.go:542 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-09-14T10:47:31.917Z caller=main.go:543 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-09-14T10:47:31.917Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2022-09-14T10:47:31.918Z caller=main.go:1028 level=info msg="Starting WAL storage ..."
ts=2022-09-14T10:47:31.983Z caller=tls_config.go:195 level=info component=web msg="TLS is disabled." http2=false
ts=2022-09-14T10:47:31.984Z caller=db.go:332 level=info msg="replaying WAL, this may take a while" dir=/etc/prometheus/storage/wal
ts=2022-09-14T10:47:31.985Z caller=db.go:383 level=info msg="WAL segment loaded" segment=0 maxSegment=0
ts=2022-09-14T10:47:31.985Z caller=main.go:1049 level=info fs_type=EXT4_SUPER_MAGIC
ts=2022-09-14T10:47:31.985Z caller=main.go:1052 level=info msg="Agent WAL storage started"
ts=2022-09-14T10:47:31.985Z caller=main.go:1177 level=info msg="Loading configuration file" filename=/etc/prometheus/config/config.yaml
ts=2022-09-14T10:47:31.985Z caller=dedupe.go:112 component=remote level=info remote_name=e4c6b2 url="https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=newrelic-prometheus-4" msg="Starting WAL watcher" queue=e4c6b2
ts=2022-09-14T10:47:31.985Z caller=dedupe.go:112 component=remote level=info remote_name=e4c6b2 url="https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=newrelic-prometheus-4" msg="Starting scraped metadata watcher"
ts=2022-09-14T10:47:31.986Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-09-14T10:47:31.986Z caller=dedupe.go:112 component=remote level=info remote_name=e4c6b2 url="https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=newrelic-prometheus-4" msg="Replaying WAL" queue=e4c6b2
ts=2022-09-14T10:47:31.986Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-09-14T10:47:31.986Z caller=main.go:1214 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config/config.yaml totalDuration=1.198625ms db_storage=583ns remote_storage=173.167µs web_handler=166ns query_engine=209ns
ts=2022-09-14T10:47:31.986Z caller=main.go:957 level=info msg="Server is ready to receive web requests."
ts=2022-09-14T10:47:40.436Z caller=dedupe.go:112 component=remote level=info remote_name=e4c6b2 url="https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=newrelic-prometheus-4" msg="Done replaying WAL" duration=8.450239004s
