Description
What did you do?
We have a test to validate that Prometheus is scraping endpoints in k8s with some specific configuration.
We noticed that, from time to time, the Endpoints were not being scraped because the __meta_kubernetes_pod_phase label was not updated to the actual phase of the Pod (Running); it stayed in the Pending phase.
Because Prometheus has to discover the endpoints while the Pod phase is still Pending, I haven't found a simple set of steps to reproduce it.
I created a unit test that represents the case, and it fails for the Endpoints role. I uploaded the test to this PR. In the same PR I modified a Pod role test to check that the update works in that case.
Prometheus is running in agent mode.
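To make the impact concrete: both Kubernetes jobs in the configuration file below carry an `action: drop` rule on __meta_kubernetes_pod_phase. The following is a minimal sketch of that single rule only, not Prometheus' relabeling code; the `is_dropped` helper and the sample label sets are hypothetical and exist just for illustration.

```python
import re

# Regex copied from the `action: drop` rule on __meta_kubernetes_pod_phase.
DROP_PHASES = re.compile(r"Pending|Succeeded|Failed|Completed")


def is_dropped(labels: dict) -> bool:
    """Return True if the drop rule would remove a target with these labels.

    Relabel regexes are anchored, so fullmatch is used here.
    A missing label is treated as the empty string.
    """
    phase = labels.get("__meta_kubernetes_pod_phase", "")
    return DROP_PHASES.fullmatch(phase) is not None


# Hypothetical label set discovered while the Pod was still starting (stale).
stale = {"__meta_kubernetes_pod_phase": "Pending"}
# Hypothetical label set that matches the Pod's real phase.
fresh = {"__meta_kubernetes_pod_phase": "Running"}

print(is_dropped(stale))  # True: the target is dropped and never scraped
print(is_dropped(fresh))  # False: the target is kept
```

With the stale value the target is filtered out before any scrape happens, which matches the symptom of a running Pod whose endpoint is never scraped.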
What did you expect to see?
The __meta_kubernetes_pod_phase label should be updated with the current phase of the Pod.
What did you see instead? Under which circumstances?
The __meta_kubernetes_pod_phase label kept a previous value.
I launched a separate Prometheus instance and the label was correctly discovered.
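One way to picture the difference between the two roles is the toy model below. It is only an assumption about the kind of event ordering that could leave the label stale, not a description of Prometheus' actual discovery code; the class names, event handlers, and the `web-0` Pod are all hypothetical.

```python
class PodRoleModel:
    """Hypothetical model: labels are rebuilt on every Pod update."""

    def __init__(self):
        self.labels = {}

    def on_pod_update(self, pod):
        self.labels = {"__meta_kubernetes_pod_phase": pod["phase"]}


class EndpointsRoleModel:
    """Hypothetical model: labels are rebuilt only on Endpoints events."""

    def __init__(self):
        self.pods = {}
        self.labels = {}

    def on_pod_update(self, pod):
        # The Pod store is refreshed, but the target labels are not rebuilt.
        self.pods[pod["name"]] = pod

    def on_endpoints_update(self, endpoints):
        # Labels are derived from the Pod as it looks at this moment.
        pod = self.pods[endpoints["pod"]]
        self.labels = {"__meta_kubernetes_pod_phase": pod["phase"]}


def replay(model):
    """Replay: Pod is Pending, Endpoints appear, then the Pod becomes Running."""
    model.on_pod_update({"name": "web-0", "phase": "Pending"})
    if hasattr(model, "on_endpoints_update"):
        model.on_endpoints_update({"pod": "web-0"})
    model.on_pod_update({"name": "web-0", "phase": "Running"})
    return model.labels["__meta_kubernetes_pod_phase"]


print(replay(PodRoleModel()))        # Running
print(replay(EndpointsRoleModel()))  # Pending
```

In this model the Endpoints target only recovers when another Endpoints event arrives, which would also be consistent with a freshly started instance discovering the label correctly.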
System information
Linux 5.10.104-linuxkit aarch64
Prometheus version
prometheus, version 2.37.0 (branch: HEAD, revision: b41e0750abf5cc18d8233161560731de05199330)
build user: root@0ebb6827e27f
build date: 20220714-15:19:21
go version: go1.18.4
platform: linux/arm64
Prometheus configuration file
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 1m
  external_labels:
    cluster_name: gs-local-e2e
scrape_configs:
- job_name: self-metrics
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  enable_http2: true
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: prometheus_agent_active_series|prometheus_target_interval_length_seconds|prometheus_target_scrape_pool_targets|prometheus_remote_storage_samples_pending|prometheus_remote_storage_samples_in_total|prometheus_remote_storage_samples_retried_total|prometheus_agent_corruptions_total|prometheus_remote_storage_shards|prometheus_sd_kubernetes_events_total|prometheus_agent_checkpoint_creations_failed_total|prometheus_agent_checkpoint_deletions_failed_total|prometheus_remote_storage_samples_dropped_total|prometheus_remote_storage_samples_failed_total|prometheus_sd_kubernetes_http_request_total|prometheus_agent_truncate_duration_seconds_sum|prometheus_build_info|process_resident_memory_bytes|process_virtual_memory_bytes|process_cpu_seconds_total
    replacement: $1
    action: keep
  static_configs:
  - targets:
    - localhost:9090
- job_name: kubernetes-job-pod
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  enable_http2: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_phase]
    separator: ;
    regex: Pending|Succeeded|Failed|Completed
    replacement: $1
    action: drop
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: (.+?)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
    replacement: __param_$1
    action: labelmap
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: (.*)
    target_label: node
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  kubernetes_sd_configs:
  - role: pod
    kubeconfig_file: ""
    follow_redirects: true
    enable_http2: true
- job_name: kubernetes-job-endpoints
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  enable_http2: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_phase]
    separator: ;
    regex: Pending|Succeeded|Failed|Completed
    replacement: $1
    action: drop
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: (.+?)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_annotation_prometheus_io_param_(.+)
    replacement: __param_$1
    action: labelmap
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: (.*)
    target_label: node
    replacement: $1
    action: replace
  kubernetes_sd_configs:
  - role: endpoints
    kubeconfig_file: ""
    follow_redirects: true
    enable_http2: true
remote_write:
- url: https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=newrelic-prometheus-0
  remote_timeout: 30s
  authorization:
    type: Bearer
    credentials: <secret>
  follow_redirects: true
  enable_http2: true
  queue_config:
    capacity: 2500
    max_shards: 200
    min_shards: 1
    max_samples_per_send: 500
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 5s
  metadata_config:
    send: true
    send_interval: 1m
    max_samples_per_send: 500
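For readers less familiar with the `__address__` rule used in both Kubernetes jobs, here is a rough sketch of what the `replace` action with regex `(.+?)(?::\d+)?;(\d+)` and replacement `$1:$2` is expected to do. It only approximates the relabeling semantics (source labels joined by the `;` separator, anchored regex, label left unchanged when nothing matches); the `rewrite_address` helper and the sample addresses are hypothetical.

```python
import re

# Regex copied from the __address__ relabel rule in the configuration above.
ADDRESS_RE = re.compile(r"(.+?)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Approximate the `replace` rule that builds __address__.

    The two source label values are joined with the `;` separator and the
    anchored regex is applied. When it does not match, the address is
    returned unchanged.
    """
    joined = ";".join([address, port_annotation])
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        return address
    # Equivalent of the `$1:$2` replacement: host, then the annotated port.
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.0.0.5:8080", "9102"))  # 10.0.0.5:9102
print(rewrite_address("10.0.0.5", "9102"))       # 10.0.0.5:9102
print(rewrite_address("10.0.0.5:8080", ""))      # 10.0.0.5:8080
```

The rule only changes the port, so it does not influence whether a target is kept; that decision rests on the `keep` and `drop` rules earlier in the list.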
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
ts=2022-09-14T10:47:31.915Z caller=main.go:187 level=info msg="Experimental agent mode enabled."
ts=2022-09-14T10:47:31.915Z caller=main.go:172 level=info msg="Experimental expand-external-labels enabled"
ts=2022-09-14T10:47:31.916Z caller=main.go:535 level=info msg="Starting Prometheus Agent" mode=agent version="(version=2.37.0, branch=HEAD, revision=b41e0750abf5cc18d8233161560731de05199330)"
ts=2022-09-14T10:47:31.917Z caller=main.go:540 level=info build_context="(go=go1.18.4, user=root@0ebb6827e27f, date=20220714-15:19:21)"
ts=2022-09-14T10:47:31.917Z caller=main.go:541 level=info host_details="(Linux 5.10.104-linuxkit #1 SMP PREEMPT Thu Mar 17 17:05:54 UTC 2022 aarch64 newrelic-prometheus-4 (none))"
ts=2022-09-14T10:47:31.917Z caller=main.go:542 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-09-14T10:47:31.917Z caller=main.go:543 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-09-14T10:47:31.917Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2022-09-14T10:47:31.918Z caller=main.go:1028 level=info msg="Starting WAL storage ..."
ts=2022-09-14T10:47:31.983Z caller=tls_config.go:195 level=info component=web msg="TLS is disabled." http2=false
ts=2022-09-14T10:47:31.984Z caller=db.go:332 level=info msg="replaying WAL, this may take a while" dir=/etc/prometheus/storage/wal
ts=2022-09-14T10:47:31.985Z caller=db.go:383 level=info msg="WAL segment loaded" segment=0 maxSegment=0
ts=2022-09-14T10:47:31.985Z caller=main.go:1049 level=info fs_type=EXT4_SUPER_MAGIC
ts=2022-09-14T10:47:31.985Z caller=main.go:1052 level=info msg="Agent WAL storage started"
ts=2022-09-14T10:47:31.985Z caller=main.go:1177 level=info msg="Loading configuration file" filename=/etc/prometheus/config/config.yaml
ts=2022-09-14T10:47:31.985Z caller=dedupe.go:112 component=remote level=info remote_name=e4c6b2 url="https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=newrelic-prometheus-4" msg="Starting WAL watcher" queue=e4c6b2
ts=2022-09-14T10:47:31.985Z caller=dedupe.go:112 component=remote level=info remote_name=e4c6b2 url="https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=newrelic-prometheus-4" msg="Starting scraped metadata watcher"
ts=2022-09-14T10:47:31.986Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-09-14T10:47:31.986Z caller=dedupe.go:112 component=remote level=info remote_name=e4c6b2 url="https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=newrelic-prometheus-4" msg="Replaying WAL" queue=e4c6b2
ts=2022-09-14T10:47:31.986Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-09-14T10:47:31.986Z caller=main.go:1214 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config/config.yaml totalDuration=1.198625ms db_storage=583ns remote_storage=173.167µs web_handler=166ns query_engine=209ns
ts=2022-09-14T10:47:31.986Z caller=main.go:957 level=info msg="Server is ready to receive web requests."
ts=2022-09-14T10:47:40.436Z caller=dedupe.go:112 component=remote level=info remote_name=e4c6b2 url="https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=newrelic-prometheus-4" msg="Done replaying WAL" duration=8.450239004s