
[inputs.internal] collection taking longer than 15s whilst outputs seem to stop outputting data #16070

@burnjake

Relevant telegraf.conf

[agent]
  collection_jitter = "3s"
  debug = true
  flush_interval = "15s"
  flush_jitter = "0s"
  hostname = "$HOSTNAME"
  interval = "15s"
  logfile = ""
  metric_batch_size = 15000
  metric_buffer_limit = 100000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = true
[[processors.converter]]
  namepass = [
    "foobar_duration_ms"
  ]
  [processors.converter.tags]
    integer = [
        "duration"
    ]
[[aggregators.histogram]]
  drop_original = true
  grace = "120s"
  namepass = [
    "foobar_duration_ms"
  ]
  period = "30s"
  [[aggregators.histogram.config]]
    buckets = [50.0, 100.0, 250.0, 500.0, 750.0, 1000.0, 2000.0, 5000.0, 10000.0, 25000.0]
    fields = [
              "duration"
    ]
    measurement_name = "foobar_duration_ms"
[[outputs.prometheus_client]]
  collectors_exclude = [
    "gocollector",
    "process"
  ]
  listen = ":9273"
[[inputs.opentelemetry]]
[[inputs.internal]]
  collect_memstats = true

Logs from Telegraf

Logs are normally just a mixture of lines like D! [aggregators.histogram] Updated aggregation range... and D! [outputs.prometheus_client] Buffer fullness: 2380 / 100000 metrics.

When telegraf silently stops working in this case, there is a spike in the number of log lines saying both
D! [aggregators.histogram] Metric is outside aggregation window; discarding... and

W! [inputs.internal] Collection took longer than expected; not complete after interval of 15s
D! [inputs.internal] Previous collection has not completed; scheduled collection skipped
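For context on how I read the window arithmetic here (my understanding of the aggregator's period/grace handling, not confirmed against the source): with period = "30s" and grace = "120s", a metric should only be accepted if its timestamp falls between roughly 120s before the current 30s window's start and the window's end, so anything much more than ~150s behind wall-clock time is logged as outside the aggregation window and discarded. A spike in that message would therefore suggest metric timestamps falling well behind real time around the point the hang starts.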

System info

telegraf:1.32-alpine, K8s 1.29.9

Steps to reproduce

Sadly I've tried to load test this locally, but I have yet to reproduce it anywhere other than production. I will add detail once I can do so reliably.

Expected behavior

telegraf continues to aggregate and output metrics, and if there is a problem, the process exits, allowing it to be automatically restarted if the user wishes.

Actual behavior

It "hangs" without restarting after ~12h of ingestion. It indicates that it cannot gather data from the internal input but doesn't log that it isn't able to gather data from the opentelemetry input, even though it stops outputting data entirely. It continues to utilise around the same amount of memory and CPU even though it apparently isn't gathering/process/aggregating/outputting any data.

The "fix" for this is to restart telegraf. Some graphs to help illustrate the behaviour:

[graphs omitted]

Something I've noticed is that the rate of metrics written keeps increasing, which I'm guessing is a function of cardinality and of whether telegraf has yet "seen" all the combinations of values for the labels of a given metric. As the output isn't configured to expire metrics, there is no way for this number to ever decrease without restarting the process. Maybe the cardinality is too high here, but I'm not sure how best to measure whether it is; I would perhaps expect CPU and memory to approach the provisioned limits, but so far I haven't seen this.
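For illustration only (I haven't applied this in production, and the 5m value is arbitrary): my understanding is that the prometheus_client output has an expiration_interval option that drops series which haven't been updated within the given duration, which would at least allow the exported series count to fall again without a restart:

[[outputs.prometheus_client]]
  collectors_exclude = [
    "gocollector",
    "process"
  ]
  listen = ":9273"
  # Assumed option: expire series that haven't been updated within this
  # duration so the number of exported series can decrease again.
  expiration_interval = "5m"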

Additional info

I initially raised this in the #telegraf Slack channel in the InfluxDB workspace and was directed to raise a bug report. Thread link.
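A possible stop-gap (untested, and based only on my reading of the outputs.health plugin's options, so whether the check actually fails once metrics stop arriving needs verifying) would be to expose a health endpoint and point a Kubernetes livenessProbe at it, so the pod is restarted automatically when the pipeline stalls:

[[outputs.health]]
  service_address = "http://:8080"
  namepass = ["internal_write"]
  # Hypothetical check: fail the health endpoint if the output buffer grows
  # beyond this many metrics (the threshold is illustrative only).
  [[outputs.health.compares]]
    field = "buffer_size"
    lt = 50000.0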
