Description
Relevant telegraf.conf
[agent]
  collection_jitter = "3s"
  debug = true
  flush_interval = "15s"
  flush_jitter = "0s"
  hostname = "$HOSTNAME"
  interval = "15s"
  logfile = ""
  metric_batch_size = 15000
  metric_buffer_limit = 100000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = true

[[processors.converter]]
  namepass = ["foobar_duration_ms"]
  [processors.converter.tags]
    integer = ["duration"]

[[aggregators.histogram]]
  drop_original = true
  grace = "120s"
  namepass = ["foobar_duration_ms"]
  period = "30s"
  [[aggregators.histogram.config]]
    buckets = [50.0, 100.0, 250.0, 500.0, 750.0, 1000.0, 2000.0, 5000.0, 10000.0, 25000.0]
    fields = ["duration"]
    measurement_name = "foobar_duration_ms"

[[outputs.prometheus_client]]
  collectors_exclude = ["gocollector", "process"]
  listen = ":9273"

[[inputs.opentelemetry]]

[[inputs.internal]]
  collect_memstats = true
Logs from Telegraf
Logs are normally just a mixture of D! [aggregators.histogram] Updated aggregation range...
or D! [outputs.prometheus_client] Buffer fullness: 2380 / 100000 metrics.
When telegraf silently crashes in this case, there is a spike in the number of logs saying both
D! [aggregators.histogram] Metric is outside aggregation window; discarding...
and
W! [inputs.internal] Collection took longer than expected; not complete after interval of 15s
D! [inputs.internal] Previous collection has not completed; scheduled collection skipped
System info
telegraf:1.32-alpine, K8s 1.29.9
Steps to reproduce
I've tried to load test it locally, but sadly I have yet to reproduce this anywhere other than production. I will add detail once I can do so reliably.
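In case it helps anyone else attempting to reproduce this, a synthetic input along the lines below should at least push metrics through the same histogram aggregator and prometheus_client output. The inputs.mock values are illustrative guesses on my part rather than my actual test setup, and it emits duration as a field directly instead of exercising the converter:

[[inputs.mock]]
  ## Hypothetical synthetic source: emit metrics under the same name that the
  ## aggregator's namepass matches
  metric_name = "foobar_duration_ms"
  ## Random "duration" field spread across the configured histogram buckets
  [[inputs.mock.random]]
    name = "duration"
    min = 1.0
    max = 30000.0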
Expected behavior
telegraf continues to aggregate and output metrics, and if there is a problem, the process exits, allowing it to be automatically restarted if the user wishes.
Actual behavior
It "hangs" without restarting after ~12h of ingestion. It indicates that it cannot gather data from the internal
input but doesn't log that it isn't able to gather data from the opentelemetry
input, even though it stops outputting data entirely. It continues to utilise around the same amount of memory and CPU even though it apparently isn't gathering/process/aggregating/outputting any data.
The "fix" for this is to restart telegraf. Some graphs to help illustrate the behaviour:
Something I've noticed is that the rate of metrics written keeps increasing, which I'm guessing is a function of cardinality and of whether telegraf has yet "seen" all the combinations of values for the labels of a given metric. As the output isn't configured to expire metrics, there is no way for this number to ever decrease without restarting the process. Maybe the cardinality is too high here, but I'm not sure how best to measure whether it is. I would perhaps expect CPU and memory to approach the provisioned limit in that case, but so far I haven't seen this.
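For what it's worth, my reading of the prometheus_client output docs is that expiration_interval is the option that would let exported series age out. Something like the sketch below is what I'd try (the 10m value is an arbitrary assumption), though I haven't yet verified that it changes the growth I'm seeing:

[[outputs.prometheus_client]]
  collectors_exclude = ["gocollector", "process"]
  listen = ":9273"
  ## Assumption: expire series that haven't been written to for 10 minutes, so the
  ## number of exported metrics can decrease without restarting the process
  expiration_interval = "10m"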
Additional info
I initially raised this in the #telegraf Slack channel in the InfluxDB workspace and was directed to raise a bug report. Thread link.