Description
Relevant telegraf.conf
[agent]
collection_jitter = "10s"
debug = false
flush_interval = "5s"
flush_jitter = "5s"
hostname = "$HOST_NAME"
interval = "10s"
logfile = ""
metric_batch_size = 5000
metric_buffer_limit = 100000
omit_hostname = true
precision = ""
quiet = false
round_interval = true
Logs from Telegraf
(Sorry: these logs are from a different config than the one above; I'm tweaking settings to work around the issue...)
Sometimes I see the writer drain the buffer down to zero:
2025-06-17T08:57:34Z D! [outputs.influxdb_v2] Buffer fullness: 35578 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 43.812032ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 37437 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 35.464166ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 44.59292ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 37.239506ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 74.431344ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 67.190767ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 104.431733ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 122.537735ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 4543 metrics in 38.236008ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
It seems the buffer only drains fully when the flush timeout fires:
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 2894 metrics in 36.683051ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 3242 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 35.493495ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 5708 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 34.052446ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 6637 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 37.435861ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 8003 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 36.671503ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 7930 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 27.867088ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 4417 / 50000 metrics
2025-06-17T08:57:42Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 27.015952ms
2025-06-17T08:57:42Z D! [outputs.influxdb_v2] Buffer fullness: 25 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 154.732343ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 10711 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 105.514562ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 14363 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 85.894132ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 15157 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 75.549172ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 17828 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 75.283279ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 21845 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 67.259195ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 25087 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 66.871765ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 25385 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 46.220724ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 20385 / 50000 metrics
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 17.035093ms
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Buffer fullness: 22551 / 50000 metrics
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 92.336851ms
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Buffer fullness: 26555 / 50000 metrics
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 94.172931ms
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Buffer fullness: 33136 / 50000 metrics
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 92.849197ms
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Buffer fullness: 32152 / 50000 metrics
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 5.363979ms
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Buffer fullness: 27152 / 50000 metrics
2025-06-17T08:57:54Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 40.171784ms
2025-06-17T08:57:54Z D! [outputs.influxdb_v2] Buffer fullness: 28125 / 50000 metrics
2025-06-17T08:57:54Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 31.332002ms
2025-06-17T08:57:54Z D! [outputs.influxdb_v2] Buffer fullness: 25863 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 76.994172ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 30122 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 40.388718ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 30938 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 46.760249ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 29757 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 58.84479ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 30153 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 40.064677ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 31301 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 119.661765ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 35937 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 85.453554ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 38721 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 68.982766ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 40698 / 50000 metrics
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 75.132074ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Buffer fullness: 39866 / 50000 metrics
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 57.241426ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Buffer fullness: 35712 / 50000 metrics
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 52.475769ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 55.374607ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 48.899409ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 65.875674ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 38.123181ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 60.528723ms
2025-06-17T08:57:58Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 101.813116ms
2025-06-17T08:57:58Z D! [outputs.influxdb_v2] Wrote batch of 712 metrics in 11.633554ms
2025-06-17T08:57:58Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
System info
Telegraf 1.34, Ubuntu 24.04, Kubernetes
Docker
No response
Steps to reproduce
- Create bursty, high-velocity metrics - in my case, a Kubernetes input that scrapes all Prometheus endpoints every 10 seconds
- My metrics average about 9k/s, sometimes bursting to 5x that (according to the metrics_written measurement), so my 100k buffer should be plenty for 10s
- Set the interval = flush interval = flush jitter = 10s
- Theoretically, flush jitter shouldn't make this worse, because Telegraf is supposed to flush as soon as a batch fills, right? But yes, it can space two flushes up to 20s apart (see the sketch after this list).
- Observe that during the burst, the buffer expands and the dropped-metrics count skyrockets, even though I have set a generous metric_buffer_limit.
- The workaround is to set a very low flush interval (e.g. 100ms).
- That seemed odd, given that the common advice is to set the flush interval no more frequent than the collection interval.
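As a rough illustration of the jitter point above (a standalone sketch, not Telegraf's actual scheduler code): if each flush fires interval plus a random jitter after the previous one, consecutive flushes can be spaced anywhere from 10s to roughly 20s apart:
package main

import (
    "fmt"
    "math/rand"
    "time"
)

func main() {
    interval := 10 * time.Second // flush_interval
    jitter := 10 * time.Second   // flush_jitter

    // Each flush fires interval plus a random delay in [0, jitter) after the
    // previous one, so the gap between two flushes ranges from 10s to ~20s.
    prev := time.Duration(0)
    for i := 0; i < 5; i++ {
        next := prev + interval + time.Duration(rand.Int63n(int64(jitter)))
        fmt.Printf("flush %d at +%v (gap %v)\n", i, next, next-prev)
        prev = next
    }
}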
Expected behavior
Dropped metrics should remain zero, since neither Telegraf nor Influx is maxing out my CPU.
Actual behavior
I was hitting my metrics buffer limit and occasionally dropping metrics.
I started tweaking the flush interval and flush jitter at around 19:30 on the graph. The original flush interval/flush jitter before that was 5s/5s. After cutting the buffer down to 50k and changing the flush interval/jitter to 100ms/100ms, buffer utilisation is much lower.
Additional info
https://github.com/influxdata/telegraf/blob/master/models/running_output.go#L270:
count := atomic.AddInt64(&r.newMetricsCount, 1)
if count == int64(r.MetricBatchSize) {
    atomic.StoreInt64(&r.newMetricsCount, 0)
    select {
    case r.BatchReady <- time.Now():
    default:
    }
}
https://github.com/influxdata/telegraf/blob/master/agent/agent.go#L918
and
select {
case <-ctx.Done():
    logError(a.flushOnce(output, ticker, output.Write))
    return
case <-ticker.Elapsed():
    logError(a.flushOnce(output, ticker, output.Write))
case <-flushRequested:
    logError(a.flushOnce(output, ticker, output.Write))
case <-output.BatchReady:
    logError(a.flushBatch(output, output.WriteBatch))
}
My read of the code is that the agent does a batch flush when BatchReady has an element available. BatchReady is a channel with buffer 1.
In the output code, whenever a metric is added and the batch size is reached, newMetricsCount is reset and a signal is sent on BatchReady. While the agent is still busy flushing that batch, later additions that reach the batch size again will still reset newMetricsCount, but they will not trigger another batch flush (running_output.go's select takes the default case).
The only time another batch flush is triggered is once the agent has finished flushing the current batch and the output has accumulated another MetricBatchSize worth of metrics, ignoring the fact that several batches may already be ready to go.
In that sense, newMetricsCount is a misnomer: it always stores the number of metrics added modulo the batch size, regardless of whether further batches could already be written.
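To illustrate the coalescing, here is a minimal standalone sketch (hypothetical code, not copied from Telegraf) of a size-1 buffered channel combined with a non-blocking send: however many "batch ready" signals are raised while the consumer is busy, only one survives:
package main

import (
    "fmt"
    "time"
)

func main() {
    batchReady := make(chan time.Time, 1) // same shape as r.BatchReady

    // Producer side: ten batches become ready while the consumer is busy.
    for i := 0; i < 10; i++ {
        select {
        case batchReady <- time.Now():
        default: // channel already holds a signal; this trigger is silently dropped
        }
    }

    // Consumer side: wakes up later and finds only one signal to act on.
    received := 0
    for {
        select {
        case <-batchReady:
            received++
        default:
            fmt.Printf("received %d signal(s) for 10 ready batches\n", received) // prints 1
            return
        }
    }
}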
One potential fix I can think of is to rearrange the conditions. When adding metrics, never reset the count to zero:
count := atomic.AddInt64(&r.newMetricsCount, 1)
if count >= int64(r.MetricBatchSize) {
    select {
    case r.BatchReady <- time.Now():
    default:
    }
}
Then, in updateTransaction, reduce newMetricsCount by the number of metrics written, and trigger another batch flush if a full batch is still pending:
func (r *RunningOutput) updateTransaction(tx *Transaction, err error) {
    // No error indicates all metrics were written successfully
    if err == nil {
        // If at least another full batch is still pending, trigger a follow-up flush
        if atomic.AddInt64(&r.newMetricsCount, -int64(len(tx.Batch))) >= int64(r.MetricBatchSize) {
            select {
            case r.BatchReady <- time.Now():
            default:
            }
        }
        tx.AcceptAll()
        return
    }
(not sure how to deal with the error case, just thinking out loud...)