
Possible metrics buffer overflow when metrics are written faster than output flush completion #17200

@lowjoel

Description

Relevant telegraf.conf

[agent]
  collection_jitter = "10s"
  debug = false
  flush_interval = "5s"
  flush_jitter = "5s"
  hostname = "$HOST_NAME"
  interval = "10s"
  logfile = ""
  metric_batch_size = 5000
  metric_buffer_limit = 100000
  omit_hostname = true
  precision = ""
  quiet = false
  round_interval = true

Logs from Telegraf

(sorry: the logs below are from a slightly different config than the one above; I'm still tweaking to work around this...)

Sometimes I see the writer drain the buffer down to zero:

2025-06-17T08:57:34Z D! [outputs.influxdb_v2] Buffer fullness: 35578 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 43.812032ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 37437 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 35.464166ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 44.59292ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 37.239506ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 74.431344ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 67.190767ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 104.431733ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 122.537735ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 4543 metrics in 38.236008ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics

It seems a full flush only happens when the flush timer fires:

2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 2894 metrics in 36.683051ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 3242 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 35.493495ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 5708 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 34.052446ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 6637 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 37.435861ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 8003 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 36.671503ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 7930 / 50000 metrics
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 27.867088ms
2025-06-17T08:57:41Z D! [outputs.influxdb_v2] Buffer fullness: 4417 / 50000 metrics
2025-06-17T08:57:42Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 27.015952ms
2025-06-17T08:57:42Z D! [outputs.influxdb_v2] Buffer fullness: 25 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 154.732343ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 10711 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 105.514562ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 14363 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 85.894132ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 15157 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 75.549172ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 17828 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 75.283279ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 21845 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 67.259195ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 25087 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 66.871765ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 25385 / 50000 metrics
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 46.220724ms
2025-06-17T08:57:46Z D! [outputs.influxdb_v2] Buffer fullness: 20385 / 50000 metrics
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 17.035093ms
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Buffer fullness: 22551 / 50000 metrics
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 92.336851ms
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Buffer fullness: 26555 / 50000 metrics
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 94.172931ms
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Buffer fullness: 33136 / 50000 metrics
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 92.849197ms
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Buffer fullness: 32152 / 50000 metrics
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 5.363979ms
2025-06-17T08:57:53Z D! [outputs.influxdb_v2] Buffer fullness: 27152 / 50000 metrics
2025-06-17T08:57:54Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 40.171784ms
2025-06-17T08:57:54Z D! [outputs.influxdb_v2] Buffer fullness: 28125 / 50000 metrics
2025-06-17T08:57:54Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 31.332002ms
2025-06-17T08:57:54Z D! [outputs.influxdb_v2] Buffer fullness: 25863 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 76.994172ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 30122 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 40.388718ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 30938 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 46.760249ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 29757 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 58.84479ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 30153 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 40.064677ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 31301 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 119.661765ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 35937 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 85.453554ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 38721 / 50000 metrics
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 68.982766ms
2025-06-17T08:57:56Z D! [outputs.influxdb_v2] Buffer fullness: 40698 / 50000 metrics
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 75.132074ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Buffer fullness: 39866 / 50000 metrics
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 57.241426ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Buffer fullness: 35712 / 50000 metrics
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 52.475769ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 55.374607ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 48.899409ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 65.875674ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 38.123181ms
2025-06-17T08:57:57Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 60.528723ms
2025-06-17T08:57:58Z D! [outputs.influxdb_v2] Wrote batch of 5000 metrics in 101.813116ms
2025-06-17T08:57:58Z D! [outputs.influxdb_v2] Wrote batch of 712 metrics in 11.633554ms
2025-06-17T08:57:58Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics

System info

Telegraf 1.34, Ubuntu 24.04, Kubernetes


Steps to reproduce

  1. Create bursty, high-velocity metrics - in my case, a Kubernetes input that scrapes all Prometheus endpoints every 10 seconds.
    1. My metrics average about 9k/s, sometimes bursting to 5x that (according to the metrics_written measurement), so my 100k buffer should be plenty for 10s.
  2. Set interval = flush interval = flush jitter = 10s.
    1. Theoretically flush jitter shouldn't make this worse, because flushes are supposed to happen whenever a batch is full, right? But yes, this can put two timed flushes up to 20s apart.
  3. Observe that during a burst the buffer expands and the dropped-metrics count skyrockets, even though the buffer limit should comfortably cover the burst.
  4. The workaround is to set a very low flush interval (e.g. 100ms).
    1. That seems odd compared to the common advice of setting the flush interval no more frequent than the collection interval.
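The workaround from step 4 corresponds to an agent config along these lines (the 50k buffer and 100ms values are the ones I ended up with, described under "Actual behavior" below):

```toml
[agent]
  interval = "10s"
  metric_batch_size = 5000
  metric_buffer_limit = 50000
  # Workaround: flush far more often than the collection interval so the
  # buffer keeps draining even when batch-ready triggers are missed.
  flush_interval = "100ms"
  flush_jitter = "100ms"
```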

Expected behavior

Dropped metrics should remain at zero, since neither Telegraf nor InfluxDB is maxing out my CPU.

Actual behavior

I was hitting my metrics buffer limit and occasionally dropping metrics.

[Graph: metric buffer utilisation over time]

I started tweaking the flush interval and flush jitter at around 19:30 on the graph; the original flush interval/jitter before that was 5s/5s. After cutting the buffer down to 50k and changing the flush interval/jitter to 100ms/100ms, I get much lower utilisation.

Additional info

https://github.com/influxdata/telegraf/blob/master/models/running_output.go#L270:

	count := atomic.AddInt64(&r.newMetricsCount, 1)
	if count == int64(r.MetricBatchSize) {
		atomic.StoreInt64(&r.newMetricsCount, 0)
		select {
		case r.BatchReady <- time.Now():
		default:
		}
	}

https://github.com/influxdata/telegraf/blob/master/agent/agent.go#L918:

		select {
		case <-ctx.Done():
			logError(a.flushOnce(output, ticker, output.Write))
			return
		case <-ticker.Elapsed():
			logError(a.flushOnce(output, ticker, output.Write))
		case <-flushRequested:
			logError(a.flushOnce(output, ticker, output.Write))
		case <-output.BatchReady:
			logError(a.flushBatch(output, output.WriteBatch))
		}

My read of the code is that the agent performs a batch flush when BatchReady has an element available; BatchReady is a channel with a buffer of 1.

In the output code, whenever a metric is added and the batch size is reached, newMetricsCount is reset and a timestamp is sent on BatchReady. While the agent is still flushing that batch, subsequent additions that reach the batch size again reset newMetricsCount, but no further batch flushes are triggered (the select in running_output.go takes the default case because the channel is already full).

The next batch flush is only triggered once the agent has finished flushing and the output has accumulated another full batch of metrics, even though several batches' worth of metrics may already be waiting.

In that sense, newMetricsCount is a misnomer: it always stores the number of metrics added modulo the batch size, ignoring whether further batches are already ready to run.
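The dropped-trigger behaviour can be shown in isolation. This is a standalone sketch (not Telegraf code): BatchReady is modelled as a plain buffered channel of capacity 1, with the same non-blocking send pattern as running_output.go:

```go
package main

import "fmt"

// deliveredSignals models BatchReady as a channel with buffer size 1
// and counts how many of `thresholds` non-blocking sends actually get
// through while nobody is receiving (i.e. while the agent is busy
// flushing the previous batch).
func deliveredSignals(thresholds int) int {
	batchReady := make(chan struct{}, 1)
	sent := 0
	for i := 0; i < thresholds; i++ {
		select {
		case batchReady <- struct{}{}:
			sent++ // signal accepted
		default:
			// channel already holds a pending signal: this trigger is lost
		}
	}
	return sent
}

func main() {
	// Five batch-size thresholds are crossed while the flush goroutine
	// is busy; only one signal fits in the buffer, so the consumer will
	// later perform one batch flush instead of five.
	fmt.Printf("thresholds crossed: 5, signals delivered: %d\n", deliveredSignals(5))
	// → thresholds crossed: 5, signals delivered: 1
}
```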

One potential fix I can think of is to rearrange the conditions. When adding metrics, never reset the count to zero:

	count := atomic.AddInt64(&r.newMetricsCount, 1)
	if count >= int64(r.MetricBatchSize) {
		select {
		case r.BatchReady <- time.Now():
		default:
		}
	}

Then, in updateTransaction, reduce newMetricsCount by the number of metrics written, and trigger another batch flush if one is already ready:

func (r *RunningOutput) updateTransaction(tx *Transaction, err error) {
	// No error indicates all metrics were written successfully
	if err == nil {
		if atomic.AddInt64(&r.newMetricsCount, -int64(len(tx.Batch))) > int64(r.MetricBatchSize) {
			select {
			case r.BatchReady <- time.Now():
			default:
			}
		}
		tx.AcceptAll()
		return
	}
	// ... (error case elided)
}

(not sure how to deal with the error case, just thinking out loud...)
