remote_write client metrics: a partially failed batch is counted as all samples failed #14323

@thampiotr

What did you do?

  1. Use a remote_write pipeline to send samples and exemplars to a remote backend.
  2. Send a batch of samples and exemplars, for example 2000 samples and 1 exemplar.
  3. The one exemplar is rejected, for example, due to this issue.
  4. remote_write receives an HTTP status 400, logs a "non-recoverable error" message, and drops the batch.
  5. The failure metrics are incremented by the number of samples, exemplars, and histograms in the entire batch, even if all samples were ingested and only one exemplar failed.
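
To make the accounting difference concrete, here is a minimal Go sketch of the two ways the failure counters could be updated for the batch in the steps above. This is not the Prometheus implementation: `batchCounts`, `failedTotals`, `addWholeBatch`, and `addRejectedOnly` are hypothetical names made up for illustration, and the second method assumes the client somehow knows how many data points of each type were rejected, which a bare HTTP 400 may not tell it.

```go
package main

import "fmt"

// batchCounts is a hypothetical stand-in for the per-type sizes of one
// remote_write batch. It is not a Prometheus type.
type batchCounts struct {
	Samples, Exemplars, Histograms int
}

// failedTotals is a hypothetical stand-in for the failure counters.
type failedTotals struct {
	Samples, Exemplars, Histograms int
}

// addWholeBatch models the behavior described in step 5: every data point
// in the batch is counted as failed, regardless of what was rejected.
func (f *failedTotals) addWholeBatch(sent batchCounts) {
	f.Samples += sent.Samples
	f.Exemplars += sent.Exemplars
	f.Histograms += sent.Histograms
}

// addRejectedOnly models the accounting this issue asks for. It assumes
// the per-type rejected counts are available to the client.
func (f *failedTotals) addRejectedOnly(rejected batchCounts) {
	f.Samples += rejected.Samples
	f.Exemplars += rejected.Exemplars
	f.Histograms += rejected.Histograms
}

func main() {
	sent := batchCounts{Samples: 2000, Exemplars: 1}
	rejected := batchCounts{Exemplars: 1} // only the exemplar was refused

	var current, desired failedTotals
	current.addWholeBatch(sent)
	desired.addRejectedOnly(rejected)

	fmt.Printf("whole batch counted as failed: %+v\n", current)
	fmt.Printf("only rejected points counted:  %+v\n", desired)
}
```

With the example numbers, the first approach reports 2000 failed samples even though none of them were rejected, which is what skews a success rate computed from these metrics.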

What did you expect to see?

  1. The failure metrics would ideally reflect how many data points actually failed in the batch. Otherwise the success rate metrics are not accurate.
  2. Exemplars and native histograms are experimental. It may be desirable to retry the batch without exemplars and native histograms, so that the samples are not dropped?
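
The retry idea in point 2 could look roughly like the Go sketch below. It is only an illustration under stated assumptions, not a proposal for the actual remote_write code: `batch`, `sender`, and `sendWithFallback` are hypothetical names, the transport is reduced to a function that returns an HTTP status code, and the fallback is attempted once, only on a 400, and only when the batch carries exemplars or histograms.

```go
package main

import "fmt"

// batch is a hypothetical, simplified view of one remote_write batch.
type batch struct {
	Samples    []float64
	Exemplars  []string
	Histograms []string
}

// sender is a hypothetical transport that returns the HTTP status code
// the backend answered with.
type sender func(b batch) int

// sendWithFallback sends the full batch. If the backend answers 400 and
// the batch contains exemplars or histograms, it retries once with the
// samples only. It returns the batch that was finally accepted.
func sendWithFallback(send sender, b batch) (batch, error) {
	status := send(b)
	if status/100 == 2 {
		return b, nil
	}
	hasExperimental := len(b.Exemplars) > 0 || len(b.Histograms) > 0
	if status != 400 || !hasExperimental {
		return batch{}, fmt.Errorf("send failed with HTTP status %d", status)
	}
	// Assumption: the rejection may have been caused by the experimental
	// data types, so try again without them.
	stripped := batch{Samples: b.Samples}
	if status := send(stripped); status/100 != 2 {
		return batch{}, fmt.Errorf("retry without exemplars and histograms failed with HTTP status %d", status)
	}
	return stripped, nil
}

func main() {
	// Fake backend for the demo: rejects any batch that carries exemplars.
	fake := func(b batch) int {
		if len(b.Exemplars) > 0 {
			return 400
		}
		return 200
	}
	in := batch{Samples: []float64{1, 2, 3}, Exemplars: []string{"trace-a"}}
	accepted, err := sendWithFallback(fake, in)
	fmt.Println(len(accepted.Samples), len(accepted.Exemplars), err)
}
```

One caveat the sketch does not address: in the scenario described here the samples may already have been ingested when the 400 comes back, so a retry would send them a second time, and how the backend handles that is outside what this sketch covers.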

What did you see instead? Under which circumstances?

The failure metrics counted the entire batch as failed, even when the samples were ingested correctly.

System information

No response

Prometheus version

No response

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

ts=2024-06-19T17:38:16.946175Z level=error msg="non-recoverable error" ... url=(...) count=5 exemplarCount=1 err="server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester ...: user=...: err: out of order exemplar. timestamp=2024-06-19T17:35:15Z, series=a_test_total{...}, exemplar={...}"
