Closed
Description
What did you do?
- Using a remote_write pipeline to send samples and exemplars to a remote backend
- A batch of samples and exemplars is sent, for example 2000 samples and 1 exemplar.
- The one exemplar is rejected, for example, due to this issue.
- remote_write receives an HTTP status 400, logs a "non-recoverable error" message, and drops the batch.
- The failure metrics are incremented by the number of samples, exemplars, and histograms in the entire batch, even if all samples were ingested and only one exemplar failed.
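The accounting described in the steps above can be sketched roughly as follows. This is only an illustration, not the actual Prometheus code: the `Batch` and `FailureCounters` types and the `recordNonRecoverable` method are hypothetical names made up for this example.

```go
package main

import "fmt"

// Batch is a hypothetical stand-in for the contents of one remote_write request.
type Batch struct {
	Samples    int
	Exemplars  int
	Histograms int
}

// FailureCounters is a hypothetical stand-in for the failure metrics.
type FailureCounters struct {
	Samples    int
	Exemplars  int
	Histograms int
}

// recordNonRecoverable models the reported behavior: on a non-recoverable
// error, every item in the batch is counted as failed, regardless of which
// item the backend actually rejected.
func (c *FailureCounters) recordNonRecoverable(b Batch) {
	c.Samples += b.Samples
	c.Exemplars += b.Exemplars
	c.Histograms += b.Histograms
}

func main() {
	var c FailureCounters
	// The batch from the report: 2000 samples and 1 exemplar.
	c.recordNonRecoverable(Batch{Samples: 2000, Exemplars: 1})
	fmt.Printf("failed samples=%d exemplars=%d histograms=%d\n",
		c.Samples, c.Exemplars, c.Histograms)
	// prints: failed samples=2000 exemplars=1 histograms=0
}
```

In this model, 2000 samples are reported as failed even though only the single exemplar was rejected.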
What did you expect to see?
- The failure metrics would ideally reflect how many data points have failed in the batch. Otherwise we don't have accurate success rate metrics.
- Exemplars and native histograms are experimental. Would it be desirable to retry the batch without exemplars and native histograms, so that the samples are not dropped?
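A minimal sketch of what such a fallback could look like, under loud assumptions: `Batch`, `Failed`, `sendFunc`, and `sendWithFallback` are hypothetical names, not Prometheus APIs. The sketch also assumes that resending the samples is safe for the backend, which this report does not establish. Retry and backoff for recoverable errors are out of scope here.

```go
package main

import "fmt"

// Batch is a hypothetical stand-in for one remote_write request.
type Batch struct{ Samples, Exemplars, Histograms int }

// Failed holds per-type failure counts for one send attempt.
type Failed struct{ Samples, Exemplars, Histograms int }

// sendFunc is a hypothetical transport: it sends a batch and returns the
// HTTP status code of the response.
type sendFunc func(Batch) int

// allFailed counts every item in the batch as failed.
func allFailed(b Batch) Failed {
	return Failed{Samples: b.Samples, Exemplars: b.Exemplars, Histograms: b.Histograms}
}

// sendWithFallback sends the batch. On HTTP 400, if the batch carries
// exemplars or histograms, it retries once with samples only and then
// counts as failed only what was actually dropped.
func sendWithFallback(b Batch, send sendFunc) Failed {
	status := send(b)
	if status/100 == 2 {
		return Failed{}
	}
	if status != 400 || (b.Exemplars == 0 && b.Histograms == 0) {
		return allFailed(b)
	}
	// Assumption: the rejection was caused by the experimental data.
	if send(Batch{Samples: b.Samples})/100 == 2 {
		return Failed{Exemplars: b.Exemplars, Histograms: b.Histograms}
	}
	return allFailed(b)
}

func main() {
	// Fake backend: rejects any batch that contains an exemplar.
	backend := func(b Batch) int {
		if b.Exemplars > 0 {
			return 400
		}
		return 200
	}
	f := sendWithFallback(Batch{Samples: 2000, Exemplars: 1}, backend)
	fmt.Printf("failed samples=%d exemplars=%d histograms=%d\n",
		f.Samples, f.Exemplars, f.Histograms)
	// prints: failed samples=0 exemplars=1 histograms=0
}
```

With this approach the failure counts would cover only the exemplar, at the cost of one extra request per rejected batch.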
What did you see instead? Under which circumstances?
The failure metrics counted the entire batch as failed, even when the samples were ingested correctly.
System information
No response
Prometheus version
No response
Prometheus configuration file
No response
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
ts=2024-06-19T17:38:16.946175Z level=error msg="non-recoverable error" ... url=(...) count=5 exemplarCount=1 err="server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester ...: user=...: err: out of order exemplar. timestamp=2024-06-19T17:35:15Z, series=a_test_total{...}, exemplar={...}"