Add `BatchingOptions.RetryTimeLimit` #2120

nblumhardt · 2024-10-01T08:51:26Z

This PR improves batching sink behavior by making the retry time for failed batches explicit and configurable, through BatchingOptions.RetryTimeLimit. Previously, the retry time was computed from BufferingTimeLimit in a fairly opaque manner.

Batching still uses exponential back-off, and checks "expected" retry times against the limit, so the final retry for a batch may happen earlier than the configured maximum. The maximum granularity of retries is now 1 minute, so this puts an upper bound on the difference between the retry time limit and the actual last retry.

The failure sequence is:

If no failures have occurred, collect and send batches within BufferingTimeLimit (two seconds by default)
When the first failure occurs, retry at max(BufferingTimeLimit, five seconds) later, unless this would exceed RetryTimeLimit (ten minutes by default)
Exponentially back off the retry interval until it's no more than one minute, trying each time unless the attempt would exceed RetryTimeLimit
Once RetryTimeLimit is hit, drop the current batch
Attempt to send another batch at each interval, continuing to use the exponential back-off with intervals no more than one minute apart, dropping failed batches immediately
Once ten consecutive batches have failed and been dropped, drain the whole queue
Continue at the same progression of intervals, now dropping the batch and draining the queue on each subsequent failure

By default, this process buffers all data for 10 minutes, and drops one batch per minute for at least 10 more; once those two stages have completed, stop buffering and drain as much data as possible to prevent negative impacts on the host application.

The main failure modes are:

Backpressure due to queue exhaustion (the queue limit, 100000 by default, needs to last at least 10 minutes); this is not new, and is reasonably mitigated by either increasing queue size, or substantially reducing the retry time
Stalls due to "poison" events; these are also not new, and also able to be mitigated by reducing retry time, but it'd be nice to open up some more options (fast failure with some ILogEventFailureListener tie-in)
Byzantine failures, e.g. where the target sink accepts only one batch per 10 minutes; in this case we'll keep resetting the timer but not be able to work through any significant portion of the queue; also not new, and no current mitigation

There are definitely still opportunities to improve this, but each option we open up narrows down the range of possible improvements to the algorithm. After adding the new setting, I think we'd be wise to leave things as-is for at least a year, just to let the dust settle.

Note that systems with non-default buffering delays might be relying on these to generate very long buffering times. I think it's reasonable to expect these to migrate to the new setting, but we'll need to point the change out prominently in the release notes.

nblumhardt · 2024-10-01T08:53:08Z

src/Serilog/Util/TimeProvider.cs

+/// A super-simple, cut-down subset of `System.TimeProvider` which we use internally to avoid a package dependency
+/// on platforms without it.
+/// </summary>
+abstract class TimeProvider


Adding a package dependency to pull this in on downlevel platforms would be a significant change that we'd hold for a major version bump. The minimal subset of System.TimeProvider implemented here should at least get us started with integrating it in a few places.

nblumhardt · 2024-10-01T23:03:56Z

I've kicked the tyres in some test projects but I think this'll need some wider use/dogfooding before it's safe to ship, so I'll merge later today and get the ball rolling unless I hear from anyone interested in reviewing :-)

Add BatchingOptions.RetryTimeLimit

c6b424e

nblumhardt commented Oct 1, 2024

View reviewed changes

Improve some exception messages.

cb42d28

nblumhardt merged commit fe40d19 into serilog:dev Oct 2, 2024
1 check passed

This was referenced Oct 4, 2024

Proposal: Add support for TimeProvider #1920

Open

4.1.0 Release #2123

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add `BatchingOptions.RetryTimeLimit` #2120

Add `BatchingOptions.RetryTimeLimit` #2120

Uh oh!

nblumhardt commented Oct 1, 2024

Uh oh!

nblumhardt Oct 1, 2024

Uh oh!

nblumhardt commented Oct 1, 2024

Uh oh!

Uh oh!

Uh oh!

Add BatchingOptions.RetryTimeLimit #2120

Add BatchingOptions.RetryTimeLimit #2120

Uh oh!

Conversation

nblumhardt commented Oct 1, 2024

Uh oh!

nblumhardt Oct 1, 2024

Choose a reason for hiding this comment

Uh oh!

nblumhardt commented Oct 1, 2024

Uh oh!

Uh oh!

Uh oh!

Add `BatchingOptions.RetryTimeLimit` #2120

Add `BatchingOptions.RetryTimeLimit` #2120