Skip to content

Conversation

nblumhardt
Copy link
Member

This PR improves batching sink behavior by making the retry time for failed batches explicit and configurable, through BatchingOptions.RetryTimeLimit. Previously, the retry time was computed from BufferingTimeLimit in a fairly opaque manner.

Batching still uses exponential back-off, and checks "expected" retry times against the limit, so the final retry for a batch may happen earlier than the configured maximum. The maximum granularity of retries is now 1 minute, so this puts an upper bound on the difference between the retry time limit and the actual last retry.

The failure sequence is:

  • If no failures have occurred, collect and send batches within BufferingTimeLimit (two seconds by default)
  • When the first failure occurs, retry at max(BufferingTimeLimit, five seconds) later, unless this would exceed RetryTimeLimit (ten minutes by default)
  • Exponentially back off the retry interval until it's no more than one minute, trying each time unless the attempt would exceed RetryTimeLimit
  • Once RetryTimeLimit is hit, drop the current batch
  • Attempt to send another batch at each interval, continuing to use the exponential back-off with intervals no more than one minute apart, dropping failed batches immediately
  • Once ten consecutive batches have failed and been dropped, drain the whole queue
  • Continue at the same progression of intervals, now dropping the batch and draining the queue on each subsequent failure

By default, this process buffers all data for 10 minutes, and drops one batch per minute for at least 10 more; once those two stages have completed, stop buffering and drain as much data as possible to prevent negative impacts on the host application.

The main failure modes are:

  • Backpressure due to queue exhaustion (the queue limit, 100000 by default, needs to last at least 10 minutes); this is not new, and is reasonably mitigated by either increasing queue size, or substantially reducing the retry time
  • Stalls due to "poison" events; these are also not new, and also able to be mitigated by reducing retry time, but it'd be nice to open up some more options (fast failure with some ILogEventFailureListener tie-in)
  • Byzantine failures, e.g. where the target sink accepts only one batch per 10 minutes; in this case we'll keep resetting the timer but not be able to work through any significant portion of the queue; also not new, and no current mitigation

There are definitely still opportunities to improve this, but each option we open up narrows down the range of possible improvements to the algorithm. After adding the new setting, I think we'd be wise to leave things as-is for at least a year, just to let the dust settle.

Note that systems with non-default buffering delays might be relying on these to generate very long buffering times. I think it's reasonable to expect these to migrate to the new setting, but we'll need to point the change out prominently in the release notes.

/// A super-simple, cut-down subset of `System.TimeProvider` which we use internally to avoid a package dependency
/// on platforms without it.
/// </summary>
abstract class TimeProvider
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a package dependency to pull this in on downlevel platforms would be a significant change that we'd hold for a major version bump. The minimal subset of System.TimeProvider implemented here should at least get us started with integrating it in a few places.

@nblumhardt
Copy link
Member Author

I've kicked the tyres in some test projects but I think this'll need some wider use/dogfooding before it's safe to ship, so I'll merge later today and get the ball rolling unless I hear from anyone interested in reviewing :-)

@nblumhardt nblumhardt merged commit fe40d19 into serilog:dev Oct 2, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant