Add BatchingOptions.RetryTimeLimit
#2120
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR improves batching sink behavior by making the retry time for failed batches explicit and configurable, through
BatchingOptions.RetryTimeLimit
. Previously, the retry time was computed fromBufferingTimeLimit
in a fairly opaque manner.Batching still uses exponential back-off, and checks "expected" retry times against the limit, so the final retry for a batch may happen earlier than the configured maximum. The maximum granularity of retries is now 1 minute, so this puts an upper bound on the difference between the retry time limit and the actual last retry.
The failure sequence is:
By default, this process buffers all data for 10 minutes, and drops one batch per minute for at least 10 more; once those two stages have completed, stop buffering and drain as much data as possible to prevent negative impacts on the host application.
The main failure modes are:
ILogEventFailureListener
tie-in)There are definitely still opportunities to improve this, but each option we open up narrows down the range of possible improvements to the algorithm. After adding the new setting, I think we'd be wise to leave things as-is for at least a year, just to let the dust settle.
Note that systems with non-default buffering delays might be relying on these to generate very long buffering times. I think it's reasonable to expect these to migrate to the new setting, but we'll need to point the change out prominently in the release notes.