Description
At present, if a handler continually fails to make progress on a given stream, the Scheduler will continually retry, resulting in:
- running 'hot'; there are no backoffs and/or anything else to ameliorate the impact of things failing
- no direct way to determine that such a state has been entered from a programmatic, alerting or dashboards perspective. At present, a number of secondary effects will hint at the problem:
- lack of progress if observing the read and/or checkpoint positions on the source, e.g. for the CFP or Feed readers
- increase in exception outcomes on dashboards
- reduction in successful outcomes on dashboards
In order to be able to define a clear alertable condition, it is proposed to maintain, on a per-stream basis:
- number of consecutive failures, timestamp of first failure
- number of consecutive successes without progress, timestamp of first attempt
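
As a minimal sketch (the `StreamHealth` name and fields are illustrative, not existing Propulsion types), the per-stream state might look like this:

```fsharp
open System

/// Illustrative per-stream state the Scheduler would need in order to derive the figures above
type StreamHealth =
    {   /// Length of the current run of consecutive failed handler invocations
        consecutiveFailures : int
        /// When the current run of failures started (None if the last outcome was not a failure)
        failingSince : DateTimeOffset option
        /// Length of the current run of successful invocations that did not advance the stream
        consecutiveStalled : int
        /// When the current run of non-progressing attempts started
        stalledSince : DateTimeOffset option }
    static member Initial =
        {   consecutiveFailures = 0; failingSince = None
            consecutiveStalled = 0; stalledSince = None }
```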
While the necessary data may be maintained at the stream level, it's problematic to surface it either as:
- a log record per invocation - Handlers can receive extremely high traffic and adding this overhead as a fixed cost is not likely to work well
- metrics tagged/instanced at the stream level - this will lead to excessive cardinality as it's unbounded, while likely making querying more complex (though more metrics give more scope for alerting, they do not ease the task of determining which ones are relevant to someone coming to a set of metrics fresh)
Metrics
timeFailing
: now - oldest failing since (tags: app, category)
countFailing
: number of streams whose last outcome was a failure (tags: app, category)
timeStalled
: now - oldest stalled (tags: app, category)
countStalled
: number of streams whose last handler invocation did not make progress (and has messages waiting) (tags: app, category)
🤔 - longestRunning
: oldest dispatched call in flight that has yet to yield a success/fail
🤔 - countRunning
: number of handler invocations in flight
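
For illustration, the four core gauges could be derived from a snapshot of the per-stream states sketched above; aggregating before emission keeps cardinality fixed at the app/category tags, however many streams are tracked (hypothetical code, building on the `StreamHealth` sketch):

```fsharp
open System

// Assumes the StreamHealth sketch above; returns the figures that would be emitted
// per stats interval, tagged with app and category rather than per stream
let computeGauges (now : DateTimeOffset) (healths : StreamHealth seq) =
    let failing = healths |> Seq.choose (fun h -> h.failingSince) |> Seq.toArray
    let stalled = healths |> Seq.choose (fun h -> h.stalledSince) |> Seq.toArray
    let oldest (starts : DateTimeOffset[]) =
        if Array.isEmpty starts then TimeSpan.Zero else now - Array.min starts
    {|  timeFailing = oldest failing; countFailing = failing.Length
        timeStalled = oldest stalled; countStalled = stalled.Length |}
```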
Example Alerts
max timeFailing > 5m
as a reasonable default for the average projector that is writing to a rate-limited store
max timeStalled > 2m
for a watchdog that's responsible for cancelling and/or pumping workflows that have not reached a conclusion within 1m
🤔 - max timeRunning > 1h
for a workflow engine processing step sanity check
Pseudocode Logic
When a success happens:
- consecutive failures/failing since is cleared
- (if progress is made) consecutive stalled is cleared
When a failure happens:
- consecutive failures/failing since is either initialized to (1,now) or incremented
When a success with lack of progress happens:
- consecutive stalled is either initialized to (1,now) or incremented
When a dispatch or completion of a call happens:
- update the longest running task start time metric
- record the latency in the metrics immediately vs waiting to surface it every n ms?
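
A hedged sketch of the success/failure/stall transitions listed above, reusing the hypothetical `StreamHealth` type; `madeProgress` would be determined by comparing the stream's write position before and after the invocation:

```fsharp
open System

// outcome is Ok madeProgress for a successful invocation, Error exn for a failed one
let recordOutcome (now : DateTimeOffset) (health : StreamHealth) (outcome : Result<bool, exn>) =
    match outcome with
    | Ok true ->
        // success with progress: both the failing and the stalled runs end
        StreamHealth.Initial
    | Ok false ->
        // success without progress: the failing run ends; the stalled run starts or continues
        { health with
            consecutiveFailures = 0; failingSince = None
            consecutiveStalled = health.consecutiveStalled + 1
            stalledSince = health.stalledSince |> Option.orElse (Some now) }
    | Error _ ->
        // failure: the failing run starts or continues
        { health with
            consecutiveFailures = health.consecutiveFailures + 1
            failingSince = health.failingSince |> Option.orElse (Some now) }
```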
🤔 there should probably be a set of callbacks the projector provides that can be used to hook in metrics, but we also want the system to log summaries out of the box (see the sketch below)
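
One possible shape for such a hook (purely illustrative; `ISchedulerObserver` is not an existing Propulsion interface), with a default implementation that just logs summaries so that wiring up metrics remains optional:

```fsharp
open System

/// Hypothetical hook the Scheduler could raise, letting an app forward figures to its metrics system of choice
type ISchedulerObserver =
    /// Raised as each handler invocation completes
    abstract HandleCompleted : stream : string * latency : TimeSpan * succeeded : bool * madeProgress : bool -> unit
    /// Raised on every stats interval with the aggregated failing/stalled figures for a category
    abstract EmitHealth : category : string * timeFailing : TimeSpan * countFailing : int * timeStalled : TimeSpan * countStalled : int -> unit

/// Out-of-the-box behavior: log summaries only
let loggingObserver =
    { new ISchedulerObserver with
        member _.HandleCompleted (stream, latency, succeeded, _madeProgress) =
            if not succeeded then printfn "Handler for %s failed after %O" stream latency
        member _.EmitHealth (category, timeFailing, countFailing, timeStalled, countStalled) =
            printfn "%s: %d failing (oldest %O), %d stalled (oldest %O)"
                category countFailing timeFailing countStalled timeStalled }
```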
Other ideas/questions
- is being able to inject backoffs based on these metrics for a given specific stream important?
- how/would one want to be able to internally exit/restart the projector host app based on the values ?
- is there some more important intermittent failure pattern this will be useless for ?
- I'm excluding higher-level lag-based metrics, e.g. the sort of thing https://github.com/linkedin/Burrow does
tagging @ameier38 @belcher-rok @deviousasti @dunnry @enricosada @ragiano215 @swrhim @wantastic84 who have been party to discussions in this space (and may be able to extend or, hopefully, simplify the requirements above, or link to a better writeup or concept regarding this)