Description
At present, if a handler continually fails to make progress on a given stream, the Scheduler will continually retry, resulting in:
- running 'hot'; there are no backoffs and/or anything else to ameliorate the impact of things failing
- no direct way to determine that such a state has been entered from a programmatic, alerting or dashboards perspective. At present, a number of secondary effects will hint at the problem:
- lack of progress if observing the read and/or checkpoint positions on the source, e.g. for the CFP or Feed readers
- increase in exception outcomes on dashboards
- reduction in successful outcomes on dashboards
In order to be able to define a clear alertable condition, it is proposed to maintain, on a per-stream basis:
- number of consecutive failures, timestamp of first failure
- number of consecutive successes without progress, timestamp of first attempt
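
As a minimal sketch (the `StreamHealth` name and fields are illustrative, not existing Propulsion types), the per-stream state might look like this:

```fsharp
open System

/// Illustrative per-stream state the Scheduler would need in order to derive the figures above
type StreamHealth =
    {   /// Length of the current run of consecutive failed handler invocations
        consecutiveFailures : int
        /// When the current run of failures started (None if the last outcome was not a failure)
        failingSince : DateTimeOffset option
        /// Length of the current run of successful invocations that did not advance the stream
        consecutiveStalled : int
        /// When the current run of non-progressing attempts started
        stalledSince : DateTimeOffset option }
    static member Initial =
        {   consecutiveFailures = 0; failingSince = None
            consecutiveStalled = 0; stalledSince = None }
```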
While the necessary data may be maintained at the stream level, it's problematic to surface it either as:
- a log record per invocation - Handlers can receive extremely high traffic and adding this overhead as a fixed cost is not likely to work well
- metrics tagged/instanced at the stream level - this will lead to excessive cardinality as it's unbounded, while likely making querying more complex (though more metrics give more scope for alerting, they do not ease the task of determining which ones are relevant to someone coming to a set of metrics fresh)
Metrics
timeFailing
: now - oldest failing since (tags: app, category)
countFailing
: number of streams whose last outcome was a failure (tags: app, category)
timeStalled
: now - oldest stalled (tags: app, category)
countStalled
: number of streams whose last handler invocation did not make progress (and has messages waiting) (tags: app, category)
🤔 - longestRunning
: oldest dispatched call in flight that has yet to yield a success/fail
🤔 - countRunning
: number of handler invocations in flight
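
For illustration, the four core gauges could be derived from a snapshot of the per-stream states sketched above; aggregating before emission keeps cardinality fixed at the app/category tags, however many streams are tracked (hypothetical code, building on the `StreamHealth` sketch):

```fsharp
open System

// Assumes the StreamHealth sketch above; returns the figures that would be emitted
// per stats interval, tagged with app and category rather than per stream
let computeGauges (now : DateTimeOffset) (healths : StreamHealth seq) =
    let failing = healths |> Seq.choose (fun h -> h.failingSince) |> Seq.toArray
    let stalled = healths |> Seq.choose (fun h -> h.stalledSince) |> Seq.toArray
    let oldest (starts : DateTimeOffset[]) =
        if Array.isEmpty starts then TimeSpan.Zero else now - Array.min starts
    {|  timeFailing = oldest failing; countFailing = failing.Length
        timeStalled = oldest stalled; countStalled = stalled.Length |}
```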
Example Alerts
max timeFailing > 5m
as a reasonable default for the average projector that is writing to a rate-limited store
max timeStalled > 2m
for a watchdog that's responsible for cancelling and/or pumping workflows that have not reached a conclusion within 1m
🤔 - max timeRunning > 1h
for a workflow engine processing step sanity check
Pseudocode Logic
When a success happens:
- consecutive failures/failing since is cleared
- (if progress is made) consecutive stalled is cleared
When a failure happens:
- consecutive failures/failing since is either initialized to (1,now) or incremented
When a success with lack of progress happens:
- consecutive stalled is either initialized to (1,now) or incremented
When a dispatch or completion of a call happens:
- update the longest running task start time metric
- record the latency in the metrics immediately vs waiting to surface it every n ms?
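
A hedged sketch of the success/failure/stall transitions listed above, reusing the hypothetical `StreamHealth` type; `madeProgress` would be determined by comparing the stream's write position before and after the invocation:

```fsharp
open System

// outcome is Ok madeProgress for a successful invocation, Error exn for a failed one
let recordOutcome (now : DateTimeOffset) (health : StreamHealth) (outcome : Result<bool, exn>) =
    match outcome with
    | Ok true ->
        // success with progress: both the failing and the stalled runs end
        StreamHealth.Initial
    | Ok false ->
        // success without progress: the failing run ends; the stalled run starts or continues
        { health with
            consecutiveFailures = 0; failingSince = None
            consecutiveStalled = health.consecutiveStalled + 1
            stalledSince = health.stalledSince |> Option.orElse (Some now) }
    | Error _ ->
        // failure: the failing run starts or continues
        { health with
            consecutiveFailures = health.consecutiveFailures + 1
            failingSince = health.failingSince |> Option.orElse (Some now) }
```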
🤔 there should probably be a set of callbacks the projector provides that can be used to hook in metrics, but we also want the system to log summaries out of the box (see the sketch below)
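
One possible shape for such a hook (purely illustrative; `ISchedulerObserver` is not an existing Propulsion interface), with a default implementation that just logs summaries so that wiring up metrics remains optional:

```fsharp
open System

/// Hypothetical hook the Scheduler could raise, letting an app forward figures to its metrics system of choice
type ISchedulerObserver =
    /// Raised as each handler invocation completes
    abstract HandleCompleted : stream : string * latency : TimeSpan * succeeded : bool * madeProgress : bool -> unit
    /// Raised on every stats interval with the aggregated failing/stalled figures for a category
    abstract EmitHealth : category : string * timeFailing : TimeSpan * countFailing : int * timeStalled : TimeSpan * countStalled : int -> unit

/// Out-of-the-box behavior: log summaries only
let loggingObserver =
    { new ISchedulerObserver with
        member _.HandleCompleted (stream, latency, succeeded, _madeProgress) =
            if not succeeded then printfn "Handler for %s failed after %O" stream latency
        member _.EmitHealth (category, timeFailing, countFailing, timeStalled, countStalled) =
            printfn "%s: %d failing (oldest %O), %d stalled (oldest %O)"
                category countFailing timeFailing countStalled timeStalled }
```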
Other ideas/questions
- is being able to inject backoffs based on these metrics for a given specific stream important?
- how/would one want to be able to internally exit/restart the projector host app based on the values ?
- is there some more important intermittent failure pattern this will be useless for ?
- I'm excluding higher-level lag-based metrics, e.g. the sort of thing https://github.com/linkedin/Burrow does
tagging @ameier38 @belcher-rok @deviousasti @dunnry @enricosada @ragiano215 @swrhim @wantastic84 who have been party to discussions in this space (and may be able to extend or, hopefully, simplify the requirements above, or link to a better writeup or concept regarding this)