Provide for detection of Stuck Projectors in StreamsProjector #125

@bartelink

At present, if a handler continually fails to make progress on a given stream, the Scheduler will continually retry, resulting in:

  • running 'hot': there are no backoffs or other measures to ameliorate the impact of the failures
  • no direct way to determine, from a programmatic, alerting or dashboard perspective, that such a state has been entered. At present, only a number of secondary effects hint at the problem:
    • lack of progress when observing the read and/or checkpoint positions on the source, e.g. for the CFP or Feed readers
    • increase in exception outcomes on dashboards
    • reduction in successful outcomes on dashboards

In order to be able to define a clear alertable condition, it is proposed to maintain, on a per-stream basis:

  • number of consecutive failures, timestamp of first failure
  • number of consecutive successes without progress, timestamp of first attempt
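As a minimal sketch of what that per-stream state could look like (in Python purely for illustration; `Streak`, `StreamHealth` and `health` are hypothetical names, not part of the Scheduler):

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Streak:
    count: int    # number of consecutive occurrences
    since: float  # timestamp (seconds) of the first occurrence in the streak


@dataclass
class StreamHealth:
    # consecutive failures, timestamp of first failure (None when not failing)
    failing: Optional[Streak] = None
    # consecutive successes without progress, timestamp of first attempt (None when not stalled)
    stalled: Optional[Streak] = None


# Keyed by stream name (hypothetical); a stream with no active streak needs no entry
health: Dict[str, StreamHealth] = {}
```

Holding both the count and the timestamp of the first occurrence means a single record can answer both "how many times?" and "for how long?" without retaining any history.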

While the necessary data may be maintained at the stream level, it's problematic to surface it either as:

  • a log record per invocation - Handlers can receive extremely high traffic and adding this overhead as a fixed cost is not likely to work well
  • metrics tagged/instanced at the stream level - this will lead to excessive cardinality as it's unbounded, while likely making querying more complex (though more metrics gives more scope for alerting, it does not ease the task of determining which ones are relevant to someone coming to a set of metrics fresh)

Metrics

  • timeFailing: now - oldest failing since (tags: app,category)
  • countFailing: number of streams whose last outcome was a failure (tags: app,category)
  • timeStalled: now - oldest stalled (tags: app,category)
  • countStalled: number of streams whose last handler invocation did not make progress (and has messages waiting) (tags: app,category)

🤔 - longestRunning: oldest dispatched call in flight that has yet to yield a success/fail
🤔 - countRunning: number of handler invocations in flight
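To illustrate how the four main gauges stay bounded in cardinality, here is a rough sketch that folds per-stream state into one set of values per category. The `Snapshot` shape and the `summarize` function are assumptions made for this example; the `app` tag is assumed to be constant for the process and is omitted.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical per-stream snapshot: (category, failing_since, stalled_since),
# where each timestamp is None when the stream is not in that state
Snapshot = Tuple[str, Optional[float], Optional[float]]


def summarize(snapshots: Iterable[Snapshot], now: float) -> Dict[str, Dict[str, float]]:
    """Aggregate per-stream state into one set of gauges per category."""
    out: Dict[str, Dict[str, float]] = {}
    for category, failing_since, stalled_since in snapshots:
        m = out.setdefault(category, {
            "timeFailing": 0.0, "countFailing": 0,
            "timeStalled": 0.0, "countStalled": 0,
        })
        if failing_since is not None:
            m["countFailing"] += 1
            # now - oldest failing since
            m["timeFailing"] = max(m["timeFailing"], now - failing_since)
        if stalled_since is not None:
            m["countStalled"] += 1
            # now - oldest stalled
            m["timeStalled"] = max(m["timeStalled"], now - stalled_since)
    return out
```

The number of emitted series then depends only on the number of categories, not on the number of streams.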

Example Alerts

  • max timeFailing > 5m as a reasonable default for the average projector that is writing to a rate-limited store
  • max timeStalled > 2m for a watchdog that's responsible for cancelling and/or pumping workflows that have not reached a conclusion within 1m

🤔 - max timeRunning > 1h for a workflow engine processing step sanity check
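In practice these rules would live in whatever alerting system consumes the metrics; the following is only a sketch of the comparison being proposed, using the two thresholds from the examples above. `THRESHOLDS_S` and `firing_alerts` are hypothetical names.

```python
from typing import Dict, List

# Thresholds in seconds, taken from the example alerts above
THRESHOLDS_S: Dict[str, float] = {
    "timeFailing": 5 * 60,  # max timeFailing > 5m
    "timeStalled": 2 * 60,  # max timeStalled > 2m
}


def firing_alerts(samples: Dict[str, List[float]]) -> List[str]:
    """Return the names of metrics whose maximum sampled value exceeds its threshold.

    `samples` maps a metric name to the values (in seconds) observed across
    instances/categories in the evaluation window.
    """
    return sorted(
        name
        for name, limit in THRESHOLDS_S.items()
        if max(samples.get(name, []), default=0.0) > limit
    )
```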

Pseudocode Logic

When a success happens:

  • consecutive failures/failing since is cleared
  • (if progress is made) consecutive stalled is cleared

When a failure happens:

  • consecutive failures/failing since is either initialized to (1,now) or incremented

When a success with lack of progress happens:

  • consecutive stalled is either initialized to (1,now) or incremented

When a dispatch or completion of a call happens:

  • update the longest running task start time metric
  • record the latency in the metrics immediately vs waiting to surface it every n ms?
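The transitions above can be sketched as follows. This is an illustrative Python rendering, not the Scheduler's implementation; `HealthTracker`, `StreamState` and `bump` are hypothetical names. Two points the pseudocode leaves open are handled by assumption and flagged in comments: a failure leaves any stalled streak untouched, and in-flight calls are tracked at most one per stream.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Streak = Tuple[int, float]  # (consecutive count, timestamp of first occurrence)


def bump(streak: Optional[Streak], now: float) -> Streak:
    """Initialise to (1, now), or increment the count while keeping the original timestamp."""
    return (1, now) if streak is None else (streak[0] + 1, streak[1])


@dataclass
class StreamState:
    failing: Optional[Streak] = None
    stalled: Optional[Streak] = None


class HealthTracker:
    def __init__(self) -> None:
        self.streams: Dict[str, StreamState] = {}
        # Assumption: at most one handler invocation in flight per stream
        self.in_flight: Dict[str, float] = {}  # stream -> dispatch time

    def on_dispatch(self, stream: str, now: float) -> None:
        self.in_flight[stream] = now

    def on_success(self, stream: str, now: float, progressed: bool) -> None:
        self.in_flight.pop(stream, None)
        s = self.streams.setdefault(stream, StreamState())
        s.failing = None  # any success clears the failure streak
        # progress clears the stalled streak; lack of progress starts or extends it
        s.stalled = None if progressed else bump(s.stalled, now)

    def on_failure(self, stream: str, now: float) -> None:
        self.in_flight.pop(stream, None)
        s = self.streams.setdefault(stream, StreamState())
        s.failing = bump(s.failing, now)
        # Assumption: a failure leaves any existing stalled streak as-is

    def longest_running(self, now: float) -> float:
        """Age of the oldest dispatched call that has yet to complete (0 if none)."""
        return max((now - t for t in self.in_flight.values()), default=0.0)
```

For example, two failures at t=1 and t=2 leave the stream at `failing == (2, 1.0)`; a subsequent success without progress at t=3 clears that and sets `stalled == (1, 3.0)`.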

🤔 there should probably be a set of callbacks the projector provides that can be used to hook in metrics, but we also want the system to log summaries out of the box

Other ideas/questions

  • is being able to inject backoffs for a specific stream based on these metrics important?
  • would one want to be able to internally exit/restart the projector host app based on the values, and if so, how?
  • is there some more important intermittent failure pattern that this will be useless for?
  • I'm excluding higher level lag-based metrics, e.g. the kind of thing https://github.com/linkedin/Burrow does

tagging @ameier38 @belcher-rok @deviousasti @dunnry @enricosada @ragiano215 @swrhim @wantastic84, who have been party to discussions in this space (and may be able to extend or, hopefully, simplify the requirements above, or link to a better writeup or concept regarding this)
