
Ruler is slow to sync rules when rule groups are reshuffled and rule evaluation latency is high #2280


Description

@pracucci

I found out that the ruler can be very slow to sync rules if rule groups get reshuffled between replicas and those rule groups are very slow to evaluate.

Timeline

The following timeline shows what triggered the investigation:

  1. We disabled shuffle sharding on the read path, causing rule evaluations to use more CPU and become much slower (the average rule evaluation duration increased from 2ms to 20s).
  2. To be able to evaluate all rules within the evaluation interval, we scaled out rulers (more replicas). Scaling out rulers shards tenants and/or rule groups across more rulers, reducing the number of tenants and/or rule groups per ruler.
  3. Right after the scale-out, we expected to see a sudden drop in the number of rule groups evaluated on the old rulers, but that didn't happen: the actual number of evaluated rule groups decreased very slowly.

Investigation

The investigation took several hours, so here I'm reporting just some key information.

We can see that rules syncing was very slow by querying cortex_ruler_sync_rules_total on a specific ruler pod. We expect rules syncing to happen at least every -ruler.poll-interval=1m, but it didn't happen for about 45 minutes:

[Screenshot: cortex_ruler_sync_rules_total on the affected ruler pod, showing no sync for about 45 minutes]

Rules syncing involves 4 main operations, executed in sequence (a simplified sketch follows the list):

  1. List rule groups in the object storage
  2. Get rule groups from the object storage
  3. Sync rules to per-tenant ruler manager
  4. Stop per-tenant ruler managers for tenants not owned anymore
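To make the sequencing explicit, here is a rough, illustrative Go sketch of this flow. The types and helper names below are stand-ins chosen for readability, not the exact Mimir implementation:

// Illustrative sketch of the sync flow described above; the types and helper
// names are stand-ins, not the exact Mimir implementation.
package sketch

import "context"

type ruleGroup struct{}        // placeholder for a rule group descriptor
type Ruler struct{ /* ... */ } // placeholder for the Mimir ruler

// Step 1: list the rule groups (per tenant) owned by this ruler.
func (r *Ruler) listOwnedRuleGroups(ctx context.Context) (map[string][]ruleGroup, error) {
	return nil, nil
}

// Step 2: fetch the rule group contents from the object storage.
func (r *Ruler) loadRuleGroups(ctx context.Context, in map[string][]ruleGroup) (map[string][]ruleGroup, error) {
	return in, nil
}

// Step 3: update the per-tenant Prometheus rules manager.
func (r *Ruler) syncRulesToManager(ctx context.Context, userID string, groups []ruleGroup) {}

// Step 4: stop managers of tenants that are not owned anymore.
func (r *Ruler) removeUsersIfNecessary(owned map[string][]ruleGroup) {}

// syncRules runs the 4 operations in sequence; note that step 3 iterates tenants one by one.
func (r *Ruler) syncRules(ctx context.Context) {
	owned, err := r.listOwnedRuleGroups(ctx)
	if err != nil {
		return
	}
	loaded, err := r.loadRuleGroups(ctx, owned)
	if err != nil {
		return
	}
	for userID, groups := range loaded {
		r.syncRulesToManager(ctx, userID, groups)
	}
	r.removeUsersIfNecessary(loaded)
}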

Using a mix of metrics and logs, I identified that the slowdown was happening while calling syncRulesToManager() (step 3 in the list above), and in particular in the call to manager.Update():

err = manager.Update(r.cfg.EvaluationInterval, files, nil, r.cfg.ExternalURL.String(), nil)
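As an aside, the slow call can be made obvious in the logs with simple timing instrumentation around the update. This fragment is illustrative only (it reuses the identifiers from the surrounding syncRulesToManager code, and r.logger is an assumption, not necessarily the actual field name):

// Illustrative only: time the update inside syncRulesToManager to confirm it is the slow step.
start := time.Now()
err = manager.Update(r.cfg.EvaluationInterval, files, nil, r.cfg.ExternalURL.String(), nil)
level.Info(r.logger).Log("msg", "updated per-tenant rules manager", "user", userID, "duration", time.Since(start), "err", err)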

Root cause hypothesis

When we scale out rulers, tenants (due to shuffle sharding) and rule groups (due to ruler sharding) are reshuffled between replicas (from old replicas to new ones). When a tenant is not removed from the old ruler replica but one of its rule groups is reshuffled to a new ruler replica, the old ruler stops that rule group and waits until its in-flight rule evaluation has completed:

// Stop remaining old groups.
wg.Add(len(m.groups))
for n, oldg := range m.groups {
	go func(n string, g *Group) {
		g.markStale = true
		g.stop()
		if m := g.metrics; m != nil {
			m.IterationsMissed.DeleteLabelValues(n)
			m.IterationsScheduled.DeleteLabelValues(n)
			m.EvalTotal.DeleteLabelValues(n)
			m.EvalFailures.DeleteLabelValues(n)
			m.GroupInterval.DeleteLabelValues(n)
			m.GroupLastEvalTime.DeleteLabelValues(n)
			m.GroupLastDuration.DeleteLabelValues(n)
			m.GroupRules.DeleteLabelValues(n)
			m.GroupSamples.DeleteLabelValues(n)
		}
		wg.Done()
	}(n, oldg)
}
wg.Wait()
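The wait comes from g.stop(): in the Prometheus rules manager, stopping a group signals its run loop and then blocks until that loop has returned, which includes any evaluation currently in flight. Roughly (a simplified sketch of the upstream pattern, not verbatim upstream code):

// Simplified sketch of the Prometheus rule group stop pattern (not verbatim upstream code).
type Group struct {
	done       chan struct{} // closed to ask the run loop to exit
	terminated chan struct{} // closed by the run loop once it has fully returned
}

func (g *Group) stop() {
	close(g.done)  // signal the run loop to stop
	<-g.terminated // block until the run loop, including any in-flight evaluation, has finished
}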

If rule evaluation is very slow (e.g. 20s on average) and this happens for several tenants, the sync may take a long time to complete, because tenants are synced sequentially:

mimir/pkg/ruler/manager.go (lines 110 to 112 in 37376ee):

for userID, ruleGroup := range ruleGroups {
	r.syncRulesToManager(ctx, userID, ruleGroup)
}
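As a back-of-the-envelope example: assuming around 100 affected tenants, each with an in-flight evaluation of roughly 20s pending when its old groups are stopped, the sequential loop alone would spend on the order of 100 × 20s ≈ 33 minutes inside manager.Update(), which is in the same ballpark as the ~45 minute sync observed above.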

The following screenshot shows the progress of synced tenants, obtained by querying sum(cortex_ruler_config_updates_total) (the sync started at 09:30 and ended at 10:14):
[Screenshot: sum(cortex_ruler_config_updates_total), showing tenants being synced gradually between 09:30 and 10:14]
