
Ruler is slow to sync rules when rule groups are reshuffled and rule evaluation latency is high #2280


Description

@pracucci

I found out that the ruler can be very slow to sync rules if rule groups get reshuffled between replicas and those rule groups are very slow to evaluate.

Timeline

The following timeline shows what triggered the investigation:

  1. We disabled shuffle sharding on the read path, causing rule evaluations to use more CPU and become much slower (the average rule evaluation duration increased from 2ms to 20s).
  2. To be able to evaluate all rules within the evaluation interval, we scaled out rulers (more replicas). Scaling out rulers shards tenants and/or rule groups across more rulers, reducing the number of tenants and/or rule groups per ruler.
  3. Right after the scale-out, we expected to see a sudden drop in the number of rule groups evaluated on the old rulers, but that didn't happen: the actual number of evaluated rule groups decreased very slowly.

Investigation

The investigation took several hours, so here I'm reporting just some key information.

We can see that rules syncing was very slow by querying cortex_ruler_sync_rules_total on a specific ruler pod. We expect rules syncing to happen at least every -ruler.poll-interval=1m, but it didn't happen for about 45 minutes:

[Screenshot: cortex_ruler_sync_rules_total on the affected ruler pod, showing no sync for about 45 minutes]

Rules syncing involves 4 main operations, executed in sequence (a simplified sketch follows the list):

  1. List rule groups in the object storage
  2. Get rule groups from the object storage
  3. Sync rules to per-tenant ruler manager
  4. Stop per-tenant ruler managers for tenants not owned anymore
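To make the sequencing explicit, here is a rough, illustrative Go sketch of this flow. The types and helper names below are stand-ins chosen for readability, not the exact Mimir implementation:

// Illustrative sketch of the sync flow described above; the types and helper
// names are stand-ins, not the exact Mimir implementation.
package sketch

import "context"

type ruleGroup struct{}        // placeholder for a rule group descriptor
type Ruler struct{ /* ... */ } // placeholder for the Mimir ruler

// Step 1: list the rule groups (per tenant) owned by this ruler.
func (r *Ruler) listOwnedRuleGroups(ctx context.Context) (map[string][]ruleGroup, error) {
	return nil, nil
}

// Step 2: fetch the rule group contents from the object storage.
func (r *Ruler) loadRuleGroups(ctx context.Context, in map[string][]ruleGroup) (map[string][]ruleGroup, error) {
	return in, nil
}

// Step 3: update the per-tenant Prometheus rules manager.
func (r *Ruler) syncRulesToManager(ctx context.Context, userID string, groups []ruleGroup) {}

// Step 4: stop managers of tenants that are not owned anymore.
func (r *Ruler) removeUsersIfNecessary(owned map[string][]ruleGroup) {}

// syncRules runs the 4 operations in sequence; note that step 3 iterates tenants one by one.
func (r *Ruler) syncRules(ctx context.Context) {
	owned, err := r.listOwnedRuleGroups(ctx)
	if err != nil {
		return
	}
	loaded, err := r.loadRuleGroups(ctx, owned)
	if err != nil {
		return
	}
	for userID, groups := range loaded {
		r.syncRulesToManager(ctx, userID, groups)
	}
	r.removeUsersIfNecessary(loaded)
}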

Using a mix of metrics and logs, I identified that the slowdown was happening while calling syncRulesToManager() (step 3 in the list above), and in particular in the call to manager.Update():

err = manager.Update(r.cfg.EvaluationInterval, files, nil, r.cfg.ExternalURL.String(), nil)
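As an aside, the slow call can be made obvious in the logs with simple timing instrumentation around the update. This fragment is illustrative only (it reuses the identifiers from the surrounding syncRulesToManager code, and r.logger is an assumption, not necessarily the actual field name):

// Illustrative only: time the update inside syncRulesToManager to confirm it is the slow step.
start := time.Now()
err = manager.Update(r.cfg.EvaluationInterval, files, nil, r.cfg.ExternalURL.String(), nil)
level.Info(r.logger).Log("msg", "updated per-tenant rules manager", "user", userID, "duration", time.Since(start), "err", err)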

Root cause hypothesis

When we scale out rulers, tenants (due to shuffle sharding) and rule groups (due to ruler sharding) are reshuffled between replicas (from old replicas to new ones). When a tenant is not removed from the old ruler replica but one of its rule groups is reshuffled to a new ruler replica, the old ruler stops that rule group and waits until its in-flight rule evaluation has completed:

// Stop remaining old groups.
wg.Add(len(m.groups))
for n, oldg := range m.groups {
	go func(n string, g *Group) {
		g.markStale = true
		g.stop()
		if m := g.metrics; m != nil {
			m.IterationsMissed.DeleteLabelValues(n)
			m.IterationsScheduled.DeleteLabelValues(n)
			m.EvalTotal.DeleteLabelValues(n)
			m.EvalFailures.DeleteLabelValues(n)
			m.GroupInterval.DeleteLabelValues(n)
			m.GroupLastEvalTime.DeleteLabelValues(n)
			m.GroupLastDuration.DeleteLabelValues(n)
			m.GroupRules.DeleteLabelValues(n)
			m.GroupSamples.DeleteLabelValues(n)
		}
		wg.Done()
	}(n, oldg)
}
wg.Wait()
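The wait comes from g.stop(): in the Prometheus rules manager, stopping a group signals its run loop and then blocks until that loop has returned, which includes any evaluation currently in flight. Roughly (a simplified sketch of the upstream pattern, not verbatim upstream code):

// Simplified sketch of the Prometheus rule group stop pattern (not verbatim upstream code).
type Group struct {
	done       chan struct{} // closed to ask the run loop to exit
	terminated chan struct{} // closed by the run loop once it has fully returned
}

func (g *Group) stop() {
	close(g.done)  // signal the run loop to stop
	<-g.terminated // block until the run loop, including any in-flight evaluation, has finished
}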

If rule evaluation is very slow (e.g. 20s on average) and this happens for several tenants, the sync may take a long time to complete, because tenants are synced sequentially:

mimir/pkg/ruler/manager.go (lines 110 to 112 in 37376ee):

for userID, ruleGroup := range ruleGroups {
	r.syncRulesToManager(ctx, userID, ruleGroup)
}
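As a back-of-the-envelope example: assuming around 100 affected tenants, each with an in-flight evaluation of roughly 20s pending when its old groups are stopped, the sequential loop alone would spend on the order of 100 × 20s ≈ 33 minutes inside manager.Update(), which is in the same ballpark as the ~45 minute sync observed above.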

The following screenshot shows the progress of synced tenants, obtained by querying sum(cortex_ruler_config_updates_total) (the sync started at 09:30 and ended at 10:14):
[Screenshot: sum(cortex_ruler_config_updates_total), showing tenants being synced gradually between 09:30 and 10:14]
