Description
This issue describes a problem we experienced yesterday on some store-gateways in a Mimir cluster.
Scenario:
- Mimir is configured to run the ring on Consul (but the issue could happen with any ring backend, memberlist included)
- The store-gateway is overloaded while lazy-loading the index-headers of a large number of big blocks
- The store-gateway fails to heartbeat the ring for X consecutive minutes, where X > -store-gateway.sharding-ring.heartbeat-timeout
- At the 1st next sync, the store-gateway detects itself as unhealthy and drops all its blocks
- At the 2nd next sync, the store-gateway detects itself as healthy and re-adds all its blocks, causing additional load on the store-gateway itself
Zooming in on a specific store-gateway
The issue described above happened to several store-gateways in a large cluster.
To better understand the behaviour, I'm looking at a specific one: store-gateway-zone-b-3.
The sequence of related block syncs is:
level=info ts=2022-05-02T09:06:19.395846082Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:07:03.365919167Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
# In this sync all blocks have been dropped.
level=info ts=2022-05-02T09:07:03.3659588Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:08:55.148949647Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
# In this sync all blocks have been re-added. The sync took 37 minutes.
level=info ts=2022-05-02T09:08:55.148992704Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.826814806Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.826872774Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.853222892Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
Querying the metric cortex_consul_request_duration_seconds_count{namespace="REDACTED",pod="store-gateway-zone-b-3",operation="CAS",status_code="200"}
shows that the number of successful CAS operations did not increase between 09:03:15 and 09:08:30. Whenever scraping succeeded (some scrapes failed because the store-gateway was overloaded), the metric value was always 2369 during that timeframe.
The store-gateway is running with -store-gateway.sharding-ring.heartbeat-timeout=4m. The last successful heartbeat was before 09:03:00, so by 09:07:00 its ring entry had expired and the instance was unhealthy in the ring. The next sync started at 09:07:03, once the store-gateway was already unhealthy in the ring.
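To make the timing concrete, here's a minimal sketch of the heartbeat-timeout arithmetic (illustrative code, not Mimir's actual implementation; the variable names are mine):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps taken from the incident above.
	lastHeartbeat := time.Date(2022, 5, 2, 9, 3, 0, 0, time.UTC) // last successful CAS before the stall
	heartbeatTimeout := 4 * time.Minute                          // -store-gateway.sharding-ring.heartbeat-timeout=4m
	syncStart := time.Date(2022, 5, 2, 9, 7, 3, 0, time.UTC)     // the sync that dropped all blocks

	// A ring entry is considered unhealthy once its last heartbeat is older
	// than the configured timeout.
	unhealthy := syncStart.Sub(lastHeartbeat) > heartbeatTimeout
	fmt.Println(unhealthy) // true: 4m3s > 4m
}
```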
The store-gateway has a protection to keep the previously loaded blocks in case it's unable to look up the ring. However, in this case the ring lookup succeeded and the instance detected itself as unhealthy:
mimir/pkg/storegateway/sharding_strategy.go
Lines 98 to 110 in fb39490
```go
if err != nil {
	if _, ok := loaded[blockID]; ok {
		level.Warn(logger).Log("msg", "failed to check block owner but block is kept because was previously loaded", "block", blockID.String(), "err", err)
	} else {
		level.Warn(logger).Log("msg", "failed to check block owner and block has been excluded because was not previously loaded", "block", blockID.String(), "err", err)

		// Skip the block.
		synced.WithLabelValues(shardExcludedMeta).Inc()
		delete(metas, blockID)
	}

	continue
}
```
The store-gateway also has a protection to keep the previously loaded blocks if there's no other authoritative owner for that block in the ring, but that wasn't the case either, because there were other owners:
mimir/pkg/storegateway/sharding_strategy.go
Lines 117 to 132 in fb39490
```go
// The block is not owned by the store-gateway. However, if it's currently loaded
// we can safely unload it only once at least 1 authoritative owner is available
// for queries.
if _, ok := loaded[blockID]; ok {
	// The ring Get() returns an error if there's no available instance.
	if _, err := r.Get(key, BlocksOwnerRead, bufDescs, bufHosts, bufZones); err != nil {
		// Keep the block.
		continue
	}
}

// The block is not owned by the store-gateway and there's at least 1 available
// authoritative owner available for queries, so we can filter it out (and unload
// it if it was loaded).
synced.WithLabelValues(shardExcludedMeta).Inc()
delete(metas, blockID)
```
So all the blocks have been unloaded, and then (at the 2nd next sync) progressively loaded back.
Proposal
I propose to add a simple additional protection to the store-gateway: don't drop any loaded block if the store-gateway itself is unhealthy in the ring.
The idea is that if the store-gateway detects itself as unhealthy in the ring, it shouldn't do anything during the periodic sync, neither add nor drop blocks, because it's in an inconsistent situation: the store-gateway is obviously running, but it's detected as unhealthy in the ring.
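To illustrate, here's a minimal sketch of how such a guard could look. The function and parameter names are hypothetical and the types are heavily simplified; this is not Mimir's actual API:

```go
package storegateway

// filterOwnedBlocks sketches the proposed protection. It returns the set of
// block IDs the store-gateway should keep loaded after a sync.
func filterOwnedBlocks(
	loaded map[string]struct{}, // blocks currently loaded
	owned map[string]struct{}, // blocks the ring says this instance owns
	selfHealthy bool, // whether this instance's own ring entry is healthy
) map[string]struct{} {
	if !selfHealthy {
		// Inconsistent situation: the process is obviously running, but the
		// ring reports it as unhealthy. Make this sync a no-op: neither add
		// nor drop blocks.
		return loaded
	}

	// Otherwise apply the normal ownership filtering (as in
	// sharding_strategy.go above).
	return owned
}
```

With such a guard, the 09:07:03 sync above would have been a no-op, and the expensive 37-minute re-load in the following sync would have been avoided.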