
Store-gateway drops all blocks if it fails to heartbeat the ring #1805

@pracucci

Description

This issue describes a problem we experienced yesterday on some store-gateways in a Mimir cluster.

Scenario:

  • Mimir configured to run ring on Consul (but the issue could happen with any ring backend, memberlist included)
  • Store-gateway is overloaded while lazy-loading a large number of big blocks' index-headers
  • Store-gateway fails to heartbeat the ring for X consecutive minutes, where X is greater than -store-gateway.sharding-ring.heartbeat-timeout
  • At the 1st next sync, the store-gateway detects itself as unhealthy and drops all its blocks
  • At the 2nd next sync, the store-gateway detects itself as healthy and re-adds all its blocks, causing additional load to the store-gateway itself

Zooming into a specific store-gateway

The issue described above happened to several store-gateways in a large cluster.
To better understand the behaviour, I'm looking at a specific one: store-gateway-zone-b-3.

The sequence of related block syncs is:

level=info ts=2022-05-02T09:06:19.395846082Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:07:03.365919167Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change

# In this sync all blocks have been dropped.
level=info ts=2022-05-02T09:07:03.3659588Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:08:55.148949647Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change

# In this sync all blocks have been re-added. The sync took 37 minutes.
level=info ts=2022-05-02T09:08:55.148992704Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.826814806Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change

level=info ts=2022-05-02T09:45:34.826872774Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.853222892Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change

Querying the metric cortex_consul_request_duration_seconds_count{namespace="REDACTED",pod="store-gateway-zone-b-3",operation="CAS",status_code="200"}, we can see that the number of successful CAS operations did not increase between 09:03:15 and 09:08:30. The metric value (when scraping succeeded; some scrapes failed because the store-gateway was overloaded) was constantly 2369 during that timeframe:

[Screenshot 2022-05-03 at 10:17:04: the metric value stays flat during the incident]

The store-gateway is running with -store-gateway.sharding-ring.heartbeat-timeout=4m, and its last successful heartbeat was before 09:03:00, so by 09:07:00 the heartbeat had expired and the instance was unhealthy in the ring. The next sync started at 09:07:03, once the store-gateway was already unhealthy in the ring.

The store-gateway has a protection to keep the previously loaded blocks in case it's unable to look up the ring. However, in this case it did successfully look up the ring, but the self instance was detected as unhealthy:

if err != nil {
    if _, ok := loaded[blockID]; ok {
        level.Warn(logger).Log("msg", "failed to check block owner but block is kept because was previously loaded", "block", blockID.String(), "err", err)
    } else {
        level.Warn(logger).Log("msg", "failed to check block owner and block has been excluded because was not previously loaded", "block", blockID.String(), "err", err)

        // Skip the block.
        synced.WithLabelValues(shardExcludedMeta).Inc()
        delete(metas, blockID)
    }

    continue
}

The store-gateway also has a protection to keep the previously loaded blocks if there's no other authoritative owner for that block in the ring, but that wasn't the case either, because there were other owners:

// The block is not owned by the store-gateway. However, if it's currently loaded
// we can safely unload it only once at least 1 authoritative owner is available
// for queries.
if _, ok := loaded[blockID]; ok {
    // The ring Get() returns an error if there's no available instance.
    if _, err := r.Get(key, BlocksOwnerRead, bufDescs, bufHosts, bufZones); err != nil {
        // Keep the block.
        continue
    }
}

// The block is not owned by the store-gateway and there's at least 1 available
// authoritative owner available for queries, so we can filter it out (and unload
// it if it was loaded).
synced.WithLabelValues(shardExcludedMeta).Inc()
delete(metas, blockID)

So all the blocks were unloaded, and then (at the 2nd next sync) progressively loaded back.

Proposal

I propose adding a simple additional protection to the store-gateway: don't drop any loaded block while the store-gateway itself is unhealthy in the ring.

The idea is that if the store-gateway detects itself as unhealthy in the ring, it shouldn't do anything during the periodic sync, neither adding nor dropping blocks, because it's in an inconsistent situation: the store-gateway is obviously running, but it is detected as unhealthy in the ring.
