Description
This issue describes a problem we experienced yesterday on some store-gateways in a Mimir cluster.
Scenario:
- Mimir is configured to run the ring on Consul (but the issue could happen with any ring backend, memberlist included)
- The store-gateway is overloaded while lazy-loading the index-headers of a large number of big blocks
- The store-gateway fails to heartbeat the ring for X consecutive minutes, where X > -store-gateway.sharding-ring.heartbeat-timeout
- At the 1st next sync, the store-gateway detects itself as unhealthy and drops all its blocks
- At the 2nd next sync, the store-gateway detects itself as healthy and re-adds all its blocks, causing additional load on the store-gateway itself
Zooming in on a specific store-gateway
The issue described above happened to several store-gateways in a large cluster.
To better understand the behaviour, I'm looking at a specific one: store-gateway-zone-b-3.
The sequence of related block syncs is:
level=info ts=2022-05-02T09:06:19.395846082Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:07:03.365919167Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
# In this sync all blocks have been dropped.
level=info ts=2022-05-02T09:07:03.3659588Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:08:55.148949647Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
# In this sync all blocks have been re-added. The sync took 37 minutes.
level=info ts=2022-05-02T09:08:55.148992704Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.826814806Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.826872774Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.853222892Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change
Querying the metric cortex_consul_request_duration_seconds_count{namespace="REDACTED",pod="store-gateway-zone-b-3",operation="CAS",status_code="200"}
shows that the number of successful CAS operations did not increase between 09:03:15 and 09:08:30. Whenever scraping succeeded (some scrapes failed because the store-gateway was overloaded), the metric value was always 2369 during that timeframe.
The store-gateway is running with -store-gateway.sharding-ring.heartbeat-timeout=4m. The last successful heartbeat was before 09:03:00, so by 09:07:00 its ring entry had expired and the instance was unhealthy in the ring. The next sync started at 09:07:03, once the store-gateway was already unhealthy in the ring.
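To make the timing concrete, here's a minimal sketch of the heartbeat-timeout arithmetic (illustrative code, not Mimir's actual implementation; the variable names are mine):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps taken from the incident above.
	lastHeartbeat := time.Date(2022, 5, 2, 9, 3, 0, 0, time.UTC) // last successful CAS before the stall
	heartbeatTimeout := 4 * time.Minute                          // -store-gateway.sharding-ring.heartbeat-timeout=4m
	syncStart := time.Date(2022, 5, 2, 9, 7, 3, 0, time.UTC)     // the sync that dropped all blocks

	// A ring entry is considered unhealthy once its last heartbeat is older
	// than the configured timeout.
	unhealthy := syncStart.Sub(lastHeartbeat) > heartbeatTimeout
	fmt.Println(unhealthy) // true: 4m3s > 4m
}
```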
The store-gateway has a protection to keep the previously loaded blocks in case it's unable to look up the ring. However, in this case the ring lookup succeeded and the instance detected itself as unhealthy:
mimir/pkg/storegateway/sharding_strategy.go
Lines 98 to 110 in fb39490
```go
if err != nil {
	if _, ok := loaded[blockID]; ok {
		level.Warn(logger).Log("msg", "failed to check block owner but block is kept because was previously loaded", "block", blockID.String(), "err", err)
	} else {
		level.Warn(logger).Log("msg", "failed to check block owner and block has been excluded because was not previously loaded", "block", blockID.String(), "err", err)

		// Skip the block.
		synced.WithLabelValues(shardExcludedMeta).Inc()
		delete(metas, blockID)
	}

	continue
}
```
The store-gateway also has a protection to keep the previously loaded blocks if there's no other authoritative owner for that block in the ring, but that wasn't the case either, because there were other owners:
mimir/pkg/storegateway/sharding_strategy.go
Lines 117 to 132 in fb39490
```go
// The block is not owned by the store-gateway. However, if it's currently loaded
// we can safely unload it only once at least 1 authoritative owner is available
// for queries.
if _, ok := loaded[blockID]; ok {
	// The ring Get() returns an error if there's no available instance.
	if _, err := r.Get(key, BlocksOwnerRead, bufDescs, bufHosts, bufZones); err != nil {
		// Keep the block.
		continue
	}
}

// The block is not owned by the store-gateway and there's at least 1 available
// authoritative owner available for queries, so we can filter it out (and unload
// it if it was loaded).
synced.WithLabelValues(shardExcludedMeta).Inc()
delete(metas, blockID)
```
So all the blocks have been unloaded, and then (at the 2nd next sync) progressively loaded back.
Proposal
I propose to add a simple additional protection to the store-gateway: don't drop any loaded block if the store-gateway itself is unhealthy in the ring.
The idea is that if the store-gateway detects itself as unhealthy in the ring, it shouldn't do anything during the periodic sync, neither add nor drop blocks, because it's in an inconsistent situation: the store-gateway is obviously running, but it's detected as unhealthy in the ring.
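To illustrate, here's a minimal sketch of how such a guard could look. The function and parameter names are hypothetical and the types are heavily simplified; this is not Mimir's actual API:

```go
package storegateway

// filterOwnedBlocks sketches the proposed protection. It returns the set of
// block IDs the store-gateway should keep loaded after a sync.
func filterOwnedBlocks(
	loaded map[string]struct{}, // blocks currently loaded
	owned map[string]struct{}, // blocks the ring says this instance owns
	selfHealthy bool, // whether this instance's own ring entry is healthy
) map[string]struct{} {
	if !selfHealthy {
		// Inconsistent situation: the process is obviously running, but the
		// ring reports it as unhealthy. Make this sync a no-op: neither add
		// nor drop blocks.
		return loaded
	}

	// Otherwise apply the normal ownership filtering (as in
	// sharding_strategy.go above).
	return owned
}
```

With such a guard, the 09:07:03 sync above would have been a no-op, and the expensive 37-minute re-load in the following sync would have been avoided.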