Unregister ingesters on shutdown unless an update is being rolled out #5901

@liam-howe-maersk

Description

Is your feature request related to a problem? Please describe.

We run our Mimir stack on AKS spot nodes across 3 availability zones, and so far this has been an acceptable solution for us despite the risk of node evictions and pods being shuffled between nodes. Part of the reason this has worked is the setting -ingester.unregister-on-shutdown=true: when an ingester is shut down because it is being moved to another node, it unregisters itself from the ring, and there is no downtime because its series are sent to the other ingesters in the zone.

However, we have observed that this has a negative effect during rollouts of updates to a zone, similar to what is discussed in grafana/helm-charts#1313. During a rollout, a zone sees its number of in-memory series double and ingests lots of out-of-order samples, both of which require more CPU and make our system unstable and susceptible to downtime during ingester rollouts. This is becoming more prevalent as our cluster grows, and we only expect to grow further: right now we have about 360,000,000 in-memory series across 3 zones and 330 ingesters, and this can double during rollouts.
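For reference, this is roughly how the flag is deployed today — a minimal sketch assuming a typical per-zone Kubernetes StatefulSet; the names, labels, and image tag are illustrative, only the Mimir flags themselves are real:

```yaml
# Illustrative excerpt of a zone-a ingester StatefulSet (names/values are assumptions).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ingester-zone-a
spec:
  template:
    spec:
      containers:
        - name: ingester
          image: grafana/mimir:latest
          args:
            - -target=ingester
            - -ingester.ring.zone-awareness-enabled=true
            - -ingester.ring.instance-availability-zone=zone-a
            # Leaving the ring on shutdown avoids downtime on spot-node evictions,
            # but causes the in-memory series doubling described above during rollouts.
            - -ingester.unregister-on-shutdown=true
```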

Describe the solution you'd like

I believe our approach could work if we were able to tell ingesters not to unregister themselves while an update is being rolled out to their zone, while ingesters in other zones continue to unregister themselves if they are shut down for any reason. We use the rollout-operator to handle rollouts across our 3 zones, so we expect only 1 zone to be rolled out at a time.

Right now, I believe that updating the config -ingester.unregister-on-shutdown to false would itself require rolling out an update to the ingesters, which would again cause the problems we are already seeing. Some solution whereby the ingesters' configuration is updated so that they do not unregister themselves before the rollout-operator kills the pod should work, if this is possible.

Describe alternatives you've considered

If there were some way to update this unregister config on the fly, we could script something to change the setting before we merge an update to our ingesters. However, it feels like having this logic as part of the rollout-operator would be the smoothest option, in which case it may make more sense to move this issue to that repository.
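As a sketch of what that scripting could look like — assuming a hypothetical ingester admin endpoint for toggling the unregister behaviour at runtime (no such endpoint exists today; the path, verbs, port, and labels below are invented for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical pre-rollout script: disable unregister-on-shutdown on every
# ingester in the zone about to be rolled out, then re-enable it afterwards.
# The /ingester/unregister-on-shutdown endpoint is an assumption, not a real API.
set -euo pipefail

ZONE="zone-a"
ACTION="${1:-disable}"   # "disable" before the rollout, "enable" after it

for pod in $(kubectl get pods -l "rollout-group=ingester,zone=${ZONE}" -o name); do
  if [ "${ACTION}" = "disable" ]; then
    kubectl exec "${pod}" -- curl -s -X DELETE http://localhost:8080/ingester/unregister-on-shutdown
  else
    kubectl exec "${pod}" -- curl -s -X PUT http://localhost:8080/ingester/unregister-on-shutdown
  fi
done
```

The same toggle driven by the rollout-operator itself, rather than an external script, would avoid the race between merging the update and flipping the setting.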
