Add proposal document for ingester unregister flexibility #7231
Conversation
Thanks for taking the time to write this @LasseHels! It makes sense to me in terms of allowing to trade off query consistency for availability. Mimir wasn't built with this use case in mind, so this solution seems a bit counter-intuitive. But if we make this clear in the documentation for anyone who uses it, it is an acceptable solution. I've also asked other maintainers to take a look.
Thank you very much for writing the summary of the previous discussion in this short and digestible proposal. The suggested change looks tiny, and I would be curious to learn how it works in practice.
Have you tried to implement it and test whether it works in your situation? If it works, I have no objections to adding such an endpoint to Mimir.
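For illustration only, here is a hypothetical Go sketch of what such a toggle endpoint could look like. The path, HTTP methods, response format, and variable names are assumptions made for this sketch; the actual API is defined by the proposal document, not by this snippet.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// unregisterOnShutdown mirrors the -ingester.ring.unregister-on-shutdown
// setting; a real ingester's shutdown path would consult it before deciding
// whether to leave the ring.
var unregisterOnShutdown atomic.Bool

// unregisterHandler lets an operator (or an eviction hook) flip the setting
// at runtime instead of restarting the ingester with a different flag.
func unregisterHandler(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodPut:
		unregisterOnShutdown.Store(true)
	case http.MethodDelete:
		unregisterOnShutdown.Store(false)
	case http.MethodGet:
		// fall through and report the current value
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	fmt.Fprintf(w, "unregister_on_shutdown: %t\n", unregisterOnShutdown.Load())
}

func main() {
	unregisterOnShutdown.Store(true) // matches Mimir's current default
	// Hypothetical path for this sketch only.
	http.HandleFunc("/ingester/unregister-on-shutdown", unregisterHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```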
> **Cons**:
> * Queries may return incomplete results as samples held by the evicted ingesters are temporarily inaccessible. The samples become accessible again when the ingesters join the ring after having replayed their WAL.
> * Ownership of evicted series (i.e., series whose owner was evicted) is split between two ingesters. This is likely […]
To confirm my understanding, this is a result of resharding, right?
> this is a result of resharding, right?

Yes. Consider this example:

- Ingester `a-1` owns series `up{app="database"}`.
- `a-1` ingests a couple of samples from series `up{app="database"}`.
- `a-1` is evicted.
- As soon as `a-1` is evicted, ownership of `up{app="database"}` is implicitly passed to `a-5`.
- While `a-1` is restarting, `a-5` ingests samples from the `up{app="database"}` series.
- Ownership of series `up{app="database"}` remains with `a-5` even after `a-1` has joined the ring once again.[^1]
- Ownership of series `up{app="database"}` is now split since both `a-1` and `a-5` hold samples for the series.

[^1]: Unless tokens are persisted on disk via `-ingester.ring.tokens-file-path`, in which case ownership of the series will be passed back to `a-1`.
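To make the mechanics above concrete, here is a minimal, self-contained Go sketch of token-ring ownership. It is not Mimir's ring implementation: the `ring` type, the byte-sum `toyHash`, and the token values are simplified assumptions chosen so the example plays out deterministically.

```go
package main

import (
	"fmt"
	"sort"
)

// ring is a toy token ring: each ingester registers tokens, and a series is
// owned by the ingester that registered the first token at or after the
// series hash (wrapping around).
type ring struct {
	tokens []uint32          // sorted tokens registered by ingesters
	owner  map[uint32]string // token -> ingester name
}

// toyHash stands in for a real series hash; a byte sum keeps the example
// predictable (toyHash(`up{app="database"}`) == 1748).
func toyHash(series string) uint32 {
	var h uint32
	for i := 0; i < len(series); i++ {
		h += uint32(series[i])
	}
	return h
}

func (r *ring) register(ingester string, tokens ...uint32) {
	for _, t := range tokens {
		r.tokens = append(r.tokens, t)
		r.owner[t] = ingester
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i] < r.tokens[j] })
}

func (r *ring) unregister(ingester string) {
	kept := r.tokens[:0]
	for _, t := range r.tokens {
		if r.owner[t] == ingester {
			delete(r.owner, t)
			continue
		}
		kept = append(kept, t)
	}
	r.tokens = kept
}

func (r *ring) owns(series string) string {
	h := toyHash(series)
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= h })
	if i == len(r.tokens) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.tokens[i]]
}

func main() {
	series := `up{app="database"}`

	r := &ring{owner: map[uint32]string{}}
	r.register("a-1", 2000)
	r.register("a-5", 1000)
	fmt.Println("owner before eviction:", r.owns(series)) // a-1

	r.unregister("a-1") // a-1 is evicted and leaves the ring
	fmt.Println("owner while a-1 is away:", r.owns(series)) // a-5

	r.register("a-1", 2000) // a-1 rejoins with the same (persisted) token
	fmt.Println("owner after a-1 rejoins:", r.owns(series)) // a-1 again
}
```

Running it prints `a-1`, `a-5`, `a-1`: eviction implicitly hands the series to the next ingester on the ring, and rejoining with persisted tokens hands it back, matching the footnote above.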
Right, thank you. I find this aspect of the proposal quite worrisome, to be honest, but whether it is a problem or not depends on your specifics.
(I will "unresolve" this conversation, as it has context that others may find useful.)
@pstibrany Is there anything in particular with split series ownership that concerns you?
If my understanding is correct, split ownership can already happen if ingesters are configured (in the standard way) to leave the ring on shutdown. I suppose the endpoint makes it easier to trigger a state of split ownership.
> split ownership can already happen if ingesters are configured (in the standard way) to leave the ring on shutdown

That's correct. However, we don't recommend leaving the ring on shutdown (e.g. we don't do it in our Jsonnet or Helm). The only reason why the default is still `-ingester.ring.unregister-on-shutdown=true` is that it's easier for first-time Mimir operators (at the beginning it's difficult to understand the implications of not leaving the ring on shutdown).
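As a concrete example, the setup recommended here, combined with the token persistence mentioned earlier in this thread, could be expressed with flags like the following. The token file path is only a placeholder, and pairing the two flags is an illustrative combination rather than something this thread prescribes:

```
# Keep the ingester registered in the ring across restarts.
-ingester.ring.unregister-on-shutdown=false

# Persist tokens on disk so a restarted ingester reclaims the same series.
-ingester.ring.tokens-file-path=/data/tokens
```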
> @pstibrany Is there anything in particular with split series ownership that concerns you?
Resharding of the series and the related increased usage of resources (memory, CPU) on other ingesters. If ingesters leave the ring frequently, surviving ingesters may accumulate a lot of series in memory that would otherwise belong to other ingesters. It's something to keep in mind when planning for capacity: you save cost by using spot instances, but you may need to plan for larger instances instead. As you mention, it depends on how often ingester evictions happen.
We've not implemented it yet. We wanted to ensure that we had at least some buy-in from the Mimir team before starting with the implementation.
**With the maintainer hat**

I agree with Peter that the proposed change looks small. I also think you don't need to get these changes into Mimir's `main` to start testing it. It would be nice (but not a blocker) if you could just make the changes, deploy them in your cluster, and share some data points about how it works for you.

Mimir is quite opinionated about guaranteeing that we never return partial query results. The only reason this proposal doesn't break that stance is that the proposed API is generic, and it's how it gets used that may cause partial query results (but that's no different from clicking the "Forget" button on the ring page).
**With the operator hat**

I'm personally not much convinced about the approach you're going to take. I understand cutting costs, but doubting whether the query results you get are consistent may make for a bad query experience.

I also think that if a significant percentage of ingesters is removed from the ring, the remaining ones will suffer a thundering-herd effect that will cause an outage anyway. In my experience, recovering from such an outage can be tricky: if the remaining ingesters have ingested a volume of data they can't practically handle, the only way to recover is vertical scaling, because all data written to the WAL will still be there on a subsequent restart (the alternative is deleting the WAL, which is quite bad).

That being said, these are my personal considerations on how you plan to use the API, not on the API itself, about which I have no concerns.
@pstibrany @pracucci Thanks for your feedback. We will get started with an implementation and report back when we've tried it out, which could take a while.
I'll leave this review for @grafana/mimir-maintainers and I've opened #7493 for the future.
See #5901 (comment). @pstibrany Now that the parent pull request has been merged, should we merge this one also?
Let's do that.
Looks like a CI check is failing, let me see if I can fix that.
Head branch was pushed to by a user without write access
Fixed! @pstibrany I think auto-merge was disabled because I pushed a new commit to fix the pipeline. We should be good to merge now.
What this PR does
This pull request adds a document which summarises #5901 and introduces a solution proposal.
Checklist
- Tests updated.
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- `about-versioning.md` updated with experimental features.