[Idea] Salvage Activations from Defunct Directory Partitions

This is not my idea, but I'm documenting my understanding of it here for discussion so that we can come up with an action plan.

First, a primer on how Orleans finds where a grain activation lives - correct me if I'm wrong: 
* Orleans supports a flexible placement model where the placement of an activation is dynamic and guided by its placement policy rather than being restricted to a fixed calculation.
* To support this model, Orleans maintains a distributed lookup table of GrainId -> ActivationId, where the ActivationId points to the silo which hosts the activation.
* This distributed table is partitioned across all silos in the cluster, where each silo holds one partition of the directory.
* Silos are placed around a conceptual ring based upon a hash of their `SiloAddress` (similar to nodes in a [Chord DHT](https://en.wikipedia.org/wiki/Chord_(peer-to-peer))).
* In order to find where a grain activation exists in the cluster, its `GrainId` is mapped to a silo on that ring by finding the silo with the largest `hash(SiloAddress)` less than the `hash(GrainId)`. This is the Primary Directory Partition for this grain.
* When a silo is added to the cluster, the directory on each silo is notified and they each perform a hand-off for any grains which have a new primary directory partition (based on the above algorithm).
* When a silo is removed, similar rebalancing happens so that new activations can be placed on a surviving silo.
* Each silo maintains a local cache of parts of the distributed table as an optimization, similar to a DNS cache.
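
To make the ring lookup above concrete for discussion, here is a minimal Python sketch of how I understand it. Everything in it is illustrative: `ring_hash`, `Ring`, and `reassigned` are hypothetical names, SHA-256 is a stand-in for whatever hash Orleans really uses, and the wrap-around to the highest-hashed silo is my assumption about what happens when no silo hashes below the grain.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    # Stand-in hash (assumption): first 4 bytes of SHA-256, not Orleans' real hash.
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:4], "big")


class Ring:
    """Silos placed on a ring by the hash of their address (hypothetical model)."""

    def __init__(self, silo_addresses):
        if not silo_addresses:
            raise ValueError("a ring needs at least one silo")
        self._points = sorted((ring_hash(s), s) for s in silo_addresses)
        self._hashes = [h for h, _ in self._points]

    def primary_partition(self, grain_id: str) -> str:
        # Index of the silo with the largest hash strictly less than the grain's hash.
        i = bisect.bisect_left(self._hashes, ring_hash(grain_id)) - 1
        # i == -1 means no silo hashes below the grain; Python's negative index
        # then selects the last (highest-hashed) silo, i.e. we wrap around the ring.
        return self._points[i][1]


def reassigned(before: Ring, after: Ring, grain_ids):
    """Grains whose primary directory partition differs between two ring states."""
    return [g for g in grain_ids
            if before.primary_partition(g) != after.primary_partition(g)]
```

Under this model, removing a silo only changes the primary partition of grains that were mapped to that silo, which is why the hand-off on a membership change is limited to one partition's worth of entries.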

That's how a directory partition handles cluster membership changes, but what happens to the actual activations (`Grain` instances) during a membership change? Currently, if a silo dies, every activation whose primary directory partition was on that silo is eagerly deactivated (see `Catalog.SiloStatusChangeNotification`). That is, if SiloA has an activation whose primary directory partition is on SiloB and SiloB dies, then SiloA will kill that activation. This can cause a large amount of activation churn, particularly in small clusters.

The proposal is to register those activations with the correct, surviving directory partition instead of deactivating them.
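
Here is a rough Python sketch contrasting the two behaviours as I understand them. It is a toy model, not the actual `Catalog` code: `Silo`, `on_silo_dead_eager`, `on_silo_dead_salvage`, and the `primary_of_before` / `primary_of_after` lookups are all hypothetical names, and the "first registration wins" rule for duplicates is an assumption on my part.

```python
from dataclasses import dataclass, field


@dataclass
class Silo:
    name: str
    activations: set = field(default_factory=set)  # grain ids hosted on this silo
    directory: dict = field(default_factory=dict)  # partition: grain id -> hosting silo


def on_silo_dead_eager(local, dead, primary_of_before):
    """Current behaviour: deactivate every local activation whose primary
    directory partition lived on the dead silo."""
    doomed = {g for g in local.activations if primary_of_before(g) == dead}
    local.activations -= doomed
    return doomed


def on_silo_dead_salvage(local, dead, primary_of_before, primary_of_after, silos):
    """Proposed behaviour: re-register affected activations with the surviving
    primary partition and keep them alive where possible."""
    salvaged = set()
    for g in sorted(local.activations):
        if primary_of_before(g) != dead:
            continue  # directory entry was not lost, nothing to do
        new_partition = silos[primary_of_after(g)].directory
        # Assumption: the first registration wins; a later one learns the winner.
        winner = new_partition.setdefault(g, local.name)
        if winner == local.name:
            salvaged.add(g)
        else:
            local.activations.discard(g)  # duplicate activation, deactivate ours
    return salvaged
```

In the eager version every affected activation is lost, whereas in the salvage version an activation is only deactivated when another silo has already registered an activation for the same grain.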

We must maintain a few invariants while implementing this optimization:
* Activations must eventually converge to at most one per grain.
* Activations which are tracked using the directory cannot be allowed to exist without being registered in the directory - they cannot be orphaned.
* Activations must eventually be registered in the correct directory partition.
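
To pin down what these invariants mean, here is a small Python checker one could run against a snapshot of a quiesced test cluster. It is only a sketch: `check_invariants` and its input shapes (plain dicts for activations and directory partitions, plus a `primary_of` lookup) are hypothetical and not taken from Orleans.

```python
def check_invariants(activations, directories, primary_of):
    """Return a sorted list of invariant violations in a cluster snapshot.

    activations: silo name -> set of grain ids hosted on that silo
    directories: silo name -> {grain id: hosting silo name} (that silo's partition)
    primary_of:  grain id -> silo name that should hold its directory entry
    """
    problems = []
    hosts = {}
    for silo, grains in activations.items():
        for g in grains:
            hosts.setdefault(g, []).append(silo)

    for g, where in hosts.items():
        # Invariant 1: at most one activation per grain.
        if len(where) > 1:
            problems.append(("duplicate", g, ",".join(sorted(where))))
        for silo in where:
            registered_in = [p for p, d in directories.items() if d.get(g) == silo]
            if not registered_in:
                # Invariant 2: no activation may exist unregistered.
                problems.append(("orphaned", g, silo))
            elif primary_of(g) not in registered_in:
                # Invariant 3: the entry must live in the correct partition.
                problems.append(("misplaced", g, silo))
    return sorted(problems)
```

Since the first and third invariants are "eventual", a check like this only makes sense once hand-off and re-registration have settled; a non-empty result in the middle of a membership change is not necessarily a bug.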

Are there nuances here which I've missed or is this too vague?

[Idea] Salvage Activations from Defunct Directory Partitions #2656
