Batch update with singleton/sharding fixes #3780
Conversation
Let me know when this is ready to review - I'm working on those serialization fixes over here: #3744
a4d4cb3 to c074907
I think this is mostly ready; there is one open topic with persistent sharding passivate-idle.
@zbynek001 looks like there might be some conflicts with the most recent changes in
Introduce 'MemberDowned' member event. Migrated from #25854
Ignore possible state change while waiting for removal. Migrated from #25274
Mark TryToIdentifySingleton as NoSerializationVerificationNeeded. Migrated from #24442
Improvements of the singleton leaving scenario. Migrated from #25639 (#25710): testing of singleton leaving, gossip optimization, Exiting change to two oldest per role, hardening of ClusterSingletonManagerIsStuck restart, increase ClusterSingletonManagerIsStuck
Stop singleton and shards when self MemberDowned. Migrated from #26336 (#26339): it's safer to stop the singleton instance early in case of downing, instead of waiting for MemberRemoved and trying to hand over; stop ShardRegion when self MemberDowned; upper bound when waiting for seen in ShutdownSelfWhenDown
Discard HandOverToMe in state End to avoid an unhandled-message warning. Migrated from #26793
Warn if handOffStopMessage is not handled. Migrated from #25648: stops entities of a shard forcefully if they don't handle the stop message, prints a warning log while stopping the entities, fixes the version in the backward-compatibility exclude file, checks that the shard stopped, and adds documentation for the hand-off timeout
Save EntityStarted when StartEntity is requested via remembered entities (for validation). Migrated from #26061
Improve the default shard rebalancing algorithm. Migrated from #26012 (#26101): use rebalance-threshold=1 because it gives the best distribution; the previous default could result in too large a difference between nodes. Also fixes an off-by-one error (difference > threshold vs >=) and adds more unit tests. In some cases the result may still not be optimal, stopping more shards than necessary, but a strategy that looks at more than the most- and least-loaded regions is out of scope for this issue; in practice those cases shouldn't matter much. The rebalance interval is 10 seconds by default, so shards will typically start up before the next rebalance tick; it's intentionally a slow process so as not to cause instability by moving too much at the same time. (A hedged configuration sketch follows this commit list.)
Always retry sendGracefulShutdownToCoordinator. Migrated from #26214: the issue was reproduced locally with debug logging and is clearly a timing issue; the GracefulShutdownReq message goes to dead letters and is not retried because the coordinator variable was unset, so the cluster-sharding-shutdown-region phase of CoordinatedShutdown timed out
Consolidate duplicate persistence sharding function. Migrated from #26451 (#26452)
Enable passivate-idle-entity-after by default. Migrated from #25782 (#26819)
Persistent shard passivate-idle fix, ddata passivate-idle fix, sharding fixes (part of #26878)
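For reference, here is the configuration sketch for the rebalancing change above. It assumes the standard Akka.NET cluster-sharding HOCON keys under `akka.cluster.sharding.least-shard-allocation-strategy`; the values shown are illustrative, not the defaults shipped by this PR.

```csharp
using Akka.Actor;
using Akka.Configuration;

// Illustrative values only: a rebalance is triggered when the difference between the most
// and least loaded region is strictly greater than rebalance-threshold (the "> vs >=" fix
// mentioned above), and at most max-simultaneous-rebalance shards move per rebalance tick.
var config = ConfigurationFactory.ParseString(@"
    akka.cluster.sharding.least-shard-allocation-strategy {
        rebalance-threshold = 1
        max-simultaneous-rebalance = 3
    }");

using (var system = ActorSystem.Create("ShardedSystem", config))
{
    // Start ClusterSharding with this configuration as usual.
}
```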
Compilation issue on Linux:
LGTM overall - but need to take a look at some of the compilation issues on Linux and some of the changes made to ClusterDaemon in more detail.
_probe2.ExpectMsg<int>(10.Seconds()).Should().Be(2);
_region2.Tell(3, _probe2.Ref);
_probe2.ExpectMsg<int>(10.Seconds()).Should().Be(3);
AwaitAssert(() =>
Good change
}
}

#endregion

protected ClusterShardingSettings settings;
protected readonly TimeSpan smallTolerance = TimeSpan.FromMilliseconds(300);
Yep, saw this when I started work on #3780 - good change
@@ -1109,7 +1109,7 @@ public IActorRef ShardRegionProxy(string typeName)
    throw new ArgumentException($"Shard type [{typeName}] must be started first");
}

-private IShardAllocationStrategy DefaultShardAllocationStrategy(ClusterShardingSettings settings)
+public IShardAllocationStrategy DefaultShardAllocationStrategy(ClusterShardingSettings settings)
Good change
/// <param name="e"></param>
/// <param name="keepNrOfBatches"></param>
/// <param name="snapshotAfter"></param>
internal void InternalDeleteMessagesBeforeSnapshot(SaveSnapshotSuccess e, int keepNrOfBatches, int snapshotAfter)
Like this - not user-friendly enough to make public, but good enough for our internal uses of Eventsourced.
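For anyone wanting the same behaviour outside the internals, here is a minimal sketch of what a user-level persistent actor can do on SaveSnapshotSuccess. The retention arithmetic (keep the last keepNrOfBatches * snapshotAfter events) is an assumption inferred from the parameter names, and the actor and constant names are hypothetical.

```csharp
using Akka.Persistence;

// Hypothetical actor illustrating snapshot-based event cleanup; not the internal helper itself.
public class SnapshottingActor : ReceivePersistentActor
{
    private const int SnapshotAfter = 1000;  // assumed snapshot cadence (every N events)
    private const int KeepNrOfBatches = 2;   // keep the last two "batches" of events

    public override string PersistenceId => "snapshotting-actor";

    public SnapshottingActor()
    {
        Recover<string>(_ => { /* rebuild state from replayed events */ });

        Command<SaveSnapshotSuccess>(success =>
        {
            // Delete events older than the retained window; keep enough to replay recent history.
            var deleteToSequenceNr = success.Metadata.SequenceNr - (long)KeepNrOfBatches * SnapshotAfter;
            if (deleteToSequenceNr > 0)
                DeleteMessages(deleteToSequenceNr);
        });
    }
}
```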
@@ -45,7 +45,7 @@ public CoordinatedShutdownLeave()
{
    // MemberRemoved is needed in case it was downed instead
    _cluster.Leave(_cluster.SelfAddress);
-   _cluster.Subscribe(Self, typeof(ClusterEvent.MemberLeft), typeof(ClusterEvent.MemberRemoved));
+   _cluster.Subscribe(Self, typeof(ClusterEvent.MemberLeft), typeof(ClusterEvent.MemberRemoved), typeof(ClusterEvent.MemberDowned));
Good catch
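For illustration, a sketch of what subscribing to the new event looks like from user code; the actor name and empty handlers are placeholders.

```csharp
using Akka.Actor;
using Akka.Cluster;

// Placeholder actor whose subscription now includes ClusterEvent.MemberDowned,
// so it also reacts when a member is downed instead of leaving gracefully.
public class MemberWatcher : ReceiveActor
{
    private readonly Cluster _cluster = Cluster.Get(Context.System);

    public MemberWatcher()
    {
        Receive<ClusterEvent.MemberLeft>(e => { /* member started leaving */ });
        Receive<ClusterEvent.MemberDowned>(e => { /* member was downed; removal will follow */ });
        Receive<ClusterEvent.MemberRemoved>(e => { /* member fully removed from the cluster */ });
    }

    protected override void PreStart() =>
        _cluster.Subscribe(Self,
            typeof(ClusterEvent.MemberLeft),
            typeof(ClusterEvent.MemberRemoved),
            typeof(ClusterEvent.MemberDowned));

    protected override void PostStop() => _cluster.Unsubscribe(Self);
}
```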
@@ -885,6 +901,9 @@ private static IEnumerable<IMemberEvent> CollectMemberEvents(IEnumerable<Member>
case MemberStatus.Exiting:
    yield return new MemberExited(member);
    break;
case MemberStatus.Down:
Also a good catch
/// Member status changed to `MemberStatus.Down` and will be removed
/// when all members have seen the `Down` status.
/// </summary>
public sealed class MemberDowned : MemberStatusChange
LGTM
@@ -153,10 +153,15 @@ private bool MatchingRole(Member member)
private void HandleInitial(ClusterEvent.CurrentClusterState state)
{
    _membersByAge = state.Members
        .Where(m => (m.Status == MemberStatus.Up || m.Status == MemberStatus.Leaving) && MatchingRole(m))
Makes sense - wouldn't want to pin a ClusterSingleton on a node that's leaving.
@@ -13,6 +13,7 @@

<ItemGroup>
  <PackageReference Include="Microsoft.NET.Test.Sdk" Version="$(TestSdkVersion)" />
  <PackageReference Include="System.ValueTuple" Version="4.4.0" />
Looks like this is what's causing the issues on Linux - let me take a look and see if I can figure out why.
The
Only affects Linux though.
LGTM
@@ -2172,6 +2175,10 @@ private void ShutdownSelfWhenDown()
    SendGossipRandom(MaxGossipsBeforeShuttingDownMyself);
    Shutdown();
}
else
{
    _selfDownCounter++;
Makes sense - gives the node a chance to gossip a few times prior to termination.
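As a rough illustration of that pattern (names and the bound are assumptions; this is not the actual ClusterDaemon code):

```csharp
using System;

// Simplified, hypothetical sketch: once the node sees itself as Down, let it gossip a
// bounded number of times so the Down status can spread, then shut itself down.
public class DownedSelfShutdown
{
    private const int MaxGossipsBeforeShuttingDownMyself = 5; // assumed bound
    private int _selfDownCounter;

    // Called on each gossip tick while the local member's status is Down.
    public void OnGossipTick(Action sendGossipRandom, Action shutdown)
    {
        if (_selfDownCounter >= MaxGossipsBeforeShuttingDownMyself)
        {
            sendGossipRandom(); // one last burst of gossip
            shutdown();         // then terminate this node
        }
        else
        {
            _selfDownCounter++; // not yet: keep gossiping the Down status
        }
    }
}
```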
}

/// <summary>
/// Gossip the Exiting change to the two oldest nodes for quick dissemination to potential Singleton nodes
This optimization makes sense
{
    if (exitingMembers.Any())
    {
        var roles = exitingMembers.SelectMany(m => m.Roles);
Might be enumerating the exitingMembers collection multiple times, but this doesn't appear to be an issue per the actual unit tests and MNTR results.
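If that ever became a concern, a minimal fix is to materialize the sequence once before reusing it. The method shape below is illustrative, not the actual gossip code being reviewed.

```csharp
using System.Collections.Generic;
using System.Linq;
using Akka.Cluster;

public static class ExitingGossipHelper
{
    // Illustrative helper: enumerate the source exactly once, then reuse the list freely.
    public static void GossipExitingToOldest(IEnumerable<Member> exitingMembersSource)
    {
        var exitingMembers = exitingMembersSource.ToList(); // single enumeration

        if (exitingMembers.Count == 0)
            return;

        // Subsequent queries iterate the materialized list, not the original lazy sequence.
        var roles = exitingMembers.SelectMany(m => m.Roles).Distinct().ToList();
        // ...pick the two oldest exiting members per role and gossip the Exiting change to them...
    }
}
```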
Thanks for your contribution @zbynek001
* Introduce 'MemberDowned' member event Migrated from #25854
* Ignore possible state change while waiting for removal Migrated from #25274
* mark TryToIdentifySingleton as NoSerializationVerificationNeeded Migrated from #24442
* Improvements of singleton leaving scenario Migrated from #25639 (#25710) Testing of singleton leaving gossip optimization, exiting change to two oldest per role hardening ClusterSingletonManagerIsStuck restart, increase ClusterSingletonManagerIsStuck
* Stop singleton and shards when self MemberDowned Migrated from #26336 (#26339) Stop singleton when self MemberDowned
* It's safer to stop singleton instance early in case of downing.
* Instead of waiting for MemberRemoved and trying to hand over. Stop ShardRegion when self MemberDowned Upper bound when waiting for seen in shutdownSelfWhenDown
* Discards HandOverToMe in state End to avoid unhandled message warning Migrated from #26793
* Warn if handOffStopMessage not handled Migrated from #25648 Stops entities of shard forcefully if they don't handle stopMessage Prints a warning log while stopping the entities fix version of backward exclude file and checks for shard stopped adds documentation for handoff timeout
* Save EntityStarted when StartEntity requested via remembered entities (for validation) Migrated from #26061
* Improve default shard rebalancing algorithm Migrated from #26012 (#26101)
* Use rebalance-threshold=1 because it will give the best distribution, and previous default could result in too large difference between nodes
* Off by one error, difference > threshold vs >=
* Added more unit tests
* Note that in some cases it may still not be optimal, stopping more shards than necessary, but a different strategy that looks at more than most and least is out of scope for this issue. In practice those cases shouldn't matter much.
* Also note that the rebalance interval is by default 10 seconds, so typically shards will start up before next rebalance tick. It's intentionally a slow process to not cause instabilities by moving too much at the same time.
* Always retry sendGracefulShutdownToCoordinator Migrated from #26214
* I could reproduce the issue locally with debug logging and it's clear that it's a timing issue. The GracefulShutdownReq message goes to deadletters and it's not retried because the coordinator variable was unset.
* cluster-sharding-shutdown-region phase of CoordinatedShutdown timed out
* Consolidate duplicate persistence sharding function Migrated from #26451 (#26452)
* API approval
* sharding with ddata specs timing issue fix
* Enable passivate-idle-entity-after by default Migrated from #25782 (#26819)
* persistent shard passivate-idle fix ddata passivate-idle fix sharding fixes (part of #26878)
* Removed references to System.ValueTuple
Mostly singleton & sharding fixes, plus some dependencies
Will split it into smaller parts if needed
Included:
akka/akka#25854
akka/akka#25274
akka/akka#24442
akka/akka#25639
akka/akka#26336
akka/akka#26793
akka/akka#25648
akka/akka#26061
akka/akka#26012
akka/akka#26214
akka/akka#26451
akka/akka#26819
& persistent shard passivate-idle fix