Batch update with singleton/sharding fixes #3780
Conversation
Let me know when this is ready to review - I'm working on those serialization fixes over here: #3744
a4d4cb3 to c074907
I think this is mostly ready; there is one open topic with persistent sharding passivate-idle.
@zbynek001 looks like there might be some conflicts with the most recent changes in
Introduce 'MemberDowned' member event. Migrated from #25854
Ignore possible state change while waiting for removal. Migrated from #25274
Mark TryToIdentifySingleton as NoSerializationVerificationNeeded. Migrated from #24442
Improvements of the singleton leaving scenario. Migrated from #25639 (#25710): testing of singleton leaving, gossip optimization, Exiting change to two oldest per role, hardening of ClusterSingletonManagerIsStuck restart, increase ClusterSingletonManagerIsStuck
Stop singleton and shards when self MemberDowned. Migrated from #26336 (#26339): it's safer to stop the singleton instance early in case of downing, instead of waiting for MemberRemoved and trying to hand over; stop ShardRegion when self MemberDowned; upper bound when waiting for seen in ShutdownSelfWhenDown
Discard HandOverToMe in state End to avoid an unhandled-message warning. Migrated from #26793
Warn if handOffStopMessage is not handled. Migrated from #25648: stops entities of a shard forcefully if they don't handle the stop message, prints a warning log while stopping the entities, fixes the version in the backward-compatibility exclude file, checks that the shard stopped, and adds documentation for the hand-off timeout
Save EntityStarted when StartEntity is requested via remembered entities (for validation). Migrated from #26061
Improve the default shard rebalancing algorithm. Migrated from #26012 (#26101): use rebalance-threshold=1 because it gives the best distribution; the previous default could result in too large a difference between nodes. Also fixes an off-by-one error (difference > threshold vs >=) and adds more unit tests. In some cases the result may still not be optimal, stopping more shards than necessary, but a strategy that looks at more than the most- and least-loaded regions is out of scope for this issue; in practice those cases shouldn't matter much. The rebalance interval is 10 seconds by default, so shards will typically start up before the next rebalance tick; it's intentionally a slow process so as not to cause instability by moving too much at the same time. (A hedged configuration sketch follows this commit list.)
Always retry sendGracefulShutdownToCoordinator. Migrated from #26214: the issue was reproduced locally with debug logging and is clearly a timing issue; the GracefulShutdownReq message goes to dead letters and is not retried because the coordinator variable was unset, so the cluster-sharding-shutdown-region phase of CoordinatedShutdown timed out
Consolidate duplicate persistence sharding function. Migrated from #26451 (#26452)
Enable passivate-idle-entity-after by default. Migrated from #25782 (#26819)
Persistent shard passivate-idle fix, ddata passivate-idle fix, sharding fixes (part of #26878)
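For reference, here is the configuration sketch for the rebalancing change above. It assumes the standard Akka.NET cluster-sharding HOCON keys under `akka.cluster.sharding.least-shard-allocation-strategy`; the values shown are illustrative, not the defaults shipped by this PR.

```csharp
using Akka.Actor;
using Akka.Configuration;

// Illustrative values only: a rebalance is triggered when the difference between the most
// and least loaded region is strictly greater than rebalance-threshold (the "> vs >=" fix
// mentioned above), and at most max-simultaneous-rebalance shards move per rebalance tick.
var config = ConfigurationFactory.ParseString(@"
    akka.cluster.sharding.least-shard-allocation-strategy {
        rebalance-threshold = 1
        max-simultaneous-rebalance = 3
    }");

using (var system = ActorSystem.Create("ShardedSystem", config))
{
    // Start ClusterSharding with this configuration as usual.
}
```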
Compilation issue on Linux:
LGTM overall - but need to take a look at some of the compilation issues on Linux and some of the changes made to ClusterDaemon in more detail.
_probe2.ExpectMsg<int>(10.Seconds()).Should().Be(2);
_region2.Tell(3, _probe2.Ref);
_probe2.ExpectMsg<int>(10.Seconds()).Should().Be(3);
AwaitAssert(() =>
Good change
}
}

#endregion

protected ClusterShardingSettings settings;
protected readonly TimeSpan smallTolerance = TimeSpan.FromMilliseconds(300);
Yep, saw this when I started work on #3780 - good change
@@ -1109,7 +1109,7 @@ public IActorRef ShardRegionProxy(string typeName)
    throw new ArgumentException($"Shard type [{typeName}] must be started first");
}

-private IShardAllocationStrategy DefaultShardAllocationStrategy(ClusterShardingSettings settings)
+public IShardAllocationStrategy DefaultShardAllocationStrategy(ClusterShardingSettings settings)
Good change
/// <param name="e"></param>
/// <param name="keepNrOfBatches"></param>
/// <param name="snapshotAfter"></param>
internal void InternalDeleteMessagesBeforeSnapshot(SaveSnapshotSuccess e, int keepNrOfBatches, int snapshotAfter)
Like this - not user-friendly enough to make public, but good enough for our internal uses of Eventsourced.
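For anyone wanting the same behaviour outside the internals, here is a minimal sketch of what a user-level persistent actor can do on SaveSnapshotSuccess. The retention arithmetic (keep the last keepNrOfBatches * snapshotAfter events) is an assumption inferred from the parameter names, and the actor and constant names are hypothetical.

```csharp
using Akka.Persistence;

// Hypothetical actor illustrating snapshot-based event cleanup; not the internal helper itself.
public class SnapshottingActor : ReceivePersistentActor
{
    private const int SnapshotAfter = 1000;  // assumed snapshot cadence (every N events)
    private const int KeepNrOfBatches = 2;   // keep the last two "batches" of events

    public override string PersistenceId => "snapshotting-actor";

    public SnapshottingActor()
    {
        Recover<string>(_ => { /* rebuild state from replayed events */ });

        Command<SaveSnapshotSuccess>(success =>
        {
            // Delete events older than the retained window; keep enough to replay recent history.
            var deleteToSequenceNr = success.Metadata.SequenceNr - (long)KeepNrOfBatches * SnapshotAfter;
            if (deleteToSequenceNr > 0)
                DeleteMessages(deleteToSequenceNr);
        });
    }
}
```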
@@ -45,7 +45,7 @@ public CoordinatedShutdownLeave()
{
    // MemberRemoved is needed in case it was downed instead
    _cluster.Leave(_cluster.SelfAddress);
-   _cluster.Subscribe(Self, typeof(ClusterEvent.MemberLeft), typeof(ClusterEvent.MemberRemoved));
+   _cluster.Subscribe(Self, typeof(ClusterEvent.MemberLeft), typeof(ClusterEvent.MemberRemoved), typeof(ClusterEvent.MemberDowned));
Good catch
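For illustration, a sketch of what subscribing to the new event looks like from user code; the actor name and empty handlers are placeholders.

```csharp
using Akka.Actor;
using Akka.Cluster;

// Placeholder actor whose subscription now includes ClusterEvent.MemberDowned,
// so it also reacts when a member is downed instead of leaving gracefully.
public class MemberWatcher : ReceiveActor
{
    private readonly Cluster _cluster = Cluster.Get(Context.System);

    public MemberWatcher()
    {
        Receive<ClusterEvent.MemberLeft>(e => { /* member started leaving */ });
        Receive<ClusterEvent.MemberDowned>(e => { /* member was downed; removal will follow */ });
        Receive<ClusterEvent.MemberRemoved>(e => { /* member fully removed from the cluster */ });
    }

    protected override void PreStart() =>
        _cluster.Subscribe(Self,
            typeof(ClusterEvent.MemberLeft),
            typeof(ClusterEvent.MemberRemoved),
            typeof(ClusterEvent.MemberDowned));

    protected override void PostStop() => _cluster.Unsubscribe(Self);
}
```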
@@ -885,6 +901,9 @@ private static IEnumerable<IMemberEvent> CollectMemberEvents(IEnumerable<Member>
case MemberStatus.Exiting:
    yield return new MemberExited(member);
    break;
case MemberStatus.Down:
Also a good catch
/// Member status changed to `MemberStatus.Down` and will be removed
/// when all members have seen the `Down` status.
/// </summary>
public sealed class MemberDowned : MemberStatusChange
LGTM
@@ -153,10 +153,15 @@ private bool MatchingRole(Member member)
private void HandleInitial(ClusterEvent.CurrentClusterState state)
{
    _membersByAge = state.Members
        .Where(m => (m.Status == MemberStatus.Up || m.Status == MemberStatus.Leaving) && MatchingRole(m))
Makes sense - wouldn't want to pin a ClusterSingleton on a node that's leaving.
@@ -13,6 +13,7 @@

<ItemGroup>
  <PackageReference Include="Microsoft.NET.Test.Sdk" Version="$(TestSdkVersion)" />
  <PackageReference Include="System.ValueTuple" Version="4.4.0" />
Looks like this is what's causing the issues on Linux - let me take a look and see if I can figure out why.
The
Only affects Linux though.
LGTM
@@ -2172,6 +2175,10 @@ private void ShutdownSelfWhenDown()
    SendGossipRandom(MaxGossipsBeforeShuttingDownMyself);
    Shutdown();
}
else
{
    _selfDownCounter++;
Makes sense - gives the node a chance to gossip a few times prior to termination.
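As a rough illustration of that pattern (names and the bound are assumptions; this is not the actual ClusterDaemon code):

```csharp
using System;

// Simplified, hypothetical sketch: once the node sees itself as Down, let it gossip a
// bounded number of times so the Down status can spread, then shut itself down.
public class DownedSelfShutdown
{
    private const int MaxGossipsBeforeShuttingDownMyself = 5; // assumed bound
    private int _selfDownCounter;

    // Called on each gossip tick while the local member's status is Down.
    public void OnGossipTick(Action sendGossipRandom, Action shutdown)
    {
        if (_selfDownCounter >= MaxGossipsBeforeShuttingDownMyself)
        {
            sendGossipRandom(); // one last burst of gossip
            shutdown();         // then terminate this node
        }
        else
        {
            _selfDownCounter++; // not yet: keep gossiping the Down status
        }
    }
}
```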
}

/// <summary>
/// Gossip the Exiting change to the two oldest nodes for quick dissemination to potential Singleton nodes
This optimization makes sense
{
    if (exitingMembers.Any())
    {
        var roles = exitingMembers.SelectMany(m => m.Roles);
Might be enumerating the exitingMembers collection multiple times, but this doesn't appear to be an issue per the actual unit tests and MNTR results.
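If that ever became a concern, a minimal fix is to materialize the sequence once before reusing it. The method shape below is illustrative, not the actual gossip code being reviewed.

```csharp
using System.Collections.Generic;
using System.Linq;
using Akka.Cluster;

public static class ExitingGossipHelper
{
    // Illustrative helper: enumerate the source exactly once, then reuse the list freely.
    public static void GossipExitingToOldest(IEnumerable<Member> exitingMembersSource)
    {
        var exitingMembers = exitingMembersSource.ToList(); // single enumeration

        if (exitingMembers.Count == 0)
            return;

        // Subsequent queries iterate the materialized list, not the original lazy sequence.
        var roles = exitingMembers.SelectMany(m => m.Roles).Distinct().ToList();
        // ...pick the two oldest exiting members per role and gossip the Exiting change to them...
    }
}
```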
Thanks for your contribution @zbynek001
* Introduce 'MemberDowned' member event Migrated from #25854
* Ignore possible state change while waiting for removal Migrated from #25274
* mark TryToIdentifySingleton as NoSerializationVerificationNeeded Migrated from #24442
* Improvements of singleton leaving scenario Migrated from #25639 (#25710) Testing of singleton leaving gossip optimization, exiting change to two oldest per role hardening ClusterSingletonManagerIsStuck restart, increase ClusterSingletonManagerIsStuck
* Stop singleton and shards when self MemberDowned Migrated from #26336 (#26339) Stop singleton when self MemberDowned
* It's safer to stop singleton instance early in case of downing.
* Instead of waiting for MemberRemoved and trying to hand over. Stop ShardRegion when self MemberDowned Upper bound when waiting for seen in shutdownSelfWhenDown
* Discards HandOverToMe in state End to avoid unhandled message warning Migrated from #26793
* Warn if handOffStopMessage not handled Migrated from #25648 Stops entities of shard forcefully if they don't handle stopMessage Prints a warning log while stopping the entities fix version of backward exclude file and checks for shard stopped adds documentation for handoff timeout
* Save EntityStarted when StartEntity requested via remembered entities (for validation) Migrated from #26061
* Improve default shard rebalancing algorithm Migrated from #26012 (#26101)
* Use rebalance-threshold=1 because it will give the best distribution, and previous default could result in too large difference between nodes
* Off by one error, difference > threshold vs >=
* Added more unit tests
* Note that in some cases it may still not be optimal, stopping more shards than necessary, but a different strategy that looks at more than most and least is out of scope for this issue. In practice those cases shouldn't matter much.
* Also note that the rebalance interval is by default 10 seconds, so typically shards will start up before next rebalance tick. It's intentionally a slow process to not cause instabilities by moving too much at the same time.
* Always retry sendGracefulShutdownToCoordinator Migrated from #26214
* I could reproduce the issue locally with debug logging and it's clear that it's a timing issue. The GracefulShutdownReq message goes to deadletters and it's not retried because the coordinator variable was unset.
* cluster-sharding-shutdown-region phase of CoordinatedShutdown timed out
* Consolidate duplicate persistence sharding function Migrated from #26451 (#26452)
* API approval
* sharding with ddata specs timing issue fix
* Enable passivate-idle-entity-after by default Migrated from #25782 (#26819)
* persistent shard passivate-idle fix ddata passivate-idle fix sharding fixes (part of #26878)
* Removed references to System.ValueTuple
Mostly singleton & sharding fixes, plus some dependencies
Will split it into smaller parts if needed
Included:
akka/akka#25854
akka/akka#25274
akka/akka#24442
akka/akka#25639
akka/akka#26336
akka/akka#26793
akka/akka#25648
akka/akka#26061
akka/akka#26012
akka/akka#26214
akka/akka#26451
akka/akka#26819
& persistent shard passivate-idle fix