
Conversation

zbynek001
Contributor

@zbynek001 zbynek001 commented May 6, 2019

Mostly singleton & sharding fixes, plus some dependencies.
I will split this into smaller parts if needed.

Included:
akka/akka#25854
akka/akka#25274
akka/akka#24442
akka/akka#25639
akka/akka#26336
akka/akka#26793
akka/akka#25648
akka/akka#26061
akka/akka#26012
akka/akka#26214
akka/akka#26451
akka/akka#26819
& persistent shard passivate-idle fix

@Aaronontheweb
Member

Let me know when this is ready to review - I'm working on those serialization fixes over here: #3744

@zbynek001 zbynek001 force-pushed the multiple-fixes branch 4 times, most recently from a4d4cb3 to c074907 Compare May 9, 2019 00:57
@zbynek001 zbynek001 changed the title from "[WIP] Batch update with singleton/sharding fixes" to "Batch update with singleton/sharding fixes" May 9, 2019
@zbynek001
Contributor Author

I think this is mostly ready; there is one open topic with the persistent sharding passivate-idle behavior.
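
For context on that open topic, the setting involved is the sharding passivate-idle timeout. A minimal, hedged sketch of how it is typically configured; the config path and the 120s value are assumptions for illustration, not taken from this PR:

```csharp
using Akka.Actor;
using Akka.Configuration;

// Hedged sketch: tune automatic passivation of idle sharded entities via HOCON.
// The exact default adopted by this PR may differ; "120s" here is illustrative.
var config = ConfigurationFactory.ParseString(@"
    akka.cluster.sharding.passivate-idle-entity-after = 120s
");

var system = ActorSystem.Create("sharded-system", config);
```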

@Aaronontheweb Aaronontheweb added this to the 1.4.0 milestone May 14, 2019
@Aaronontheweb Aaronontheweb self-assigned this May 14, 2019
@Aaronontheweb
Member

@zbynek001 looks like there might be some conflicts with the most recent changes in dev I pushed earlier today.

zbynek001 added 15 commits May 17, 2019 09:25
Migrated from #25639 (#25710)

Testing of singleton leaving
gossip optimization, exiting change to two oldest per role
hardening ClusterSingletonManagerIsStuck restart, increase ClusterSingletonManagerIsStuck
Migrated from #26336 (#26339)

Stop singleton when self MemberDowned
  * It's safer to stop singleton instance early in case of downing.
  * Instead of waiting for MemberRemoved and trying to hand over.
Stop ShardRegion when self MemberDowned
Upper bound when waiting for seen in shutdownSelfWhenDown
Migrated from #25648

Stops entities of shard forcefully if they don't handle stopMessage
Prints a warning log while stopping the entities
fix version of backward exclude file and checks for shard stopped
adds documentation for handoff timeout
Migrated from #26012 (#26101)

* Improve default shard rebalancing algorithm

* Use rebalance-threshold=1 because it will give the best distribution,
  and the previous default could result in too large a difference between nodes
* Off by one error, difference > threshold vs >=
* Added more unit tests
* Note that in some cases it may still not be optimal, stopping more
  shards than necessary, but a different strategy that looks at more
  than the most- and least-loaded regions is out of scope for this issue. In practice
  those cases shouldn't matter much (a simplified sketch of the threshold check follows after this commit list).
* Also note that the rebalance interval is by default 10 seconds,
  so typically shards will start up before next rebalance tick.
  It's intentionally a slow process to not cause instabilities by
  moving too much at the same time.
Migrated from #26214

* I could reproduce the issue locally with debug logging and it's clear that it's a
  timing issue. The GracefulShutdownReq message goes to deadletters and it's not
  retried because the coordinator variable was unset.
* cluster-sharding-shutdown-region phase of CoordinatedShutdown timed out
Migrated from #25782 (#26819)
ddata passivate-idle fix
sharding fixes (part of #26878)
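
To make the "Improve default shard rebalancing algorithm" commit above concrete, here is a minimal, hedged sketch of the threshold check it describes; the types and method are simplified illustrations, not the actual LeastShardAllocationStrategy source:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified sketch (assumed shape, not the real allocation strategy): with
// rebalance-threshold = 1, shards are moved only when the most-loaded region has at
// least two more shards than the least-loaded one (strict '>' is the off-by-one fix).
public static class RebalanceSketch
{
    public static IEnumerable<string> ShardsToRebalance(
        IReadOnlyCollection<IReadOnlyList<string>> shardsPerRegion, // shard ids grouped by region
        int rebalanceThreshold = 1,
        int maxSimultaneousRebalance = 3)
    {
        if (shardsPerRegion.Count == 0) return Enumerable.Empty<string>();

        var mostLoaded = shardsPerRegion.OrderByDescending(r => r.Count).First();
        var leastLoaded = shardsPerRegion.OrderBy(r => r.Count).First();

        var difference = mostLoaded.Count - leastLoaded.Count;
        if (difference <= rebalanceThreshold)
            return Enumerable.Empty<string>(); // balanced enough, nothing to hand off

        // Hand off only a few shards per tick; the ~10s rebalance interval keeps this gradual.
        var count = Math.Min(maxSimultaneousRebalance, difference - rebalanceThreshold);
        return mostLoaded.Take(count);
    }
}
```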
@Aaronontheweb
Member

Aaronontheweb commented May 21, 2019

Looks like this is already covered inside #3801

Edit: wrote that comment backwards - you took care of #3801 so I can go close that PR, I think

@Aaronontheweb
Member

Compilation issue on Linux:

Singleton/ClusterSingletonLeavingSpeedSpec.cs(118,18): error CS8137: Cannot define a class or member that utilizes tuples because the compiler required type 'System.Runtime.CompilerServices.TupleElementNamesAttribute' cannot be found. Are you missing a reference? [/home/vsts/work/1/s/src/contrib/cluster/Akka.Cluster.Tools.Tests/Akka.Cluster.Tools.Tests.csproj]
Singleton/ClusterSingletonLeavingSpeedSpec.cs(118,18): error CS8356: Predefined type 'System.ValueTuple`2' is declared in multiple referenced assemblies: 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089' and 'System.ValueTuple, Version=4.0.2.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51' [/home/vsts/work/1/s/src/contrib/cluster/Akka.Cluster.Tools.Tests/Akka.Cluster.Tools.Tests.csproj]


@Aaronontheweb Aaronontheweb left a comment


LGTM overall - but I need to take a look at some of the compilation issues on Linux and some of the changes made to ClusterDaemon in more detail.

_probe2.ExpectMsg<int>(10.Seconds()).Should().Be(2);
_region2.Tell(3, _probe2.Ref);
_probe2.ExpectMsg<int>(10.Seconds()).Should().Be(3);
AwaitAssert(() =>
Member

Good change

}
}

#endregion

protected ClusterShardingSettings settings;
protected readonly TimeSpan smallTolerance = TimeSpan.FromMilliseconds(300);
Member

Yep, saw this when I started work on #3780 - good change

@@ -1109,7 +1109,7 @@ public IActorRef ShardRegionProxy(string typeName)
throw new ArgumentException($"Shard type [{typeName}] must be started first");
}

private IShardAllocationStrategy DefaultShardAllocationStrategy(ClusterShardingSettings settings)
public IShardAllocationStrategy DefaultShardAllocationStrategy(ClusterShardingSettings settings)
Member

Good change
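
Since DefaultShardAllocationStrategy is now public, a hedged usage sketch follows. The Start overload shape is from memory of the Akka.NET API and may differ slightly; CustomerActor and CustomerMessageExtractor are placeholders defined only so the sketch stands alone:

```csharp
using Akka.Actor;
using Akka.Cluster.Sharding;
using Akka.Configuration;

var config = ConfigurationFactory.ParseString("akka.actor.provider = cluster");
var system = ActorSystem.Create("sharded-system", config);

var sharding = ClusterSharding.Get(system);
var settings = ClusterShardingSettings.Create(system);

// The previously-private default strategy can now be obtained, reused, or wrapped.
var allocationStrategy = sharding.DefaultShardAllocationStrategy(settings);

var region = sharding.Start(
    "customer",                       // typeName
    Props.Create<CustomerActor>(),    // placeholder entity actor (defined below)
    settings,
    new CustomerMessageExtractor(),   // placeholder message extractor (defined below)
    allocationStrategy,
    PoisonPill.Instance);             // handOffStopMessage

// Minimal placeholders so the sketch compiles; real entities/extractors go here.
public sealed class CustomerActor : ReceiveActor { }

public sealed class CustomerMessageExtractor : HashCodeMessageExtractor
{
    public CustomerMessageExtractor() : base(100) { } // max number of shards
    public override string EntityId(object message) => message?.ToString() ?? "unknown";
}
```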

/// <param name="e"></param>
/// <param name="keepNrOfBatches"></param>
/// <param name="snapshotAfter"></param>
internal void InternalDeleteMessagesBeforeSnapshot(SaveSnapshotSuccess e, int keepNrOfBatches, int snapshotAfter)
Member

I like this - not user-friendly enough to make public, but good enough for our internal uses of Eventsourced.
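
For readers outside the internals, here is a minimal hedged sketch of what a helper like this typically does inside a persistent actor. The arithmetic and snapshot criteria below are assumptions about the usual pattern, not copied from the actual implementation:

```csharp
using Akka.Persistence;

public class SnapshotTrimmingActorSketch : UntypedPersistentActor
{
    public override string PersistenceId => "snapshot-trimming-sketch";

    protected override void OnRecover(object message) { /* not relevant to the sketch */ }

    protected override void OnCommand(object message)
    {
        if (message is SaveSnapshotSuccess success)
            DeleteMessagesBeforeSnapshotSketch(success, keepNrOfBatches: 2, snapshotAfter: 100);
    }

    // Hedged sketch: keep keepNrOfBatches * snapshotAfter events behind the snapshot,
    // delete everything older from the journal, then drop snapshots older than the
    // one just taken. Assumed pattern only; the real internal helper may differ.
    private void DeleteMessagesBeforeSnapshotSketch(SaveSnapshotSuccess e, int keepNrOfBatches, int snapshotAfter)
    {
        var deleteTo = e.Metadata.SequenceNr - (long)keepNrOfBatches * snapshotAfter;
        if (deleteTo > 0)
            DeleteMessages(deleteTo);

        DeleteSnapshots(new SnapshotSelectionCriteria(e.Metadata.SequenceNr - 1));
    }
}
```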

@@ -45,7 +45,7 @@ public CoordinatedShutdownLeave()
{
// MemberRemoved is needed in case it was downed instead
_cluster.Leave(_cluster.SelfAddress);
_cluster.Subscribe(Self, typeof(ClusterEvent.MemberLeft), typeof(ClusterEvent.MemberRemoved));
_cluster.Subscribe(Self, typeof(ClusterEvent.MemberLeft), typeof(ClusterEvent.MemberRemoved), typeof(ClusterEvent.MemberDowned));
Member

Good catch

@@ -885,6 +901,9 @@ private static IEnumerable<IMemberEvent> CollectMemberEvents(IEnumerable<Member>
case MemberStatus.Exiting:
yield return new MemberExited(member);
break;
case MemberStatus.Down:
Member

Also a good catch

/// Member status changed to `MemberStatus.Down` and will be removed
/// when all members have seen the `Down` status.
/// </summary>
public sealed class MemberDowned : MemberStatusChange
Member

LGTM
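
A short hedged example of how user code could observe the new MemberDowned event once this lands; the actor name and log messages are illustrative only:

```csharp
using Akka.Actor;
using Akka.Cluster;
using Akka.Event;

public class DownWatcher : ReceiveActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();

    public DownWatcher()
    {
        var cluster = Cluster.Get(Context.System);
        // Replay the current state as events, then receive live MemberDowned notifications.
        cluster.Subscribe(Self, ClusterEvent.InitialStateAsEvents, typeof(ClusterEvent.MemberDowned));

        Receive<ClusterEvent.MemberDowned>(downed =>
        {
            if (downed.Member.Address.Equals(cluster.SelfAddress))
                _log.Warning("This node was downed; local singletons/shard regions will be stopped early.");
            else
                _log.Info("Member {0} was downed.", downed.Member);
        });
    }

    protected override void PostStop() => Cluster.Get(Context.System).Unsubscribe(Self);
}
```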

@@ -153,10 +153,15 @@ private bool MatchingRole(Member member)
private void HandleInitial(ClusterEvent.CurrentClusterState state)
{
_membersByAge = state.Members
.Where(m => (m.Status == MemberStatus.Up || m.Status == MemberStatus.Leaving) && MatchingRole(m))
Member

Makes sense - wouldn't want to pin a ClusterSingleton on a node that's leaving.

@@ -13,6 +13,7 @@

<ItemGroup>
<PackageReference Include="Microsoft.NET.Test.Sdk" Version="$(TestSdkVersion)" />
<PackageReference Include="System.ValueTuple" Version="4.4.0" />
Member

Looks like this is what's causing the issues on Linux - let me take a look and see if I can figure out why.

@Aaronontheweb
Member

The ValueTuple compilation error appears to be caused by something that's installed on the machines running on Azure Pipelines - could be related to this: https://stackoverflow.com/a/44770334/377476

@Aaronontheweb
Member

Only affects Linux though.


@Aaronontheweb Aaronontheweb left a comment


LGTM

@@ -2172,6 +2175,10 @@ private void ShutdownSelfWhenDown()
SendGossipRandom(MaxGossipsBeforeShuttingDownMyself);
Shutdown();
}
else
{
_selfDownCounter++;
Member

Makes sense - gives the node a chance to gossip a few times prior to termination.
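
A minimal self-contained sketch of the bounded-wait idea being reviewed here; the class name, field names, and tick limit are illustrative, not the real ClusterDaemon members:

```csharp
// Hedged sketch: once the node is Down it waits for the Down status to be seen by the
// other members, but only for a fixed number of gossip ticks before terminating anyway.
public sealed class DownedShutdownGate
{
    private readonly int _maxTicks;
    private int _ticksWhileDown;

    public DownedShutdownGate(int maxTicks = 5) => _maxTicks = maxTicks;

    /// <summary>Returns true when the node should stop gossiping and shut itself down.</summary>
    public bool OnGossipTick(bool downSeenByAllOthers)
    {
        if (downSeenByAllOthers || _ticksWhileDown >= _maxTicks)
            return true;

        _ticksWhileDown++; // keep gossiping a little longer so the Down status spreads
        return false;
    }
}
```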

}

/// <summary>
/// Gossip the Exiting change to the two oldest nodes for quick dissemination to potential Singleton nodes
Member

This optimization makes sense

{
if (exitingMembers.Any())
{
var roles = exitingMembers.SelectMany(m => m.Roles);
Member

Might be enumerating the exitingMembers collection multiple times, but this doesn't appear to be an issue per the actual unit tests and MNTR results.
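
If it ever did become an issue, a hedged sketch of the usual fix - materialize the roles once so the sequence is only walked a single time (the helper class and method names are illustrative):

```csharp
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Linq;
using Akka.Cluster;

public static class ExitingRolesSketch
{
    // SelectMany + ToImmutableHashSet enumerates exitingMembers exactly once and
    // de-duplicates roles across members before any later lookups.
    public static ImmutableHashSet<string> RolesOf(IEnumerable<Member> exitingMembers) =>
        exitingMembers.SelectMany(m => m.Roles).ToImmutableHashSet();
}
```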

@Aaronontheweb Aaronontheweb merged commit 2ad84ee into akkadotnet:dev May 23, 2019
@Aaronontheweb
Member

thanks for your contribution @zbynek001

madmonkey pushed a commit to madmonkey/akka.net that referenced this pull request Jul 12, 2019
* Introduce 'MemberDowned' member event
Migrated from #25854

* Ignore possible state change while waiting for removal
Migrated from #25274

* mark TryToIdentifySingleton as NoSerializationVerificationNeeded
Migrated from #24442

* Improvements of singleton leaving scenario
Migrated from #25639 (#25710)

Testing of singleton leaving
gossip optimization, exiting change to two oldest per role
hardening ClusterSingletonManagerIsStuck restart, increase ClusterSingletonManagerIsStuck

* Stop singleton and shards when self MemberDowned
Migrated from #26336 (#26339)

Stop singleton when self MemberDowned
  * It's safer to stop singleton instance early in case of downing.
  * Instead of waiting for MemberRemoved and trying to hand over.
Stop ShardRegion when self MemberDowned
Upper bound when waiting for seen in shutdownSelfWhenDown

* Discards HandOverToMe in state End to avoid unhandled message warning
Migrated from #26793

* Warn if handOffStopMessage not handled
Migrated from #25648

Stops entities of shard forcefully if they don't handle stopMessage
Prints a warning log while stopping the entities
fix version of backward exclude file and checks for shard stopped
adds documentation for handoff timeout

* Save EntityStarted when StartEntity requested via remembered entities (for validation)
Migrated from #26061

* Improve default shard rebalancing algorithm
Migrated from #26012 (#26101)

* Improve default shard rebalancing algorithm

* Use rebalance-threshold=1 because it will give the best distribution,
  and the previous default could result in too large a difference between nodes
* Off by one error, difference > threshold vs >=
* Added more unit tests
* Note that in some cases it may still not be optimal, stopping more
  shards than necessary, but a different strategy that looks at more
  than the most- and least-loaded regions is out of scope for this issue. In practice
  those cases shouldn't matter much.
* Also note that the rebalance interval is by default 10 seconds,
  so typically shards will start up before next rebalance tick.
  It's intentionally a slow process to not cause instabilities by
  moving too much at the same time.

* Always retry sendGracefulShutdownToCoordinator
Migrated from #26214

* I could reproduce the issue locally with debug logging and it's clear that it's a
  timing issue. The GracefulShutdownReq message goes to deadletters and it's not
  retried because the coordinator variable was unset.
* cluster-sharding-shutdown-region phase of CoordinatedShutdown timed out

* Consolidate duplicate persistence sharding function
Migrated from #26451 (#26452)

* API approval

* sharding with ddata specs timing issue fix

* Enable passivate-idle-entity-after by default
Migrated from #25782 (#26819)

* persistent shard passivate-idle fix
ddata passivate-idle fix
sharding fixes (part of #26878)

* Removed references to System.ValueTuple
Aaronontheweb pushed a commit to Aaronontheweb/akka.net that referenced this pull request Jul 21, 2019
Aaronontheweb pushed a commit to Aaronontheweb/akka.net that referenced this pull request Jul 21, 2019
Aaronontheweb pushed a commit to Aaronontheweb/akka.net that referenced this pull request Jul 26, 2019
@zbynek001 zbynek001 deleted the multiple-fixes branch July 16, 2021 12:48