fix node stuck at joining issues #2773

Aaronontheweb · 2017-06-20T04:10:04Z

close #2584
allowed SurviveNetworkInstabilitySpec to run all the way

Aaronontheweb

Left some comments detailing the changes in this PR

Aaronontheweb · 2017-06-20T04:21:25Z

src/core/Akka.Cluster.Tests.MultiNode/SurviveNetworkInstabilitySpec.cs

@@ -43,7 +44,7 @@ public SurviveNetworkInstabilitySpecConfig()
            Seventh = Role("seventh");
            Eighth = Role("eighth");

-            CommonConfig = DebugConfig(false)
+            CommonConfig = DebugConfig(true)


This is still a WIP for the moment. I had some massive issues with Windows Defender going berserk and eating up 100% of the CPU when I ran this locally; hoping the build server sheds some better light on it.

Aaronontheweb · 2017-06-20T04:21:48Z

src/core/Akka.Cluster.Tests.MultiNode/SurviveNetworkInstabilitySpec.cs

@@ -135,7 +136,7 @@ private void AssertCanTalk(params RoleName[] alive)
            {
                foreach (var to in alive)
                {
-                    var sel = Sys.ActorSelection(Node(to) / "user" / "echo");
+                    var sel = Sys.ActorSelection(new RootActorPath(GetAddress(to)) / "user" / "echo");


GetAddress uses caching under the hood, so we can save ourselves some network calls here
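
For anyone reading along, the caching idea can be sketched roughly as below. This is not the MultiNodeSpec implementation: `AddressCache`, its `resolve` delegate, and the `Lookups` counter are hypothetical names used only for illustration, and addresses are plain strings to keep the sketch self-contained.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical cache: resolves each role's address once and reuses the result.
public sealed class AddressCache
{
    private readonly ConcurrentDictionary<string, string> _cache =
        new ConcurrentDictionary<string, string>();
    private readonly Func<string, string> _resolve; // stands in for the expensive network lookup
    private int _lookups;

    public AddressCache(Func<string, string> resolve)
    {
        _resolve = resolve;
    }

    // Number of times the resolver actually ran.
    public int Lookups => _lookups;

    public string GetAddress(string role)
    {
        // GetOrAdd only invokes the factory on a cache miss. Under heavy
        // contention the factory can run more than once for the same key,
        // but every caller still observes a single cached value.
        return _cache.GetOrAdd(role, r =>
        {
            Interlocked.Increment(ref _lookups);
            return _resolve(r);
        });
    }
}
```

Called twice with the same role from a single thread, the resolver runs only once, which is where the saved network calls would come from.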

Aaronontheweb · 2017-06-20T04:22:26Z

src/core/Akka.Cluster.Tests/ClusterGenerators.cs

+
+namespace Akka.Cluster.Tests
+{
+    public class ClusterGenerators


Could technically delete this class - used it for some FsCheck tests I deleted as they were no longer needed

Aaronontheweb · 2017-06-20T04:23:25Z

src/core/Akka.Cluster.Tests/VectorClockSpec.cs

+            var a = VectorClock.Create().Increment(node1).Increment(node2);
+            var b = a.Prune(node2).Increment(node1); // remove node2, increment node1
+
+            a.CompareTo(b).Should().Be(VectorClock.Ordering.Concurrent);


Verifying a remove scenario explicitly
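
To spell out why that assertion expects a concurrent ordering, here is a minimal sketch of a vector clock comparison. It assumes the textbook rule that a missing entry counts as version 0; `MiniVectorClock` and `ClockOrdering` are hypothetical names, not the Akka.Cluster types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical ordering enum, not Akka.Cluster's VectorClock.Ordering.
public enum ClockOrdering { Same, Before, After, Concurrent }

public static class MiniVectorClock
{
    // Compares two clocks entry by entry; a node missing from a clock counts as version 0.
    public static ClockOrdering Compare(
        IReadOnlyDictionary<string, long> a,
        IReadOnlyDictionary<string, long> b)
    {
        bool aAhead = false, bAhead = false;
        foreach (var node in a.Keys.Union(b.Keys))
        {
            long va = a.TryGetValue(node, out var x) ? x : 0;
            long vb = b.TryGetValue(node, out var y) ? y : 0;
            if (va > vb) aAhead = true;
            if (vb > va) bAhead = true;
        }

        // Each side is ahead on at least one node: neither clock descends from the other.
        if (aAhead && bAhead) return ClockOrdering.Concurrent;
        if (aAhead) return ClockOrdering.After;
        if (bAhead) return ClockOrdering.Before;
        return ClockOrdering.Same;
    }
}
```

With `a = { node1: 1, node2: 1 }` and `b = { node1: 2 }` (node2 pruned, node1 incremented), `a` is ahead on node2 and `b` is ahead on node1, so the comparison comes out as `Concurrent`.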

Aaronontheweb · 2017-06-20T04:24:32Z

src/core/Akka.Cluster/ClusterDaemon.cs

@@ -1015,7 +1015,7 @@ public ClusterCoreDaemon(IActorRef publisher)
            _cluster = Cluster.Get(Context.System);
            _publisher = publisher;
            SelfUniqueAddress = _cluster.SelfUniqueAddress;
-            _vclockNode = new VectorClock.Node(VclockName(SelfUniqueAddress));
+            _vclockNode = VectorClock.Node.Create(VclockName(SelfUniqueAddress));


THIS IS THE FIX FOR #2584 - as it turns out, the constructor we were using earlier didn't compute a hash of the node - thus all of the comparisons would fail later when it came time to prune vector clocks for removed / downed nodes.

Looks like this would explain why some nodes were stuck leaving too, right?

@nvivo yeah, all of the above. If you had smooth sailing in the cluster and there were never any issues with nodes going unreachable this wouldn't be a big problem as the leader's vectorclock was the only one that mattered.

But as soon as unreachable events took place, every other individual member node would stamp its information on the VectorClock. That would cause the gossips to get corrupted if any of those nodes left the cluster, the end result being that in some cases older gossip appeared to be newer than the most recent gossip produced by the leader. So this affected everything that required cluster convergence. Nasty bug.
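
A rough sketch of the failure mode, under stated assumptions: `ClockNode` and `PruneDemo` are hypothetical stand-ins rather than the real `VectorClock.Node`, SHA-256 is just a placeholder hash, and the sketch assumes that peers key their clock entries by the hashed form of the member name.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical, simplified stand-in for a vector clock node identity.
public sealed class ClockNode : IEquatable<ClockNode>
{
    private readonly string _value;

    // Raw constructor: stores the name as-is, with no hashing.
    public ClockNode(string value) { _value = value; }

    // Factory: stores a hash of the name (assumed to be the form peers use).
    public static ClockNode Create(string name) => new ClockNode(Hash(name));

    private static string Hash(string name)
    {
        using (var sha = SHA256.Create())
        {
            var bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(name));
            return BitConverter.ToString(bytes).Replace("-", "");
        }
    }

    public bool Equals(ClockNode other) => other != null && _value == other._value;
    public override bool Equals(object obj) => Equals(obj as ClockNode);
    public override int GetHashCode() => _value.GetHashCode();
}

public static class PruneDemo
{
    // Returns true if pruning with `local` removes the entry that a peer
    // recorded for the same member via ClockNode.Create(name).
    public static bool PruneSucceeds(ClockNode local, string name)
    {
        var versions = new Dictionary<ClockNode, long> { [ClockNode.Create(name)] = 1 };
        return versions.Remove(local);
    }
}
```

In this sketch, `PruneDemo.PruneSucceeds(new ClockNode("member-1"), "member-1")` returns `false` because the raw name never equals its hashed form, so the stale entry survives pruning. Building the local identity with `ClockNode.Create("member-1")` makes the same call return `true`.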

Aaronontheweb · 2017-06-20T04:25:07Z

src/core/Akka.Cluster/ClusterDaemon.cs

@@ -2597,6 +2595,10 @@ protected override void OnReceive(object message)
            else if (message is InternalClusterAction.InitJoinNack) { } //that seed was uninitialized
            else if (message is ReceiveTimeout)
            {
+                if (_attempts >= 2)


Added some logging for nodes that were unable to perform a join to help troubleshoot the issue

Aaronontheweb · 2017-06-20T04:25:58Z

src/core/Akka.Cluster/VectorClock.cs

+        {
+            unchecked
+            {
+                var hashCode = 23;


Should probably cache this

Aaronontheweb · 2017-06-20T04:27:02Z

src/core/Akka.Cluster/VectorClock.cs

+                {
+                    foreach (var c in value)
+                    {
+                        _computedHashValue *= _computedHashValue * 31 + c; // using the byte value of each char


This may no longer be necessary.... was an early theory I had as to why the vector clocks didn't sort properly. Turned out to be wrong.

Upon further review, turned out to be right! Tried removing this code and re-ran the specs - ran into same issue as before.

Aaronontheweb · 2017-06-20T04:27:57Z

src/core/Akka.Tests.Shared.Internals/Helpers/FSharpDelegateHelper.cs

+    /// <summary>
+    /// Maps F# methods to C# delegates
+    /// </summary>
+    public static class FsharpDelegateHelper


Helper class I've used in a ton of other projects for converting Func<> into F# first-class functions... Useful tool when working with FsCheck and you want to use something like Gen.Map2

Aaronontheweb · 2017-06-20T04:29:53Z

@heynickc this is targeting the dev branch - looks like Mono / FAKE versioning issues caused the Mono build to blow up. Can we backport whatever change we've made to the v1.3 branch to dev also ASAP so we can get this fix validated and released?

Aaronontheweb · 2017-06-20T04:38:07Z

@heynickc FYI, also looks like the changes we've made to the TC configuration (for v1.3) have caused us to not actually run any of the unit tests on any PRs that are still going into the dev branch. This will also need to be fixed.

Aaronontheweb · 2017-06-20T04:55:45Z

Checking the logs but it looks like we still have an issue with SurviveNetworkInstabilitySpec - most of the issues have been related to the TestConductor timing something out rather than the behavior of the cluster itself, at least when I was running things locally. I'll see about improving the state of affairs there...

heynickc · 2017-06-20T15:32:00Z

@Aaronontheweb PR #2774 resolves Mono Unit Tests not running and I've altered the TeamCity configuration to run Windows Unit Tests successfully.

Aaronontheweb · 2017-06-20T20:38:41Z

ty @heynickc - looks like there's still that one outstanding issue with the F# projects building on Mono in that PR, which I think we also fixed recently on the v1.3 branch

Aaronontheweb · 2017-06-20T20:46:05Z

Also need to do API approval for Akka.Cluster before merge... Going to clean up some of the items listed here beforehand, though.

Aaronontheweb · 2017-06-20T22:11:45Z

Looks like the issue with the SurviveNetworkInstabilitySpec is actually an Akka.Remote.TestKit bug - seeing these randomly pop up in the latter stages of the tests:

[Akka.Remote.TestKit.Proto.ProtobufDecoder][Debug]Decoding EmptyByteBufferBE into Protobuf
[Akka.Remote.TestKit.MsgDecoder][Debug]Decoding 
[Akka.Remote.TestKit.Proto.ProtobufDecoder][Debug]Decoding EmptyByteBufferBE into Protobuf
[Akka.Remote.TestKit.MsgDecoder][Debug]Decoding 
[WARNING][6/20/2017 9:51:32 PM][Thread 0045][Akka.Remote.TestKit.ConductorHandler] handled network error from [::1]:64258: Exception of type 'DotNetty.Codecs.DecoderException' was thrown.    at DotNetty.Codecs.MessageToMessageDecoder`1.ChannelRead(IChannelHandlerContext context, Object message)
   at DotNetty.Transport.Channels.AbstractChannelHandlerContext.InvokeChannelRead(Object msg)
[INFO][6/20/2017 9:51:32 PM][Thread 0028][[akka://MultiNodeClusterSpec/user/TestConductorClient#1281556566]] Terminating connection to multi-node test controller due to [Akka.Actor.FSMBase+Shutdown]
[WARNING][6/20/2017 9:51:32 PM][Thread 0031][akka://MultiNodeClusterSpec/user/TestConductorClient] DeadLetter from [akka://MultiNodeClusterSpec/deadLetters] to [akka://MultiNodeClusterSpec/user/TestConductorClient#1281556566]: <Received dead letter from [akka://MultiNodeClusterSpec/deadLetters]: Akka.Remote.TestKit.ClientFSM+ConnectionFailure: Connection between [Local: [::1]:64258] and [Remote: [::1]:4711] has failed.
Cause: DotNetty.Codecs.DecoderException: Exception of type 'DotNetty.Codecs.DecoderException' was thrown. ---> System.ArgumentException: wrong message 
   at Akka.Remote.TestKit.MsgDecoder.Decode(Object message) in D:\olympus\akka.net\src\core\Akka.Remote.TestKit\MsgDecoder.cs:line 105
   at Akka.Remote.TestKit.MsgDecoder.Decode(IChannelHandlerContext context, Object message, List`1 output) in D:\olympus\akka.net\src\core\Akka.Remote.TestKit\MsgDecoder.cs:line 110
   at DotNetty.Codecs.MessageToMessageDecoder`1.ChannelRead(IChannelHandlerContext context, Object message)
   --- End of inner exception stack trace ---
   at DotNetty.Codecs.MessageToMessageDecoder`1.ChannelRead(IChannelHandlerContext context, Object message)
   at DotNetty.Transport.Channels.AbstractChannelHandlerContext.InvokeChannelRead(Object msg)
Trace:    at DotNetty.Codecs.MessageToMessageDecoder`1.ChannelRead(IChannelHandlerContext context, Object message)
   at DotNetty.Transport.Channels.AbstractChannelHandlerContext.InvokeChannelRead(Object msg)

Makes me wonder if I busted something in 1.2 with the LengthFrameEncoder we're supposed to be using to ensure that empty frames can't make it down into the protobuf decoder...

Aaronontheweb · 2017-06-21T21:05:15Z

Resolved #2015 also - there was a bug inside the ThrottleTransportAdapter that was preventing the spec from completing. Resolved that.

Aaronontheweb · 2017-06-21T21:46:17Z

Doh! Still need to do cluster API approval

Aaronontheweb · 2017-06-21T21:52:30Z

SurviveNetworkInstabilitySpec passed on its first try on CI

Aaronontheweb

Need to make some minor changes, but otherwise this is good to go

Aaronontheweb · 2017-06-21T23:05:35Z

src/core/Akka.Cluster.Tests.MultiNode/SurviveNetworkInstabilitySpec.cs

            A_Network_partition_tolerant_cluster_must_down_and_remove_quarantined_node();
-            //A_Network_partition_tolerant_cluster_must_continue_and_move_Joining_to_Up_after_downing_of_one_half();
+            A_Network_partition_tolerant_cluster_must_continue_and_move_Joining_to_Up_after_downing_of_one_half();


Enabled the rest of the spec

Aaronontheweb · 2017-06-21T23:07:05Z

src/core/Akka.Remote.TestKit/ConsoleLogger.cs

@@ -49,7 +49,7 @@ public IDisposable BeginScope<TState>(TState state)

        public void Log<TState>(LogLevel logLevel, EventId eventId, TState state, Exception exception, Func<TState, Exception, string> formatter)
        {
-            StandardOutWriter.WriteLine($"[{_name}][{logLevel}]{formatter(state, exception)}");
+            StandardOutWriter.WriteLine($"[{_name}][{logLevel}][{DateTime.UtcNow}]{formatter(state, exception)}");


Added timestamp information to the MNTR's RemoteConnection logger

Aaronontheweb · 2017-06-21T23:07:23Z

src/core/Akka.Remote.TestKit/DataTypes.cs

-    interface ICommandOp { } // messages sent from TestConductorExt to Conductor
-    interface INetworkOp { } // messages sent over the wire
-    interface IUnconfirmedClientOp : IClientOp { } // unconfirmed messages going to the Player
+    /// <summary>


Turned these into IntelliSense comments

Aaronontheweb · 2017-06-21T23:07:37Z

src/core/Akka.Remote.TestKit/Proto/ProtobufDecoder.cs

@@ -37,7 +37,9 @@ public ProtobufDecoder(IMessageLite prototype, ExtensionRegistry extensions)

        protected override void Decode(IChannelHandlerContext context, IByteBuffer input, List<object> output)
        {
-            _logger.LogDebug("Decoding {0} into Protobuf", input);
+            _logger.LogDebug("[{0} --> {1}] Decoding {2} into Protobuf", context.Channel.LocalAddress, context.Channel.RemoteAddress, input);


Added address information to the debug messages

Aaronontheweb · 2017-06-21T23:07:47Z

src/core/Akka.Remote.TestKit/Proto/ProtobufDecoder.cs

-            _logger.LogDebug("Decoding {0} into Protobuf", input);
+            _logger.LogDebug("[{0} --> {1}] Decoding {2} into Protobuf", context.Channel.LocalAddress, context.Channel.RemoteAddress, input);
+
+            // short-circuit if there are no readable bytes


Ah, I should get rid of this line actually

Aaronontheweb · 2017-06-21T23:08:16Z

src/core/Akka.Remote/Transport/ThrottleTransportAdapter.cs

@@ -547,33 +544,22 @@ private Task<SetThrottleAck> AskModeWithDeathCompletion(IActorRef target, Thrott
            if (target.IsNobody()) return Task.FromResult(SetThrottleAck.Instance);
            else
            {
-                return target.Ask<SetThrottleAck>(mode, timeout);
+                //return target.Ask<SetThrottleAck>(mode, timeout);

                //TODO: use PromiseActorRef here when implemented


Need to get rid of this TO-DO...

Aaronontheweb · 2017-06-21T23:08:35Z

src/core/Akka.Remote/Transport/ThrottleTransportAdapter.cs

-                //       return SetThrottleAck.Instance;
-                //    }
-                //}, TaskContinuationOptions.ExecuteSynchronously);
+                var internalTarget = target.AsInstanceOf<IInternalActorRef>();


Implemented the TODO, which included the fix needed to make the SurviveNetworkInstabilitySpec work as expected

alexvaluyskiy · 2017-06-22T08:32:24Z

Looks good

Aaronontheweb added 2 commits June 19, 2017 23:02

close akkadotnet#2584 - fix node stuck at joining issues

8c96df3

allowed SurviveNetworkInstabilitySpec to run all the way

removed old MBTs

d0462cd

Aaronontheweb commented Jun 20, 2017

removed unused file

8a8bc98

Aaronontheweb mentioned this pull request Jun 20, 2017

Akka.Cluster: nodes stuck at joining #2584

Closed

Aaronontheweb added 2 commits June 20, 2017 09:23

reverted back to original spec

a4e6629

verified that latter part of final stage can run correctly

90ab30c

Aaronontheweb added 3 commits June 21, 2017 13:39

cleaning up Remote.TestKit comments

99aafb3

fixed bug with ThrottleTransportAdapter

5468685

close akkadotnet#2015 - fixed SurviveNetworkInstabilitySpec

fc9eaa2

API approval changes

9ebb369

Merge branch 'dev' into fix-cluster-node-stuck

5496523

Aaronontheweb changed the title ~~[WIP] fix node stuck at joining issues~~ fix node stuck at joining issues Jun 21, 2017

Aaronontheweb added the ready label Jun 21, 2017

Aaronontheweb commented Jun 21, 2017

Aaronontheweb added 3 commits June 21, 2017 19:37

cleaned up PR

a86a9e3

cleaned up last remaining todo

5bcdb83

Merge branch 'dev' into fix-cluster-node-stuck

df2c8a8

alexvaluyskiy merged commit f794c24 into akkadotnet:dev Jun 22, 2017

alexvaluyskiy mentioned this pull request Jun 22, 2017

Port Akka.Cluster MultiNodeSpec: SurviveNetworkInstabilitySpec #2015

Closed

Aaronontheweb mentioned this pull request Jun 22, 2017

Enabled InitialHeartbeat Spec #2781

Merged

Aaronontheweb deleted the fix-cluster-node-stuck branch June 22, 2017 14:25

This was referenced Jun 22, 2017

Cluster.MemberRemoved does not always fire #2492

Closed

v1.2.1 initial release notes #2785

Merged

zbynek001 added a commit to zbynek001/akka.net that referenced this pull request Jun 23, 2017

fix node stuck at joining issues (akkadotnet#2773)

dea1d2c
