Conversation

Aaronontheweb
Member

close #2584
allowed SurviveNetworkInstabilitySpec to run all the way

Member Author

@Aaronontheweb left a comment

Left some comments detailing the changes in this PR

@@ -43,7 +44,7 @@ public SurviveNetworkInstabilitySpecConfig()
Seventh = Role("seventh");
Eighth = Role("eighth");

- CommonConfig = DebugConfig(false)
+ CommonConfig = DebugConfig(true)
Member Author

This is still a WIP for the moment - had some massive issues with Windows Defender going berserk and eating up 100% of CPU when I ran this locally; hoping the build server sheds some better light on it.

@@ -135,7 +136,7 @@ private void AssertCanTalk(params RoleName[] alive)
{
foreach (var to in alive)
{
- var sel = Sys.ActorSelection(Node(to) / "user" / "echo");
+ var sel = Sys.ActorSelection(new RootActorPath(GetAddress(to)) / "user" / "echo");
Member Author

GetAddress uses caching under the hood, so we can save ourselves some network calls here
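
For reference, a minimal sketch (not the spec's exact code; the method name and the ping/echo exchange are illustrative) of the check this loop drives, with the target resolved through the cached address:

```csharp
// Hypothetical sketch of the assertion: every surviving node pings the "echo" actor on
// every other surviving node, resolving the target via the locally cached address.
private void AssertCanTalkSketch(params RoleName[] alive)
{
    RunOn(() =>
    {
        foreach (var to in alive)
        {
            // GetAddress(to) is served from a local cache, so no extra round-trips
            var sel = Sys.ActorSelection(new RootActorPath(GetAddress(to)) / "user" / "echo");
            sel.Tell("ping");
            ExpectMsg("ping");
        }
    }, alive);
}
```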


namespace Akka.Cluster.Tests
{
public class ClusterGenerators
Member Author

Could technically delete this class - used it for some FsCheck tests I deleted as they were no longer needed

var a = VectorClock.Create().Increment(node1).Increment(node2);
var b = a.Prune(node2).Increment(node1); // remove node2, increment node1

a.CompareTo(b).Should().Be(VectorClock.Ordering.Concurrent);
Member Author

Verifying a remove scenario explicitly
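
As a rough sketch (hypothetical test name; assumes xUnit and FluentAssertions, which the cluster test suite already uses), the complete test reads roughly like this:

```csharp
[Fact]
public void VectorClock_must_compare_as_concurrent_after_pruning_a_node()
{
    var node1 = VectorClock.Node.Create("node1");
    var node2 = VectorClock.Node.Create("node2");

    var a = VectorClock.Create().Increment(node1).Increment(node2);
    var b = a.Prune(node2).Increment(node1); // remove node2, increment node1

    // once node2's entry has been pruned, neither clock dominates the other
    a.CompareTo(b).Should().Be(VectorClock.Ordering.Concurrent);
}
```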

@@ -1015,7 +1015,7 @@ public ClusterCoreDaemon(IActorRef publisher)
_cluster = Cluster.Get(Context.System);
_publisher = publisher;
SelfUniqueAddress = _cluster.SelfUniqueAddress;
- _vclockNode = new VectorClock.Node(VclockName(SelfUniqueAddress));
+ _vclockNode = VectorClock.Node.Create(VclockName(SelfUniqueAddress));
Member Author

THIS IS THE FIX FOR #2584 - as it turns out, the constructor we were using earlier didn't compute a hash of the node - thus all of the comparisons would fail later when it came time to prune vector clocks for removed / downed nodes.
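
A simplified, hypothetical reconstruction of the failure mode (not the real VectorClock.Node internals): if equality keys off a hash that the bare constructor never computes, a node built with `new` never matches the logically identical node built with `Create`, so pruning by key silently fails to drop the entry:

```csharp
// Hypothetical, stripped-down stand-in for VectorClock.Node, only to illustrate the bug.
public sealed class NodeSketch
{
    private readonly string _value;
    private readonly int _computedHashValue;

    public NodeSketch(string value) : this(value, 0) { }   // old path: hash never computed

    private NodeSketch(string value, int hash)
    {
        _value = value;
        _computedHashValue = hash;
    }

    public static NodeSketch Create(string value)           // fixed path: hash computed up front
    {
        unchecked
        {
            var hash = 23;
            foreach (var c in value)
                hash = hash * 31 + c;
            return new NodeSketch(value, hash);
        }
    }

    // Comparisons key off the computed hash, so a constructor-built node (hash == 0)
    // never equals a Create-built node with the same name - and pruning by key fails.
    public override int GetHashCode() => _computedHashValue;
    public override bool Equals(object obj) =>
        obj is NodeSketch other && other._computedHashValue == _computedHashValue;
}
```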

Contributor

Looks like this would explain why some nodes were stuck leaving too, right?

Member Author

@Aaronontheweb Jun 20, 2017

@nvivo yeah, all of the above. If you had smooth sailing in the cluster and there were never any issues with nodes going unreachable, this wouldn't be a big problem, as the leader's vector clock was the only one that mattered.

But as soon as unreachable events took place, every other individual member node would stamp its information onto the VectorClock. That would cause the gossip to get corrupted if any of those nodes left the cluster, the end result being that in some cases older gossip appeared to be newer than the most recent gossip produced by the leader. So this affected everything that required cluster convergence. Nasty bug.

@@ -2597,6 +2595,10 @@ protected override void OnReceive(object message)
else if (message is InternalClusterAction.InitJoinNack) { } //that seed was uninitialized
else if (message is ReceiveTimeout)
{
if (_attempts >= 2)
Member Author

Added some logging for nodes that were unable to perform a join to help troubleshoot the issue
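
A hypothetical sketch of what that logging looks like (field names, message text, and timeout are assumptions, not the real join-retry code): warn once the node has retried the join a couple of times without an answer.

```csharp
using System;
using Akka.Actor;
using Akka.Event;

// Hypothetical, stripped-down join-retry actor: count attempts and warn after a few
// unanswered tries so a node stuck at joining is visible in the logs.
public class JoinSeedNodeSketch : UntypedActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();
    private readonly Address[] _seeds;
    private int _attempts;

    public JoinSeedNodeSketch(Address[] seeds)
    {
        _seeds = seeds;
        Context.SetReceiveTimeout(TimeSpan.FromSeconds(5)); // retry cadence (assumed value)
    }

    protected override void OnReceive(object message)
    {
        if (message is ReceiveTimeout)
        {
            if (_attempts >= 2)
                _log.Warning("Couldn't join seed nodes after [{0}] attempts, will try again. Seed nodes: [{1}]",
                    _attempts, string.Join(", ", _seeds));
            _attempts++;
            // ...re-send the InitJoin probe to each seed node here...
        }
    }
}
```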

{
unchecked
{
var hashCode = 23;
Member Author

Should probably cache this
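
For instance, a common pattern here (a sketch only, not the PR's code) is to compute the hash lazily on first use and reuse it, which is safe because the underlying value is immutable:

```csharp
// Sketch: cache an expensive hash after the first GetHashCode call.
public sealed class CachedHashExample
{
    private readonly string _value;
    private int _cachedHashCode; // 0 means "not computed yet"; we never store 0 as a real hash

    public CachedHashExample(string value) => _value = value;

    public override int GetHashCode()
    {
        if (_cachedHashCode == 0)
        {
            unchecked
            {
                var hashCode = 23;
                foreach (var c in _value)
                    hashCode = hashCode * 31 + c;
                _cachedHashCode = hashCode == 0 ? 1 : hashCode; // avoid colliding with the sentinel
            }
        }
        return _cachedHashCode;
    }
}
```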

{
foreach (var c in value)
{
_computedHashValue *= _computedHashValue * 31 + c; // using the byte value of each char
Member Author

This may no longer be necessary... it was an early theory I had as to why the vector clocks didn't sort properly. Turned out to be wrong.

Member Author

Upon further review, it turned out to be right! Tried removing this code and re-ran the specs - ran into the same issue as before.

/// <summary>
/// Maps F# methods to C# delegates
/// </summary>
public static class FsharpDelegateHelper
Member Author

Helper class I've used in a ton of other projects for converting Func<> delegates into F# function values (FSharpFunc)... a useful tool when working with FsCheck and you want to use something like Gen.Map2.
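
Roughly what such a helper looks like (a sketch; assumes FSharp.Core's FSharpFunc<_,_>.FromConverter, and the real class may differ):

```csharp
using System;
using Microsoft.FSharp.Core;

// Sketch: convert C# Func<> delegates into (curried) F# function values so they can be
// handed to FsCheck combinators such as Gen.Map2.
public static class FsharpDelegateHelperSketch
{
    public static FSharpFunc<T, TResult> Create<T, TResult>(Func<T, TResult> func)
    {
        return FSharpFunc<T, TResult>.FromConverter(x => func(x));
    }

    // Two-argument delegates have to be curried: Func<T1, T2, TResult> becomes T1 -> (T2 -> TResult).
    public static FSharpFunc<T1, FSharpFunc<T2, TResult>> Create<T1, T2, TResult>(Func<T1, T2, TResult> func)
    {
        return Create<T1, FSharpFunc<T2, TResult>>(a => Create<T2, TResult>(b => func(a, b)));
    }
}
```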

@Aaronontheweb
Member Author

@heynickc this is targeting the dev branch - looks like Mono / FAKE versioning issues caused the Mono build to blow up. Can we backport whatever change we've made to the v1.3 branch to dev also ASAP so we can get this fix validated and released?

@Aaronontheweb
Member Author

Aaronontheweb commented Jun 20, 2017

@heynickc FYI, also looks like the changes we've made to the TC configuration (for v1.3) have caused us to not actually run any of the unit tests on any PRs that are still going into the dev branch. This will also need to be fixed.

@Aaronontheweb
Member Author

Checking the logs but it looks like we still have an issue with SurviveNetworkInstabilitySpec - most of the issues have been related to the TestConductor timing something out rather than the behavior of the cluster itself, at least when I was running things locally. I'll see about improving the state of affairs there...

@heynickc
Contributor

@Aaronontheweb PR #2774 resolves Mono Unit Tests not running and I've altered the TeamCity configuration to run Windows Unit Tests successfully.

@Aaronontheweb
Member Author

ty @heynickc - looks like there's still that one outstanding issue with the F# projects building on Mono in that PR, which I think we also fixed recently on the v1.3 branch

@Aaronontheweb
Member Author

Also need to do API approval for Akka.Cluster before merge... Going to clean up some of the items listed here beforehand though.

@Aaronontheweb
Member Author

Looks like the issue with the SurviveNetworkInstabilitySpec is actually an Akka.Remote.TestKit bug - seeing these randomly pop up in the latter stages of the tests:

[Akka.Remote.TestKit.Proto.ProtobufDecoder][Debug]Decoding EmptyByteBufferBE into Protobuf
[Akka.Remote.TestKit.MsgDecoder][Debug]Decoding 
[Akka.Remote.TestKit.Proto.ProtobufDecoder][Debug]Decoding EmptyByteBufferBE into Protobuf
[Akka.Remote.TestKit.MsgDecoder][Debug]Decoding 
[WARNING][6/20/2017 9:51:32 PM][Thread 0045][Akka.Remote.TestKit.ConductorHandler] handled network error from [::1]:64258: Exception of type 'DotNetty.Codecs.DecoderException' was thrown.    at DotNetty.Codecs.MessageToMessageDecoder`1.ChannelRead(IChannelHandlerContext context, Object message)
   at DotNetty.Transport.Channels.AbstractChannelHandlerContext.InvokeChannelRead(Object msg)
[INFO][6/20/2017 9:51:32 PM][Thread 0028][[akka://MultiNodeClusterSpec/user/TestConductorClient#1281556566]] Terminating connection to multi-node test controller due to [Akka.Actor.FSMBase+Shutdown]
[WARNING][6/20/2017 9:51:32 PM][Thread 0031][akka://MultiNodeClusterSpec/user/TestConductorClient] DeadLetter from [akka://MultiNodeClusterSpec/deadLetters] to [akka://MultiNodeClusterSpec/user/TestConductorClient#1281556566]: <Received dead letter from [akka://MultiNodeClusterSpec/deadLetters]: Akka.Remote.TestKit.ClientFSM+ConnectionFailure: Connection between [Local: [::1]:64258] and [Remote: [::1]:4711] has failed.
Cause: DotNetty.Codecs.DecoderException: Exception of type 'DotNetty.Codecs.DecoderException' was thrown. ---> System.ArgumentException: wrong message 
   at Akka.Remote.TestKit.MsgDecoder.Decode(Object message) in D:\olympus\akka.net\src\core\Akka.Remote.TestKit\MsgDecoder.cs:line 105
   at Akka.Remote.TestKit.MsgDecoder.Decode(IChannelHandlerContext context, Object message, List`1 output) in D:\olympus\akka.net\src\core\Akka.Remote.TestKit\MsgDecoder.cs:line 110
   at DotNetty.Codecs.MessageToMessageDecoder`1.ChannelRead(IChannelHandlerContext context, Object message)
   --- End of inner exception stack trace ---
   at DotNetty.Codecs.MessageToMessageDecoder`1.ChannelRead(IChannelHandlerContext context, Object message)
   at DotNetty.Transport.Channels.AbstractChannelHandlerContext.InvokeChannelRead(Object msg)
Trace:    at DotNetty.Codecs.MessageToMessageDecoder`1.ChannelRead(IChannelHandlerContext context, Object message)
   at DotNetty.Transport.Channels.AbstractChannelHandlerContext.InvokeChannelRead(Object msg)

Makes me wonder if I busted something in 1.2 with the LengthFrameEncoder we're supposed to be using to ensure that empty frames can't make it down into the protobuf decoder...
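
For reference, the kind of framing meant here (a sketch using DotNetty's stock handlers; handler names, the 4-byte length field, and the max frame size are illustrative, not the TestKit's actual configuration):

```csharp
using DotNetty.Codecs;
using DotNetty.Transport.Channels;

// Sketch: length-prefix framing ahead of the protobuf stages, so the decoder only ever
// sees complete, non-empty payloads instead of stray empty frames.
public static class FramingSketch
{
    public static void Configure(IChannel channel)
    {
        var pipeline = channel.Pipeline;
        // inbound: reassemble frames from the 4-byte length prefix, then strip the prefix
        pipeline.AddLast("frameDecoder", new LengthFieldBasedFrameDecoder(10000, 0, 4, 0, 4));
        // outbound: prepend a 4-byte length field to every outgoing message
        pipeline.AddLast("frameEncoder", new LengthFieldPrepender(4));
        // ...the protobuf decoder/encoder and MsgDecoder/MsgEncoder go after these...
    }
}
```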

@Aaronontheweb
Member Author

Resolved #2015 also - there was a bug inside the ThrottleTransportAdapter that was preventing the spec from completing.

@Aaronontheweb
Member Author

Doh! Still need to do cluster API approval

@Aaronontheweb
Member Author

SurviveNetworkInstabilitySpec passed on its first try on CI

@Aaronontheweb changed the title from "[WIP] fix node stuck at joining issues" to "fix node stuck at joining issues" on Jun 21, 2017
Member Author

@Aaronontheweb left a comment

Need to make some minor changes, but otherwise this is good to go

A_Network_partition_tolerant_cluster_must_down_and_remove_quarantined_node();
- //A_Network_partition_tolerant_cluster_must_continue_and_move_Joining_to_Up_after_downing_of_one_half();
+ A_Network_partition_tolerant_cluster_must_continue_and_move_Joining_to_Up_after_downing_of_one_half();
Member Author

Enabled the rest of the spec

@@ -49,7 +49,7 @@ public IDisposable BeginScope<TState>(TState state)

public void Log<TState>(LogLevel logLevel, EventId eventId, TState state, Exception exception, Func<TState, Exception, string> formatter)
{
- StandardOutWriter.WriteLine($"[{_name}][{logLevel}]{formatter(state, exception)}");
+ StandardOutWriter.WriteLine($"[{_name}][{logLevel}][{DateTime.UtcNow}]{formatter(state, exception)}");
Member Author

Added timestamp information to the MNTR's RemoteConnection logger

interface ICommandOp { } // messages sent from TestConductorExt to Conductor
interface INetworkOp { } // messages sent over the wire
interface IUnconfirmedClientOp : IClientOp { } // unconfirmed messages going to the Player
/// <summary>
Member Author

Turned these into IntelliSense (XML doc) comments
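
That is, the trailing // notes become XML doc comments, e.g. (wording lifted from the originals):

```csharp
/// <summary>
/// Messages sent from the TestConductorExt to the Conductor.
/// </summary>
interface ICommandOp { }

/// <summary>
/// Messages sent over the wire.
/// </summary>
interface INetworkOp { }
```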

@@ -37,7 +37,9 @@ public ProtobufDecoder(IMessageLite prototype, ExtensionRegistry extensions)

protected override void Decode(IChannelHandlerContext context, IByteBuffer input, List<object> output)
{
_logger.LogDebug("Decoding {0} into Protobuf", input);
_logger.LogDebug("[{0} --> {1}] Decoding {2} into Protobuf", context.Channel.LocalAddress, context.Channel.RemoteAddress, input);
Member Author

Added address information to the debug messages

_logger.LogDebug("Decoding {0} into Protobuf", input);
_logger.LogDebug("[{0} --> {1}] Decoding {2} into Protobuf", context.Channel.LocalAddress, context.Channel.RemoteAddress, input);

// short-circuit if there are no readable bytes
Member Author

Ah, I should get rid of this line actually
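
For context, the short-circuit in question looks roughly like this (a sketch continuing the Decode method shown above; DotNetty's IByteBuffer exposes IsReadable()):

```csharp
protected override void Decode(IChannelHandlerContext context, IByteBuffer input, List<object> output)
{
    _logger.LogDebug("[{0} --> {1}] Decoding {2} into Protobuf", context.Channel.LocalAddress, context.Channel.RemoteAddress, input);

    // short-circuit if there are no readable bytes - an empty frame would otherwise
    // blow up the protobuf parser further down the pipeline
    if (!input.IsReadable())
        return;

    // ...parse the protobuf message from the readable bytes and add it to output...
}
```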

@@ -547,33 +544,22 @@ private Task<SetThrottleAck> AskModeWithDeathCompletion(IActorRef target, Thrott
if (target.IsNobody()) return Task.FromResult(SetThrottleAck.Instance);
else
{
- return target.Ask<SetThrottleAck>(mode, timeout);
+ //return target.Ask<SetThrottleAck>(mode, timeout);

//TODO: use PromiseActorRef here when implemented
Member Author

Need to get rid of this TO-DO...

// return SetThrottleAck.Instance;
// }
//}, TaskContinuationOptions.ExecuteSynchronously);
var internalTarget = target.AsInstanceOf<IInternalActorRef>();
Member Author

Implemented the TODO, which included the fix needed to make the SurviveNetworkInstabilitySpec work as expected
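
The death-watch-completion pattern looks roughly like this (a sketch only; PromiseActorRef.Apply, the Watch system message, and the Result task are recalled from Akka.NET internals and may differ in detail from the actual fix):

```csharp
// Sketch of "ask with death-watch completion": if the throttled transport's manager actor
// dies before replying, the watch completes the promise with Terminated instead of
// leaving the caller hanging until the Ask times out.
private static Task<SetThrottleAck> AskModeWithDeathCompletionSketch(IActorRef target, ThrottleMode mode, TimeSpan timeout)
{
    if (target.IsNobody())
        return Task.FromResult(SetThrottleAck.Instance);

    var internalTarget = target.AsInstanceOf<IInternalActorRef>();
    var promiseRef = PromiseActorRef.Apply(internalTarget.Provider, timeout, target, mode.GetType().Name);

    // complete the promise if the target terminates before answering
    internalTarget.SendSystemMessage(new Watch(internalTarget, promiseRef));
    target.Tell(mode, promiseRef);

    return promiseRef.Result.ContinueWith(
        t => t.Result is Terminated ? SetThrottleAck.Instance : (SetThrottleAck)t.Result,
        TaskContinuationOptions.ExecuteSynchronously);
}
```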

@alexvaluyskiy
Contributor

Looks good

@alexvaluyskiy alexvaluyskiy merged commit f794c24 into akkadotnet:dev Jun 22, 2017
Successfully merging this pull request may close these issues.

Akka.Cluster: nodes stuck at joining