-
Notifications
You must be signed in to change notification settings - Fork 1.1k
fix node stuck at joining issues #2773
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix node stuck at joining issues #2773
Conversation
allowed SurviveNetworkInstabilitySpec to run all the way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments detailing the changes in this PR
@@ -43,7 +44,7 @@ public SurviveNetworkInstabilitySpecConfig() | |||
Seventh = Role("seventh"); | |||
Eighth = Role("eighth"); | |||
|
|||
CommonConfig = DebugConfig(false) | |||
CommonConfig = DebugConfig(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still a WIP for the moment - had some massive issues with Windows Defender going beserk and eating up 100% of CPU when I ran this locally; hoping the build server sheds some better light on it.
@@ -135,7 +136,7 @@ private void AssertCanTalk(params RoleName[] alive) | |||
{ | |||
foreach (var to in alive) | |||
{ | |||
var sel = Sys.ActorSelection(Node(to) / "user" / "echo"); | |||
var sel = Sys.ActorSelection(new RootActorPath(GetAddress(to)) / "user" / "echo"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetAddress
uses caching under the hood, so we can save ourselves some network calls here
|
||
namespace Akka.Cluster.Tests | ||
{ | ||
public class ClusterGenerators |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could technically delete this class - used it for some FsCheck tests I deleted as they were no longer needed
var a = VectorClock.Create().Increment(node1).Increment(node2); | ||
var b = a.Prune(node2).Increment(node1); // remove node2, increment node1 | ||
|
||
a.CompareTo(b).Should().Be(VectorClock.Ordering.Concurrent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verifying a remove scenario explicitly
@@ -1015,7 +1015,7 @@ public ClusterCoreDaemon(IActorRef publisher) | |||
_cluster = Cluster.Get(Context.System); | |||
_publisher = publisher; | |||
SelfUniqueAddress = _cluster.SelfUniqueAddress; | |||
_vclockNode = new VectorClock.Node(VclockName(SelfUniqueAddress)); | |||
_vclockNode = VectorClock.Node.Create(VclockName(SelfUniqueAddress)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THIS IS THE FIX FOR #2584 - as it turns out, the constructor we were using earlier didn't compute a hash of the node - thus all of the comparisons would fail later when it came time to prune vector clocks for removed / downed nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this would explain why some nodes were stuck leaving too, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nvivo yeah, all of the above. If you had smooth sailing in the cluster and there were never any issues with nodes going unreachable this wouldn't be a big problem as the leader's vectorclock was the only one that mattered.
But as soon as unreachable events took place then every other individual member node would stamp their information on the VectorClock
, that would cause the gossips to get corrupted if any of those nodes left the cluster, the end result being that in some cases older gossip appeared to be newer than the most recent gossip produced by the leader. So this affected everything that required cluster convergence. Nasty bug.
@@ -2597,6 +2595,10 @@ protected override void OnReceive(object message) | |||
else if (message is InternalClusterAction.InitJoinNack) { } //that seed was uninitialized | |||
else if (message is ReceiveTimeout) | |||
{ | |||
if (_attempts >= 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some logging for nodes that were unable to perform a join to help troubleshoot the issue
{ | ||
unchecked | ||
{ | ||
var hashCode = 23; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably cache this
{ | ||
foreach (var c in value) | ||
{ | ||
_computedHashValue *= _computedHashValue * 31 + c; // using the byte value of each char |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may no longer be necessary.... was an early theory I had as to why the vector clocks didn't sort properly. Turned out to be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further review, turned out to be right! Tried removing this code and re-ran the specs - ran into same issue as before.
/// <summary> | ||
/// Maps F# methods to C# delegates | ||
/// </summary> | ||
public static class FsharpDelegateHelper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Helper class I've used in a ton of other projects for converting Func<>
into F# first-order functions... Useful tool when working with FsCheck and you want to use something like Gen.Map2
@heynickc this is targeting the |
@heynickc FYI, also looks like the changes we've made to the TC configuration (for |
Checking the logs but it looks like we still have an issue with |
@Aaronontheweb PR #2774 resolves Mono Unit Tests not running and I've altered the TeamCity configuration to run Windows Unit Tests successfully. |
ty @heynickc - looks like there's still that one outstanding issue with the F# projects building on Mono in that PR, which I think we also fixed recently on the |
Also need to do API approval for Akka.Cluster before merge... Going to clean up some of the items listed here beforehand though, |
Looks like the issue with the
Makes we wonder if I busted something in 1.2 with the |
Resolved #2015 also - there was a bug inside the |
Doh! Still need to do cluster API approval |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to make some minor changes, but otherwise this is good to go
A_Network_partition_tolerant_cluster_must_down_and_remove_quarantined_node(); | ||
//A_Network_partition_tolerant_cluster_must_continue_and_move_Joining_to_Up_after_downing_of_one_half(); | ||
A_Network_partition_tolerant_cluster_must_continue_and_move_Joining_to_Up_after_downing_of_one_half(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enabled the rest of the spec
@@ -49,7 +49,7 @@ public IDisposable BeginScope<TState>(TState state) | |||
|
|||
public void Log<TState>(LogLevel logLevel, EventId eventId, TState state, Exception exception, Func<TState, Exception, string> formatter) | |||
{ | |||
StandardOutWriter.WriteLine($"[{_name}][{logLevel}]{formatter(state, exception)}"); | |||
StandardOutWriter.WriteLine($"[{_name}][{logLevel}][{DateTime.UtcNow}]{formatter(state, exception)}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added timestamp information to the MNTR's RemoteConnection
logger
interface ICommandOp { } // messages sent from TestConductorExt to Conductor | ||
interface INetworkOp { } // messages sent over the wire | ||
interface IUnconfirmedClientOp : IClientOp { } // unconfirmed messages going to the Player | ||
/// <summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turned these into intellisense commends
@@ -37,7 +37,9 @@ public ProtobufDecoder(IMessageLite prototype, ExtensionRegistry extensions) | |||
|
|||
protected override void Decode(IChannelHandlerContext context, IByteBuffer input, List<object> output) | |||
{ | |||
_logger.LogDebug("Decoding {0} into Protobuf", input); | |||
_logger.LogDebug("[{0} --> {1}] Decoding {2} into Protobuf", context.Channel.LocalAddress, context.Channel.RemoteAddress, input); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added address information to the debug messages
_logger.LogDebug("Decoding {0} into Protobuf", input); | ||
_logger.LogDebug("[{0} --> {1}] Decoding {2} into Protobuf", context.Channel.LocalAddress, context.Channel.RemoteAddress, input); | ||
|
||
// short-circuit if there are no readable bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I should get rid of this line actually
@@ -547,33 +544,22 @@ private Task<SetThrottleAck> AskModeWithDeathCompletion(IActorRef target, Thrott | |||
if (target.IsNobody()) return Task.FromResult(SetThrottleAck.Instance); | |||
else | |||
{ | |||
return target.Ask<SetThrottleAck>(mode, timeout); | |||
//return target.Ask<SetThrottleAck>(mode, timeout); | |||
|
|||
//TODO: use PromiseActorRef here when implemented |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to get rid of this TO-DO...
// return SetThrottleAck.Instance; | ||
// } | ||
//}, TaskContinuationOptions.ExecuteSynchronously); | ||
var internalTarget = target.AsInstanceOf<IInternalActorRef>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented the TODO, which included the fix needed to make the SurviveNetworkInstabilitySpec
work as expected
Looks good |
close #2584
allowed SurviveNetworkInstabilitySpec to run all the way