Skip to content

Keep the refuseUid in a better way, #22156 #23617

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 25, 2017

Conversation

patriknw
Copy link
Contributor

  • The scenario described in the issue can cause the quarantine marker to
    be lost when creating a new endpoint for that address. Then when later
    creating another endpoint from an inbound connection the uid is considered
    confirmed and Ack message is accepted, triggering the unexpected seq number
    issue.
  • The refuseUid was kept in the endpoint policy markers, but that is just very
    complicated and as illustrated by this issue not always safe.
  • Instead, keep the refuseUid separately so it's not lost when registering
    new endpoint.
  • The purpose of WasGated was only to try to keep the refuseUid (as far as I know),
    and that is not needed any longer.

Refs #22156

@akka-ci akka-ci added validating PR is currently being validated by Jenkins needs-attention Indicates a PR validation failure (set by CI infrastructure) and removed validating PR is currently being validated by Jenkins labels Sep 11, 2017
@akka-ci
Copy link

akka-ci commented Sep 11, 2017

Test FAILed.

@jrudolph
Copy link
Contributor

What's the consequence of the bug fix? It seems the bug lead to other nodes quarantining a just restarted node so that it could impact e.g. rolling upgrades. Would this now be fixed?

var uidConfirmed: Boolean = uid.isDefined && (uid != refuseUid)

if (uid.isDefined && (uid == refuseUid))
throw new HopelessAssociation(localAddress, remoteAddress, uid,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, did not realise we have that type. Makes sense

@patriknw
Copy link
Contributor Author

@jrudolph The bug was more about that it could talk to (accept Ack message) from a node that was already quarantined. If it would have been restarted the new incarnation would have a different uid and then it is ok to talk to that address (and sys msg buffers would not be out of sync).

@jrudolph
Copy link
Contributor

If it would have been restarted the new incarnation would have a different uid and then it is ok to talk to that address (and sys msg buffers would not be out of sync).

Mmh, the original ticket said

When node A is restarted and comes back with the new UID to node B, node B has endpoint actor state corrupted:

So, I wonder if that scenario is actually fixed? Or, was the description inaccurate in the first place? Or, did the error message fixed here just correlate with another problem we haven't yet understood?

AFAIR from our other internal reports about this issue there were real functional problems observed where one or more nodes could not talk to one that was recently restarted.

@patriknw
Copy link
Contributor Author

@jrudolph ah, I didn't read the original issue much, I focused on the scenario that we had a reproducer for, that I described here: #22156 (comment)

We don't have much to go on for the original description so I don't think we will be able to reproduce that if it really involved restart. I suggest that we rename the ticket, and create a a new one if that original scenario is reported again. ok?

@akka-ci akka-ci added validating PR is currently being validated by Jenkins tested PR that was successfully built and tested by Jenkins and removed needs-attention Indicates a PR validation failure (set by CI infrastructure) validating PR is currently being validated by Jenkins labels Sep 11, 2017
@akka-ci
Copy link

akka-ci commented Sep 11, 2017

Test PASSed.

@jrudolph
Copy link
Contributor

We don't have much to go on for the original description so I don't think we will be able to reproduce that if it really involved restart. I suggest that we rename the ticket, and create a a new one if that original scenario is reported again. ok?

Yes, makes sense.

@patriknw patriknw added the 2 - pick next Used to mark issues which are next up in the queue to be worked on. The tag is non-binding label Sep 25, 2017
Copy link
Contributor

@jrudolph jrudolph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (let's see if the original issue will be reported again and under which circumstances)

Copy link
Contributor

@johanandren johanandren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, not 100% sure I understand the consequences, but from the description the changes make sense to me.

* The scenario described in the issue can cause the quarantine marker to
  be lost when creating a new endpoint for that address. Then when later
  creating another endpoint from an inbound connection the uid is considered
  confirmed and Ack message is accepted, triggering the unexpected seq number
  issue.
* The refuseUid was kept in the endpoint policy markers, but that is just very
  complicated and as illustrated by this issue not always safe.
* Instead, keep the refuseUid separately so it's not lost when registering
  new endpoint.
* The purpose of WasGated was only to try to keep the refuseUid (as far as I know),
  and that is not needed any longer.

mima filter
@patriknw patriknw force-pushed the wip-22156-lost-refuseUid-patriknw branch from edef0aa to d660f89 Compare September 25, 2017 14:25
@patriknw
Copy link
Contributor Author

good, I have rebased and will merge after validation

@akka-ci akka-ci added validating PR is currently being validated by Jenkins tested PR that was successfully built and tested by Jenkins and removed tested PR that was successfully built and tested by Jenkins validating PR is currently being validated by Jenkins labels Sep 25, 2017
@akka-ci
Copy link

akka-ci commented Sep 25, 2017

Test PASSed.

@patriknw patriknw merged commit 6085f78 into master Sep 25, 2017
@patriknw patriknw deleted the wip-22156-lost-refuseUid-patriknw branch September 25, 2017 15:58
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this pull request Apr 25, 2019
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this pull request Apr 25, 2019
Aaronontheweb added a commit to akkadotnet/akka.net that referenced this pull request Apr 30, 2019
* completed port of Properly_quarantine_stashed_inbound_connections

* porting akka/akka#23617

* modified AssociationHandle to allow explicit DEBUG logging of disassociation reasons

* final pass of EndpointManager updates from akka/akka#23617

* fixed issue with EndpointRegistry quarantine management

* updated the reliable delivery supervisor

* fixed bug when handling inbound associations

* cleaning up AkkaProtocolState actors

* porting over OutboundConnection timeout spec

* ported InboundTimeout spec

* approved new Akka.Remote API additions
madmonkey pushed a commit to madmonkey/akka.net that referenced this pull request Jul 12, 2019
* completed port of Properly_quarantine_stashed_inbound_connections

* porting akka/akka#23617

* modified AssociationHandle to allow explicit DEBUG logging of disassociation reasons

* final pass of EndpointManager updates from akka/akka#23617

* fixed issue with EndpointRegistry quarantine management

* updated the reliable delivery supervisor

* fixed bug when handling inbound associations

* cleaning up AkkaProtocolState actors

* porting over OutboundConnection timeout spec

* ported InboundTimeout spec

* approved new Akka.Remote API additions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - pick next Used to mark issues which are next up in the queue to be worked on. The tag is non-binding tested PR that was successfully built and tested by Jenkins
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants