Skip to content

Conversation

HenryYYang
Copy link
Contributor

Description: Enforce the count is 0 if one of the other threads have reset the callback time.
Risk Level: Low
Testing:
Docs Changes:
Release Notes:
Fixes #10384

…time

Signed-off-by: Henry Yang <hyang@lyft.com>
Signed-off-by: Henry Yang <hyang@lyft.com>
@HenryYYang HenryYYang requested a review from mattklein123 as a code owner April 9, 2020 20:27
@mattklein123 mattklein123 self-assigned this Apr 9, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this and fixing. This is quite hard to reason about. I'm wondering if we should just use a lock here? Would this hit for every request? Or just during redirection? If the latter maybe it's OK?

If we do keep this approach, I think we should add some specific tests using https://github.com/envoyproxy/envoy/blob/master/source/common/common/thread_synchronizer.h. See other examples that use CAS loops that use this for testing.

Alternatively, if we believe that this eventual consistency is OK, is there a way to fix the tests so they don't flake? Test only change?

/wait

@@ -90,6 +90,11 @@ bool ClusterRefreshManagerImpl::onEvent(const std::string& cluster_name, EventTy
}
});
return true;
} else if (info->last_callback_time_ms_.load() != last_callback_time_ms) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we stick with this approach I think it would be more clear if this was a basic else statement paired with the CAS, inside a large if statement with the counter increment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we have the if counter set to 0 is in a if statement, it would also race. because if thread 1 set the counter to 0, the incr in thread 2 might not trigger the threshold. Let me rewrite this logic to make it clearer.

// If someone else updated the last callback time, then they will trigger the callback.
// During this time we don't want to continue to increment the count, so we enforce the
// count is 0
*count = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the comment above, should this be a decrement to undo the increment? Or is that not safe since it might also race?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decrement can race as well. If the decrement is called after count got set to 0 in another thread, this would set the counter to -1.

Signed-off-by: Henry Yang <hyang@lyft.com>
Signed-off-by: Henry Yang <hyang@lyft.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, few more question/comment.

/wait

postCallBack = true;
}

// If a callback should be triggered(in this or some other threads) signaled by the changed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/threads/thread


// If a callback should be triggered(in this or some other threads) signaled by the changed
// last callback time, we reset the count to 0
if (postCallBack || info->last_callback_time_ms_.load() != last_callback_time_ms) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a little confused here by this logic so I think it needs some more comments. What is the exact sequence that leads to post_callback being false but the times being not equal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let me add those to the comments

Signed-off-by: Henry Yang <hyang@lyft.com>
Signed-off-by: Henry Yang <hyang@lyft.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this makes sense. Great comments!

@mattklein123 mattklein123 merged commit d8e6b40 into envoyproxy:master Apr 11, 2020
weiwei02 added a commit to DailyC/envoy that referenced this pull request Apr 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flake: ClusterRefreshManagerTest.BasicFailureEvents
2 participants