-
Notifications
You must be signed in to change notification settings - Fork 5.1k
Fix refresh manager race #10727
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix refresh manager race #10727
Conversation
…time Signed-off-by: Henry Yang <hyang@lyft.com>
Signed-off-by: Henry Yang <hyang@lyft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this and fixing. This is quite hard to reason about. I'm wondering if we should just use a lock here? Would this hit for every request? Or just during redirection? If the latter maybe it's OK?
If we do keep this approach, I think we should add some specific tests using https://github.com/envoyproxy/envoy/blob/master/source/common/common/thread_synchronizer.h. See other examples that use CAS loops that use this for testing.
Alternatively, if we believe that this eventual consistency is OK, is there a way to fix the tests so they don't flake? Test only change?
/wait
@@ -90,6 +90,11 @@ bool ClusterRefreshManagerImpl::onEvent(const std::string& cluster_name, EventTy | |||
} | |||
}); | |||
return true; | |||
} else if (info->last_callback_time_ms_.load() != last_callback_time_ms) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we stick with this approach I think it would be more clear if this was a basic else statement paired with the CAS, inside a large if statement with the counter increment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we have the if counter set to 0 is in a if statement, it would also race. because if thread 1 set the counter to 0, the incr in thread 2 might not trigger the threshold. Let me rewrite this logic to make it clearer.
// If someone else updated the last callback time, then they will trigger the callback. | ||
// During this time we don't want to continue to increment the count, so we enforce the | ||
// count is 0 | ||
*count = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the comment above, should this be a decrement to undo the increment? Or is that not safe since it might also race?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decrement can race as well. If the decrement is called after count got set to 0 in another thread, this would set the counter to -1.
Signed-off-by: Henry Yang <hyang@lyft.com>
Signed-off-by: Henry Yang <hyang@lyft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, few more question/comment.
/wait
postCallBack = true; | ||
} | ||
|
||
// If a callback should be triggered(in this or some other threads) signaled by the changed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/threads/thread
|
||
// If a callback should be triggered(in this or some other threads) signaled by the changed | ||
// last callback time, we reset the count to 0 | ||
if (postCallBack || info->last_callback_time_ms_.load() != last_callback_time_ms) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still a little confused here by this logic so I think it needs some more comments. What is the exact sequence that leads to post_callback being false but the times being not equal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, let me add those to the comments
Signed-off-by: Henry Yang <hyang@lyft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this makes sense. Great comments!
redis: fix refresh manager race (envoyproxy#10727)
Description: Enforce the count is 0 if one of the other threads have reset the callback time.
Risk Level: Low
Testing:
Docs Changes:
Release Notes:
Fixes #10384