Fix refresh manager race #10727

HenryYYang · 2020-04-09T20:27:50Z

Description: Enforce that the count is 0 if another thread has reset the callback time.
Risk Level: Low
Testing:
Docs Changes:
Release Notes:
Fixes #10384

Signed-off-by: Henry Yang <hyang@lyft.com>

mattklein123

Thanks for looking into this and fixing it. This is quite hard to reason about. I'm wondering if we should just use a lock here? Would this be hit on every request, or just during redirection? If the latter, maybe it's OK?

If we do keep this approach, I think we should add some specific tests using https://github.com/envoyproxy/envoy/blob/master/source/common/common/thread_synchronizer.h. See other examples that use CAS loops that use this for testing.

Alternatively, if we believe that this eventual consistency is OK, is there a way to fix the tests so they don't flake? Test only change?

/wait
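For comparison, the lock-based alternative raised above could look roughly like the sketch below. It is not the Envoy implementation: `LockedRefreshState`, `onEventLocked`, and all parameter names are hypothetical, and the callback is invoked directly rather than posted to a dispatcher.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical stand-in for the per-cluster refresh state (not the Envoy class).
struct LockedRefreshState {
  std::mutex mu;
  uint32_t count = 0;            // events seen since the last callback
  uint64_t last_callback_ms = 0; // 0 means the callback has never fired
};

// Records one event. Returns true if this call fired the callback.
bool onEventLocked(LockedRefreshState& s, uint64_t now_ms, uint32_t threshold,
                   uint64_t min_interval_ms, const std::function<void()>& cb) {
  bool fire = false;
  {
    std::lock_guard<std::mutex> guard(s.mu);
    // Ignore events that arrive too soon after the previous callback.
    if (s.last_callback_ms != 0 && now_ms < s.last_callback_ms + min_interval_ms) {
      return false;
    }
    // The increment, threshold check, reset, and timestamp update happen
    // under one lock, so no other thread can observe a partial update.
    if (++s.count >= threshold) {
      s.count = 0;
      s.last_callback_ms = now_ms;
      fire = true;
    }
  }
  if (fire) {
    cb(); // invoked outside the lock to keep the critical section short
  }
  return fire;
}
```

The trade-off is that every event that passes the interval check takes the mutex, which is why it matters whether this path runs per request or only on redirection.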

mattklein123 · 2020-04-09T23:07:14Z

source/extensions/common/redis/cluster_refresh_manager_impl.cc

@@ -90,6 +90,11 @@ bool ClusterRefreshManagerImpl::onEvent(const std::string& cluster_name, EventTy
          }
        });
        return true;
+      } else if (info->last_callback_time_ms_.load() != last_callback_time_ms) {


If we stick with this approach I think it would be more clear if this was a basic else statement paired with the CAS, inside a large if statement with the counter increment.

HenryYYang · 2020-04-10

I think putting the reset of the counter to 0 inside an if statement would also race: if thread 1 sets the counter to 0, the increment in thread 2 might not reach the threshold. Let me rewrite this logic to make it clearer.

mattklein123 · 2020-04-09T23:07:44Z

source/extensions/common/redis/cluster_refresh_manager_impl.cc

+        // If someone else updated the last callback time, then they will trigger the callback.
+        // During this time we don't want to continue to increment the count, so we enforce the
+        // count is 0
+        *count = 0;


Based on the comment above, should this be a decrement to undo the increment? Or is that not safe since it might also race?

HenryYYang · 2020-04-10

A decrement can race as well. If the decrement runs after another thread has set the count to 0, it would set the counter to -1.
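To make that interleaving concrete, here is a small single-threaded replay of it. This is only an illustration: `replayLoserUndo` is a hypothetical helper, and it assumes an unsigned 32-bit counter, so the "-1" shows up as a wrap to `UINT32_MAX`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Replays one interleaving of two threads on a single thread so the result is
// deterministic. `undo_with_decrement` selects what thread 2 does after it
// notices that thread 1 already updated the callback time.
uint32_t replayLoserUndo(bool undo_with_decrement) {
  std::atomic<uint32_t> count{0};
  ++count;   // thread 1 increments
  ++count;   // thread 2 increments
  count = 0; // thread 1 wins the CAS and resets the count
  if (undo_with_decrement) {
    --count; // thread 2 undoes its increment: 0 - 1 wraps around
  } else {
    count = 0; // thread 2 enforces 0: idempotent, safe in any order
  }
  return count.load();
}
```

With the decrement, the counter ends at `UINT32_MAX` and no longer reflects the number of events since the reset; storing 0 gives the same result no matter how the two threads interleave.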

Signed-off-by: Henry Yang <hyang@lyft.com>

mattklein123

Thanks, a few more questions/comments.

/wait

mattklein123 · 2020-04-10T17:55:53Z

source/extensions/common/redis/cluster_refresh_manager_impl.cc

+        postCallBack = true;
+      }
+
+      // If a callback should be triggered(in this or some other threads) signaled by the changed


s/threads/thread

mattklein123 · 2020-04-10T18:05:57Z

source/extensions/common/redis/cluster_refresh_manager_impl.cc

+
+      // If a callback should be triggered(in this or some other threads) signaled by the changed
+      // last callback time, we reset the count to 0
+      if (postCallBack || info->last_callback_time_ms_.load() != last_callback_time_ms) {


I'm still a little confused here by this logic so I think it needs some more comments. What is the exact sequence that leads to post_callback being false but the times being not equal?

HenryYYang · 2020-04-10

OK, let me add that to the comments.
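One sequence that produces a false `postCallBack` with unequal times, sketched outside Envoy: thread B reads the callback time, thread A then wins the CAS, and B continues with its stale value. `RefreshState` and `recordEvent` below are hypothetical names, and `observed_ms` is passed in explicitly only so the stale read can be replayed deterministically.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-cluster state (not the Envoy class).
struct RefreshState {
  std::atomic<uint32_t> count{0};
  std::atomic<uint64_t> last_callback_ms{0};
};

// `observed_ms` is the callback time this thread read before counting the
// event. Assumes `now_ms` differs from `observed_ms`, so a changed time
// signals that some thread triggered the callback.
// Returns true if the calling thread should post the callback.
bool recordEvent(RefreshState& s, uint64_t observed_ms, uint64_t now_ms,
                 uint32_t threshold) {
  bool post_callback = false;
  // Use a copy: compare_exchange_strong overwrites `expected` on failure.
  uint64_t expected = observed_ms;
  if (++s.count >= threshold &&
      s.last_callback_ms.compare_exchange_strong(expected, now_ms)) {
    post_callback = true; // this thread won the right to post the callback
  }
  // Reset if this thread triggers the callback, or if the time moved since
  // it was read, meaning another thread triggered it in the meantime.
  if (post_callback || s.last_callback_ms.load() != observed_ms) {
    s.count = 0;
  }
  return post_callback;
}
```

Replaying the sequence: A calls `recordEvent(s, 0, 100, 1)` and posts the callback; B, which read 0 before A's CAS, calls `recordEvent(s, 0, 101, 5)`. B does not post, but it sees the time is now 100 rather than 0 and resets the count instead of leaving its increment behind.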

Signed-off-by: Henry Yang <hyang@lyft.com>

mattklein123

Thanks, this makes sense. Great comments!

redis: fix refresh manager race (envoyproxy#10727)

HenryYYang added 2 commits April 9, 2020 13:17

Enforce the count is 0 if one of the threads have reset the callback …

e57cae5

…time Signed-off-by: Henry Yang <hyang@lyft.com>

format fix

6718fd4

Signed-off-by: Henry Yang <hyang@lyft.com>

HenryYYang requested a review from mattklein123 as a code owner April 9, 2020 20:27

mattklein123 self-assigned this Apr 9, 2020

mattklein123 requested changes Apr 9, 2020

repokitteh-read-only bot added the waiting label Apr 9, 2020

rewrite to make it clearer

d8dacc1

Signed-off-by: Henry Yang <hyang@lyft.com>

repokitteh-read-only bot removed the waiting label Apr 10, 2020

fix format

d07c777

Signed-off-by: Henry Yang <hyang@lyft.com>

mattklein123 requested changes Apr 10, 2020

repokitteh-read-only bot added the waiting label Apr 10, 2020

Add comment

252d2a5

Signed-off-by: Henry Yang <hyang@lyft.com>

repokitteh-read-only bot removed the waiting label Apr 10, 2020

fix typo

ff55a52

Signed-off-by: Henry Yang <hyang@lyft.com>

mattklein123 approved these changes Apr 11, 2020

mattklein123 merged commit d8e6b40 into envoyproxy:master Apr 11, 2020

weiwei02 added a commit to DailyC/envoy that referenced this pull request Apr 13, 2020

Merge pull request #2 from envoyproxy/master

1f18677

redis: fix refresh manager race (envoyproxy#10727)
