-
Notifications
You must be signed in to change notification settings - Fork 8.1k
Fix TestEdsCache flake #37824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix TestEdsCache flake #37824
Conversation
🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test? Courtesy of your friendly test nag. |
/test unit-tests_istio |
} | ||
|
||
// edsUpdateInSerial run s.edsUpdateByKeys in serial and wait for complete. | ||
func (s *ServiceEntryStore) edsUpdateInSerial(keys map[instancesKey]struct{}, push bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we use a queue if it's serial. Why not just call the functions directly? Can you add a comment in the code explaining this? Or even better, add a test case that fails if we do NOT use the queue (since as far as I know it passes without queue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I saw comment from somewhere else actually
s.mutex.Unlock() | ||
|
||
s.edsUpdate(instances, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand this change. I thought the whole point of the EDS was to ensure that we process EDS events in order. But now we release the lock then enqueue the update -- so we lose strict ordering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order is guaranteed by enqueueing, in the function, the instances enqueue by order. So they are processed in strict order, I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But between mutex unlock and s.edsUpdate, anything can happen. In extreme scenario the thread is not live for 10s, and 100s of events are processed in this time. Then finally we wake back up and wipe out the old configs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, the service instances store can be updated to the latest state. And in edsUpdate
, the eds cache is then updated to latest state. I could not think how it can not work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I think I misunderstood. This is just telling it the keys it needs to update. The rest happens under a lock
case <-s.edsQueue.Closed(): | ||
return | ||
default: | ||
wg.Wait() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did see this leak in tests when running locally. Not sure why - maybe a flake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, the there is a race between wait and queue stopped. Need to fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after looking again I think it seems reasonable, just concerned about the leak commented.
cc @ramaraochavali can you look as well? it is complex so good to have 2 eyes
s.mutex.Unlock() | ||
|
||
s.edsUpdate(instances, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I think I misunderstood. This is just telling it the keys it needs to update. The rest happens under a lock
LGTM as well. But the method naming is very confusing - some place we update cache and some places we do actual push. I do not have better alternatives though. |
I have the same feelings about the naming. Could do refactor separately |
/test gencheck_istio |
I think this broke the benchmark https://storage.googleapis.com/istio-prow/logs/benchmark-report_istio_postsubmit/1502305181302263808/artifacts/benchmark-log.txt |
fix in #37901 |
* Run eds cache update synchrounously and wait for complete * update * Fix dead lock * Fix goroutine leak * make use of channel rather than waitgroup to prevent blocking
* Run eds cache update synchrounously and wait for complete * update * Fix dead lock * Fix goroutine leak * make use of channel rather than waitgroup to prevent blocking
* Fix TestEdsCache flake (#37824) * Run eds cache update synchrounously and wait for complete * update * Fix dead lock * Fix goroutine leak * make use of channel rather than waitgroup to prevent blocking * refactor eds functions (#37892) * refactor eds functions Signed-off-by: Rama Chavali <rama.rao@salesforce.com> * fix test Signed-off-by: Rama Chavali <rama.rao@salesforce.com> * revert Signed-off-by: Rama Chavali <rama.rao@salesforce.com> * fix Signed-off-by: Rama Chavali <rama.rao@salesforce.com> * rename Signed-off-by: Rama Chavali <rama.rao@salesforce.com> * rename Signed-off-by: Rama Chavali <rama.rao@salesforce.com> * workload instance cause stale CDS clusters of type STRICT_DNS (#39947) * workload instance cause stale CDS clusters with of type STRICT_DNS * added release note * fewer full push triggers * code review comments for release note * extend logic for DNS_ROUND_ROBIN * update trigger reason to EndpointUpdate * add unit tests * resolve cherrypick conflicts Co-authored-by: Zhonghu Xu <xuzhonghu@huawei.com> Co-authored-by: Rama Chavali <rama.rao@salesforce.com>
Please provide a description of this PR:
run eds cache update in serial and wait for completion
fix #37756