Fix TestEdsCache flake #37824

hzxuzhonghu · 2022-03-09T08:00:20Z

Please provide a description of this PR:

Run the EDS cache update serially and wait for it to complete.

istio-policy-bot · 2022-03-09T08:00:22Z

🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test?

Courtesy of your friendly test nag.

hzxuzhonghu · 2022-03-09T09:40:37Z

/test unit-tests_istio

howardjohn · 2022-03-09T14:51:47Z

pilot/pkg/serviceregistry/serviceentry/servicediscovery.go

+}
+
+// edsUpdateInSerial run s.edsUpdateByKeys in serial and wait for complete.
+func (s *ServiceEntryStore) edsUpdateInSerial(keys map[instancesKey]struct{}, push bool) {


I don't understand why we use a queue if it's serial. Why not just call the functions directly? Can you add a comment in the code explaining this? Or even better, add a test case that fails if we do NOT use the queue (as far as I know, it passes without the queue).

OK, I actually saw a comment explaining this somewhere else.
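The reason for keeping the queue is not spelled out in this thread. One plausible reading, offered as an assumption, is that routing the waited update through the same single-worker queue keeps it ordered behind updates that were enqueued earlier. The sketch below illustrates that pattern; `serialQueue`, `push`, `runAndWait`, and `applyUpdates` are hypothetical names, not the Istio implementation.

```go
package main

import "fmt"

// task is a unit of work executed by the queue worker.
type task func()

// serialQueue is a hypothetical stand-in for the store's EDS queue:
// a single worker goroutine runs tasks strictly in push order.
type serialQueue struct {
	tasks chan task
}

func newSerialQueue() *serialQueue {
	q := &serialQueue{tasks: make(chan task, 16)}
	go func() {
		// The worker exits once the tasks channel is closed.
		for t := range q.tasks {
			t()
		}
	}()
	return q
}

// push enqueues a task without waiting for it to run.
func (q *serialQueue) push(t task) { q.tasks <- t }

// runAndWait enqueues a task and blocks until the worker has executed it.
// With a single worker, every task pushed earlier has finished by then.
func (q *serialQueue) runAndWait(t task) {
	done := make(chan struct{})
	q.push(func() {
		t()
		close(done)
	})
	<-done
}

// applyUpdates pushes n asynchronous updates followed by one waited update
// and returns the order in which the worker ran them.
func applyUpdates(n int) []int {
	q := newSerialQueue()
	defer close(q.tasks) // stop the worker so the goroutine does not leak

	var order []int // written only by the worker, read after <-done
	for i := 0; i < n; i++ {
		i := i
		q.push(func() { order = append(order, i) })
	}
	q.runAndWait(func() { order = append(order, n) })
	return order
}

func main() {
	fmt.Println(applyUpdates(3)) // [0 1 2 3]
}
```

Calling the final update directly instead of through `runAndWait` could let it run before the earlier queued updates, which is the kind of reordering a shared queue avoids.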

howardjohn · 2022-03-09T18:17:50Z

pilot/pkg/serviceregistry/serviceentry/servicediscovery.go

+	s.mutex.Unlock()

 	s.edsUpdate(instances, true)


Not sure I understand this change. I thought the whole point of the EDS was to ensure that we process EDS events in order. But now we release the lock then enqueue the update -- so we lose strict ordering?

Ordering is guaranteed by the enqueueing: inside the function, the instances are enqueued in order, so I think they are processed in strict order.

But between the mutex unlock and s.edsUpdate, anything can happen. In an extreme scenario the thread is not scheduled for 10s, and hundreds of events are processed in that time. Then we finally wake back up and wipe out the old configs?

In that case, the service instances store has already been updated to the latest state, and edsUpdate then updates the EDS cache to that latest state. I can't see how it would fail to work.

ok, I think I misunderstood. This is just telling it the keys it needs to update. The rest happens under a lock
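The conclusion of this thread can be sketched as follows: the update that runs after the unlock carries only keys, and the actual state is re-read under the lock when the update executes. This is a minimal sketch under that assumption; the `store` type, its fields, and `setInstances` are hypothetical simplifications, and this `edsUpdateByKeys` only borrows the name mentioned in the diff.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical, simplified stand-in for ServiceEntryStore.
type store struct {
	mu        sync.Mutex
	instances map[string][]string // source of truth: key -> endpoint addresses
	edsCache  map[string][]string // key -> endpoints last published to EDS
}

func newStore() *store {
	return &store{
		instances: map[string][]string{},
		edsCache:  map[string][]string{},
	}
}

// setInstances mutates the source of truth under the lock and returns
// only the set of keys that changed, not a snapshot of the data.
func (s *store) setInstances(key string, addrs []string) map[string]struct{} {
	s.mu.Lock()
	s.instances[key] = addrs
	s.mu.Unlock()
	return map[string]struct{}{key: {}}
}

// edsUpdateByKeys re-reads the current instances for each key under the
// lock, so a delayed call publishes the latest state, not a stale one.
func (s *store) edsUpdateByKeys(keys map[string]struct{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for k := range keys {
		s.edsCache[k] = append([]string(nil), s.instances[k]...)
	}
}

func main() {
	s := newStore()
	stale := s.setInstances("svc", []string{"10.0.0.1"})
	// A newer event lands before the first update gets to run.
	fresh := s.setInstances("svc", []string{"10.0.0.2"})
	s.edsUpdateByKeys(fresh)
	s.edsUpdateByKeys(stale) // late, but it carries only keys

	fmt.Println(s.edsCache["svc"]) // [10.0.0.2]
}
```

Because the late call holds no data of its own, the delay described above costs a redundant refresh rather than a rollback to old endpoints.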

howardjohn · 2022-03-10T01:49:52Z

pilot/pkg/serviceregistry/serviceentry/servicediscovery.go

+	case <-s.edsQueue.Closed():
+		return
+	default:
+		wg.Wait()


I did see this leak in tests when running locally. Not sure why - maybe a flake?

Maybe there is a race between the wait and the queue being stopped. Need to fix.
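As I read the snippet above, the `Closed()` check is non-blocking, so if the queue stops after the check but before the task runs, `wg.Wait()` may never return. A later commit in this PR is titled "make use of channel rather than waitgroup to prevent blocking". The sketch below shows the general shape of such a fix, not the merged code: wait on a completion channel and the closed channel in one blocking `select`. The function `waitOrClosed` and both channel parameters are hypothetical.

```go
package main

import "fmt"

// waitOrClosed blocks until either the task signals completion on done
// or the queue reports shutdown on closed. It returns true only when the
// task completed. Both channels are expected to be signalled by close().
func waitOrClosed(done, closed <-chan struct{}) bool {
	select {
	case <-done:
		return true
	case <-closed:
		return false
	}
}

func main() {
	// Case 1: the task runs and signals completion.
	done := make(chan struct{})
	closed := make(chan struct{})
	go func() { close(done) }()
	fmt.Println(waitOrClosed(done, closed)) // true

	// Case 2: the queue was stopped before the task ran, so done is
	// never closed. The caller still returns instead of hanging.
	done2 := make(chan struct{})
	closed2 := make(chan struct{})
	close(closed2)
	fmt.Println(waitOrClosed(done2, closed2)) // false
}
```

Unlike a `sync.WaitGroup`, a channel can take part in a `select`, which is what lets the waiter give up when the queue shuts down.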

howardjohn

After looking again, I think it seems reasonable; I'm just concerned about the leak I commented on.

cc @ramaraochavali can you look as well? It is complex, so it is good to have two sets of eyes.

ramaraochavali · 2022-03-10T09:42:13Z

LGTM as well. But the method naming is very confusing: in some places we update the cache and in others we do the actual push. I do not have better alternatives though.

hzxuzhonghu · 2022-03-11T01:41:37Z

I have the same feeling about the naming. We could do the refactor separately.

hzxuzhonghu · 2022-03-11T09:27:53Z

/test gencheck_istio

howardjohn · 2022-03-11T21:25:20Z

I think this broke the benchmark https://storage.googleapis.com/istio-prow/logs/benchmark-report_istio_postsubmit/1502305181302263808/artifacts/benchmark-log.txt

hzxuzhonghu · 2022-03-14T08:33:53Z

fix in #37901

* Run eds cache update synchrounously and wait for complete
* update
* Fix dead lock
* Fix goroutine leak
* make use of channel rather than waitgroup to prevent blocking

* Fix TestEdsCache flake (#37824)
* Run eds cache update synchrounously and wait for complete
* update
* Fix dead lock
* Fix goroutine leak
* make use of channel rather than waitgroup to prevent blocking
* refactor eds functions (#37892)
* refactor eds functions
  Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
* fix test
  Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
* revert
  Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
* fix
  Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
* rename
  Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
* rename
  Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
* workload instance cause stale CDS clusters of type STRICT_DNS (#39947)
* workload instance cause stale CDS clusters with of type STRICT_DNS
* added release note
* fewer full push triggers
* code review comments for release note
* extend logic for DNS_ROUND_ROBIN
* update trigger reason to EndpointUpdate
* add unit tests
* resolve cherrypick conflicts

Co-authored-by: Zhonghu Xu <xuzhonghu@huawei.com>
Co-authored-by: Rama Chavali <rama.rao@salesforce.com>

Run eds cache update synchrounously and wait for complete

87556b8

hzxuzhonghu requested a review from a team as a code owner March 9, 2022 08:00

istio-policy-bot added the area/networking and release-notes-none labels Mar 9, 2022

istio-testing added the size/M label Mar 9, 2022

hzxuzhonghu added 2 commits March 9, 2022 16:32

update

22752a8

Fix dead lock

7f1058a

Fix goroutine leak

1ee49ba

howardjohn reviewed Mar 9, 2022

hzxuzhonghu changed the title from "Fix TestEdsCache falke" to "Fix TestEdsCache flake" Mar 10, 2022

howardjohn reviewed Mar 10, 2022

make use of channel rather than waitgroup to prevent blocking

0727557

ramaraochavali approved these changes Mar 11, 2022

istio-testing merged commit 80e1666 into istio:master Mar 11, 2022

hzxuzhonghu deleted the fix-testeds branch March 14, 2022 08:05

GregHanson mentioned this pull request Aug 3, 2022

Cherry pick 39947 #40260

Merged
