Skip to content

Conversation

hzxuzhonghu
Copy link
Member

@hzxuzhonghu hzxuzhonghu commented Mar 9, 2022

Please provide a description of this PR:

run eds cache update in serial and wait for completion

fix #37756

@hzxuzhonghu hzxuzhonghu requested a review from a team as a code owner March 9, 2022 08:00
@istio-policy-bot
Copy link

🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test?

Courtesy of your friendly test nag.

@istio-policy-bot istio-policy-bot added area/networking release-notes-none Indicates a PR that does not require release notes. labels Mar 9, 2022
@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 9, 2022
@hzxuzhonghu
Copy link
Member Author

/test unit-tests_istio

}

// edsUpdateInSerial run s.edsUpdateByKeys in serial and wait for complete.
func (s *ServiceEntryStore) edsUpdateInSerial(keys map[instancesKey]struct{}, push bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we use a queue if it's serial. Why not just call the functions directly? Can you add a comment in the code explaining this? Or even better, add a test case that fails if we do NOT use the queue (since as far as I know it passes without queue)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I saw comment from somewhere else actually

Comment on lines +477 to 479
s.mutex.Unlock()

s.edsUpdate(instances, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this change. I thought the whole point of the EDS was to ensure that we process EDS events in order. But now we release the lock then enqueue the update -- so we lose strict ordering?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order is guaranteed by enqueueing, in the function, the instances enqueue by order. So they are processed in strict order, I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But between mutex unlock and s.edsUpdate, anything can happen. In extreme scenario the thread is not live for 10s, and 100s of events are processed in this time. Then finally we wake back up and wipe out the old configs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, the service instances store can be updated to the latest state. And in edsUpdate, the eds cache is then updated to latest state. I could not think how it can not work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think I misunderstood. This is just telling it the keys it needs to update. The rest happens under a lock

@hzxuzhonghu hzxuzhonghu changed the title Fix TestEdsCache falke Fix TestEdsCache flake Mar 10, 2022
case <-s.edsQueue.Closed():
return
default:
wg.Wait()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did see this leak in tests when running locally. Not sure why - maybe a flake?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, the there is a race between wait and queue stopped. Need to fix

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after looking again I think it seems reasonable, just concerned about the leak commented.

cc @ramaraochavali can you look as well? it is complex so good to have 2 eyes

Comment on lines +477 to 479
s.mutex.Unlock()

s.edsUpdate(instances, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think I misunderstood. This is just telling it the keys it needs to update. The rest happens under a lock

@ramaraochavali
Copy link
Contributor

LGTM as well. But the method naming is very confusing - some place we update cache and some places we do actual push. I do not have better alternatives though.

@hzxuzhonghu
Copy link
Member Author

I have the same feelings about the naming. Could do refactor separately

@hzxuzhonghu
Copy link
Member Author

/test gencheck_istio

@istio-testing istio-testing merged commit 80e1666 into istio:master Mar 11, 2022
@howardjohn
Copy link
Member

@hzxuzhonghu hzxuzhonghu deleted the fix-testeds branch March 14, 2022 08:05
@hzxuzhonghu
Copy link
Member Author

fix in #37901

aryan16 pushed a commit to aryan16/istio that referenced this pull request Mar 28, 2022
* Run eds cache update synchrounously and wait for complete

* update

* Fix dead lock

* Fix goroutine leak

* make use of channel rather than waitgroup to prevent blocking
GregHanson pushed a commit to GregHanson/istio that referenced this pull request Aug 3, 2022
* Run eds cache update synchrounously and wait for complete

* update

* Fix dead lock

* Fix goroutine leak

* make use of channel rather than waitgroup to prevent blocking
@GregHanson GregHanson mentioned this pull request Aug 3, 2022
istio-testing pushed a commit that referenced this pull request Aug 4, 2022
* Fix TestEdsCache flake (#37824)

* Run eds cache update synchrounously and wait for complete

* update

* Fix dead lock

* Fix goroutine leak

* make use of channel rather than waitgroup to prevent blocking

* refactor eds functions (#37892)

* refactor eds functions

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* fix test

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* revert

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* fix

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* rename

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* rename

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* workload instance cause stale CDS clusters of type STRICT_DNS (#39947)

* workload instance cause stale CDS clusters with of type STRICT_DNS

* added release note

* fewer full push triggers

* code review comments for release note

* extend logic for DNS_ROUND_ROBIN

* update trigger reason to EndpointUpdate

* add unit tests

* resolve cherrypick conflicts

Co-authored-by: Zhonghu Xu <xuzhonghu@huawei.com>
Co-authored-by: Rama Chavali <rama.rao@salesforce.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/networking release-notes-none Indicates a PR that does not require release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestEdsCache is flaky
5 participants