-
Notifications
You must be signed in to change notification settings - Fork 8.1k
Make racetest green - Fixed data races and flakiness #6625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6625 +/- ##
=======================================
- Coverage 68% 68% -<1%
=======================================
Files 361 361
Lines 31493 31586 +93
=======================================
- Hits 21329 21308 -21
- Misses 9305 9412 +107
- Partials 859 866 +7
Continue to review full report at Codecov.
|
/assign @andraxylia |
Cc @costinm |
Racetest still fails |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@costinm should provide final review here. This looks fine overall, just one question and two nits.
pilot/pkg/proxy/envoy/v2/ads.go
Outdated
@@ -645,6 +652,8 @@ func (con *XdsConnection) send(res *xdsapi.DiscoveryResponse) error { | |||
err := con.stream.Send(res) | |||
done <- err | |||
if res.Nonce != "" { | |||
con.mutex.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this now, is the con.stream.Send
actually safe? gRPC streams cannot be written to concurrently, so if this function is racing to set the field in the con struct, it's likely also racing to write to the stream.
Either way, rather than defering the unlock I prefer just wrapping the switch in the lock/unlock:
if res.Nonce != "" {
con.mutex.Lock()
switch res.TypeUrl {
// ...
}
con.mutex.Unlock()
}
(total nit, not blocking)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZackButcher I appreciate any comment even if just nits.
The mutex is protection against r/w of the struct fields as detected by the racetest tool. I don't mind moving the mutex lock to before calling the Send but I tend to avoid keeping locks when doing some lengthy I/O operations and would actually expect the Send itself to be protected by its implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's my point: the gRPC stream implementation is not safe for concurrent writes. See the documentation. Specifically: "But it is not safe to call SendMsg on the same stream in different goroutines."
You can read and write concurrently from the same stream in different threads, but you cannot have two concurrent writers (or two concurrent readers, which race to receive the message).
pilot/pkg/proxy/envoy/v2/ads.go
Outdated
@@ -209,6 +209,9 @@ type XdsConnection struct { | |||
|
|||
// Time of last push failure. | |||
LastPushFailure time.Time | |||
|
|||
// mutex protects changes to this XDS connection | |||
mutex sync.RWMutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put the mutex before the fields it guards; if the mutex guards all fields in the struct then the mutex should be the first field in the struct. And nit, please name it m
or mu
or similar, rather than mutex
(which IMO is overly long given its context/usage).
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ymesika Assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I hope you don't mind that I "reused" this PR for fixing other issues raised when running the Fixes included are:
Sometimes all racetest tests are passing but it still exits with error code 2. If you have an idea why please let me know. Example. Due to the nature of race testing it is reasonable that the test is flaky. I will keep watching this test and spot reasons for failures in other PRs where it fails. cc @hklai |
@ymesika: The following test failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Fingers crossed that racetest remains green! |
To be honest I now believe that racetest and Slack are mutual exclusive so expect it to fail again now that Slack is up. |
* Revert "Remove v2 transition commands since everything is now v2 (#6665)" This reverts commit 6339eb6. * Revert "Pilot param clusterRegistriesNamespace should default to pilot namespace (#6446)" This reverts commit b9294f7. * Revert "iptable just "return" by uid as the parameter u indicates (#6561)" This reverts commit 22a0b88. * Revert "Remove node agent service, residue from flexvolume driver. (#6651)" This reverts commit db3da82. * Revert "Continuously reapply galley CA bundle to prevent overwrite (#6599)" This reverts commit f9e8fd8. * Revert "Do not count typeConfigs if it is error. (#6527)" This reverts commit eb1de31. * Revert "Make racetest green - Fixed data races and flakiness (#6625)" This reverts commit 30b8ecb. * Revert "Improve push squashing (#6641)" This reverts commit 399cd2d.
Adds to #6730 |
* Avoid data race in ADS * One more protection * More accurate place to gain lock * Switched to RWMutex and protected relevant r/w locations * Review comments addressed * File wasn't included in previous commit * Avoid reading conn while it still being updated (data race) * Make sure service is created before creating endpoints for it * Resolving r/w data race on refreshTicker by using a close channel instead
* Revert "Remove v2 transition commands since everything is now v2 (istio#6665)" This reverts commit 6339eb6. * Revert "Pilot param clusterRegistriesNamespace should default to pilot namespace (istio#6446)" This reverts commit b9294f7. * Revert "iptable just "return" by uid as the parameter u indicates (istio#6561)" This reverts commit 22a0b88. * Revert "Remove node agent service, residue from flexvolume driver. (istio#6651)" This reverts commit db3da82. * Revert "Continuously reapply galley CA bundle to prevent overwrite (istio#6599)" This reverts commit f9e8fd8. * Revert "Do not count typeConfigs if it is error. (istio#6527)" This reverts commit eb1de31. * Revert "Make racetest green - Fixed data races and flakiness (istio#6625)" This reverts commit 30b8ecb. * Revert "Improve push squashing (istio#6641)" This reverts commit 399cd2d.
Raised by the
racetest
test.