Skip to content

Conversation

ymesika
Copy link
Member

@ymesika ymesika commented Jun 26, 2018

Raised by the racetest test.

@codecov
Copy link

codecov bot commented Jun 26, 2018

Codecov Report

Merging #6625 into master will decrease coverage by 1%.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #6625    +/-   ##
=======================================
- Coverage      68%     68%   -<1%     
=======================================
  Files         361     361            
  Lines       31493   31586    +93     
=======================================
- Hits        21329   21308    -21     
- Misses       9305    9412   +107     
- Partials      859     866     +7
Impacted Files Coverage Δ
pilot/pkg/model/jwks_resolver.go 68% <100%> (+1%) ⬆️
pilot/pkg/proxy/envoy/v2/ads.go 76% <100%> (+1%) ⬆️
pilot/pkg/proxy/envoy/v2/debug.go 33% <100%> (+1%) ⬆️
mixer/pkg/protobuf/yaml/dynamic/handler.go 77% <100%> (ø) ⬆️
...olarwinds/internal/papertrail/papertrail_logger.go 62% <0%> (-18%) ⬇️
pilot/pkg/config/memory/monitor.go 82% <0%> (-9%) ⬇️
mixer/adapter/servicecontrol/distValueBuilder.go 92% <0%> (-3%) ⬇️
pilot/pkg/serviceregistry/kube/queue.go 86% <0%> (-3%) ⬇️
mixer/adapter/fluentd/fluentd.go 74% <0%> (-1%) ⬇️
pilot/pkg/serviceregistry/kube/controller.go 68% <0%> (-1%) ⬇️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b12fef5...228ffde. Read the comment docs.

@ymesika ymesika changed the title Avoid data race in ADS [WIP] Avoid data race in ADS Jun 26, 2018
@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Jun 26, 2018
@ymesika ymesika changed the title [WIP] Avoid data race in ADS Avoid data race in ADS Jun 26, 2018
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Jun 26, 2018
@ymesika
Copy link
Member Author

ymesika commented Jun 26, 2018

/assign @andraxylia

@rshriram
Copy link
Member

Cc @costinm

@rshriram
Copy link
Member

Racetest still fails

Copy link
Contributor

@ZackButcher ZackButcher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@costinm should provide final review here. This looks fine overall, just one question and two nits.

@@ -645,6 +652,8 @@ func (con *XdsConnection) send(res *xdsapi.DiscoveryResponse) error {
err := con.stream.Send(res)
done <- err
if res.Nonce != "" {
con.mutex.Lock()
Copy link
Contributor

@ZackButcher ZackButcher Jun 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this now, is the con.stream.Send actually safe? gRPC streams cannot be written to concurrently, so if this function is racing to set the field in the con struct, it's likely also racing to write to the stream.

Either way, rather than defering the unlock I prefer just wrapping the switch in the lock/unlock:

if res.Nonce != "" {
    con.mutex.Lock()
    switch res.TypeUrl {
        // ...
    }
    con.mutex.Unlock()
}

(total nit, not blocking)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZackButcher I appreciate any comment even if just nits.
The mutex is protection against r/w of the struct fields as detected by the racetest tool. I don't mind moving the mutex lock to before calling the Send but I tend to avoid keeping locks when doing some lengthy I/O operations and would actually expect the Send itself to be protected by its implementation.

Copy link
Contributor

@ZackButcher ZackButcher Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's my point: the gRPC stream implementation is not safe for concurrent writes. See the documentation. Specifically: "But it is not safe to call SendMsg on the same stream in different goroutines."

You can read and write concurrently from the same stream in different threads, but you cannot have two concurrent writers (or two concurrent readers, which race to receive the message).

@@ -209,6 +209,9 @@ type XdsConnection struct {

// Time of last push failure.
LastPushFailure time.Time

// mutex protects changes to this XDS connection
mutex sync.RWMutex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put the mutex before the fields it guards; if the mutex guards all fields in the struct then the mutex should be the first field in the struct. And nit, please name it m or mu or similar, rather than mutex (which IMO is overly long given its context/usage).

@istio-testing
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ymesika
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approvers: andraxylia, geeknoid

Assign the PR to them by writing /assign @andraxylia @geeknoid in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ymesika ymesika changed the title Avoid data race in ADS Make racetest green - Fixed data races and flakiness Jun 27, 2018
@ymesika
Copy link
Member Author

ymesika commented Jun 27, 2018

I hope you don't mind that I "reused" this PR for fixing other issues raised when running the racetest. Please review.

Fixes included are:

  • Data races in Envoy v2 ADS
  • Avoid printing (reading) GRPC connection details while it's still being used (data race)
  • Flakiness due to creating endpoints before related service is ready
  • Data race on the time ticker of Pilot's JWKS resolver

Sometimes all racetest tests are passing but it still exits with error code 2. If you have an idea why please let me know. Example.

Due to the nature of race testing it is reasonable that the test is flaky. I will keep watching this test and spot reasons for failures in other PRs where it fails.

cc @hklai

@istio-testing
Copy link
Collaborator

istio-testing commented Jun 27, 2018

@ymesika: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
prow/e2e-dashboard.sh 228ffde link /test e2e-dashboard

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@rshriram
Copy link
Member

Fingers crossed that racetest remains green!

@rshriram rshriram merged commit 30b8ecb into istio:master Jun 27, 2018
@ymesika
Copy link
Member Author

ymesika commented Jun 27, 2018

To be honest I now believe that racetest and Slack are mutual exclusive so expect it to fail again now that Slack is up.

hklai added a commit that referenced this pull request Jun 27, 2018
hklai added a commit that referenced this pull request Jun 27, 2018
* Revert "Remove v2 transition commands since everything is now v2 (#6665)"

This reverts commit 6339eb6.

* Revert "Pilot param clusterRegistriesNamespace should default to pilot namespace (#6446)"

This reverts commit b9294f7.

* Revert "iptable just "return" by uid as the parameter u indicates (#6561)"

This reverts commit 22a0b88.

* Revert "Remove node agent service, residue from flexvolume driver. (#6651)"

This reverts commit db3da82.

* Revert "Continuously reapply galley CA bundle to prevent overwrite (#6599)"

This reverts commit f9e8fd8.

* Revert "Do not count typeConfigs if it is error. (#6527)"

This reverts commit eb1de31.

* Revert "Make racetest green - Fixed data races and flakiness (#6625)"

This reverts commit 30b8ecb.

* Revert "Improve push squashing (#6641)"

This reverts commit 399cd2d.
@ymesika
Copy link
Member Author

ymesika commented Jun 29, 2018

Adds to #6730

@ymesika ymesika deleted the adsDataRace branch June 29, 2018 09:21
quanjielin pushed a commit to quanjielin/istio that referenced this pull request Jul 2, 2018
* Avoid data race in ADS

* One more protection

* More accurate place to gain lock

* Switched to RWMutex and protected relevant r/w locations

* Review comments addressed

* File wasn't included in previous commit

* Avoid reading conn while it still being updated (data race)

* Make sure service is created before creating endpoints for it

* Resolving r/w data race on refreshTicker by using a close channel instead
quanjielin pushed a commit to quanjielin/istio that referenced this pull request Jul 2, 2018
* Revert "Remove v2 transition commands since everything is now v2 (istio#6665)"

This reverts commit 6339eb6.

* Revert "Pilot param clusterRegistriesNamespace should default to pilot namespace (istio#6446)"

This reverts commit b9294f7.

* Revert "iptable just "return" by uid as the parameter u indicates (istio#6561)"

This reverts commit 22a0b88.

* Revert "Remove node agent service, residue from flexvolume driver. (istio#6651)"

This reverts commit db3da82.

* Revert "Continuously reapply galley CA bundle to prevent overwrite (istio#6599)"

This reverts commit f9e8fd8.

* Revert "Do not count typeConfigs if it is error. (istio#6527)"

This reverts commit eb1de31.

* Revert "Make racetest green - Fixed data races and flakiness (istio#6625)"

This reverts commit 30b8ecb.

* Revert "Improve push squashing (istio#6641)"

This reverts commit 399cd2d.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants