Skip to content

Conversation

andraxylia
Copy link
Contributor

No description provided.

@andraxylia andraxylia requested a review from a team February 14, 2018 04:31
@andraxylia andraxylia requested a review from costinm February 14, 2018 04:31
@istio-merge-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
We suggest the following additional approver: myidpt

Assign the PR to them by writing /assign @myidpt in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

Copy link
Member

@ldemailly ldemailly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, does it change the results for #1041 ?
(maybe patch the 2 together in your local branch - or a test pr)

@@ -46,4 +47,4 @@
]
}
]
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not important/nit) it's strange that just that one doesn't have a newline suddenly ?
I suppose it must mean the newlines are ignored for purpose of pass/fail ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is what gets generated.

@andraxylia
Copy link
Contributor Author

@ldemailly yes, I would like to patch and test, this is why I asked you to rebase that PR

Copy link
Member

@rshriram rshriram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please dont do this.. We have considered this option and decided to avoid this as it will not solve the 404 problem fully. If anything, it will convert from 404 to 503 as the clusters will be there but won't have endpoints until the next eds call.

Let me run through this once more to see if things have changed..

cc @kyessenov ..

@andraxylia
Copy link
Contributor Author

I checked with Envoy and this is what we are supposed to do, this is the correct Envoy behaviour. We should have done this already. Do you have some arguments to support why this is not a good idea?

@kyessenov
Copy link
Contributor

kyessenov commented Feb 15, 2018 via email

@kyessenov
Copy link
Contributor

kyessenov commented Feb 15, 2018 via email

@ldemailly
Copy link
Member

that's good points to further explore and refine, but what is the exact downside of that flag (if it works and it does convert even some classes of errors to non errors) applied incrementally ?

@costinm
Copy link
Contributor

costinm commented Feb 15, 2018

Let's test this - if it does indeed reduces the 503 ( I don't think we have 404s any more ) - it's great.
That is what Harvey and Piotr suggested - we may do other things to improve further.

@kyessenov
Copy link
Contributor

kyessenov commented Feb 15, 2018 via email

@kyessenov
Copy link
Contributor

kyessenov commented Feb 15, 2018 via email

@ldemailly
Copy link
Member

ldemailly commented Feb 15, 2018

for historical context, here is the chat on envoy-dev I had with Matt on october 5th:

laurent [3:53 PM]
we kind of need a quick fix for those 404s when rds comes before cds, any creative suggestions ? (edited)
(that doesn't involve switching to grpc and ads)
maybe some way for envoy to verify routes are not dead ends and ignore the ones that are dead ends (or trigger a synchronous rds refresh when found such dead end)

mklein [3:56 PM]
put a hack in pilot to add a sleep or something
between refreshes
or sequence them
with sleeps
is that possible?

laurent [3:57 PM]
partially, it doesn't cover the case where we have multiple pilots and one of them is lagging behind in getting updates from k8s
also it 100% delays updates by 1 cycle when the race is client side

mklein [3:58 PM]
“also it 100% delays updates by 1 cycle when the race is client side” what does that mean?

laurent [3:59 PM]
if lucky, the operator makes a change, the changes propagate in the next refresh cycle (say 10s) in the right order
on average it takes 5s for changes to show up with x% of case where it's broken for another 10s
if we delay it'll always take 15s or such for all changes

mklein [4:00 PM]
I’m not sure there is any easy solution here other than switching to ADS if you need atomic/sequenced updates

laurent [4:01 PM]
do you think we can't hack (in our branch maybe if needed) a "validation" to rds that checks the cds is there and spew unhappyness in the logs about bad routes being ignored for now, instead of 404s to the user ? (edited)
I have not looked at envoy code or architecture so my apologies if what I say is complete nonsense
asking for your advice on anything creative that would be quick (as a side note someone here mentioned it should be 503 and not 404 to avoid end user app making a decision that something is truly gone)

mklein [4:03 PM]
unfortunately it doesn’t really work that way. All updates inside Envoy are atomic. There is no going back
by the time you know it’s bad, there is nothing to go back to

laurent [4:04 PM]
if the json is not formatted well; you don't apply it right ? so can't that validation do a bit more semantic-ish validation (even if the change can be the other direction, the cds could disappear, but just trying to address the most common, observed issue not cover 100% correctness) (edited)

mklein [4:05 PM]
ah
yes
there is a hack here you can do
one sec and I have you covered
brb

laurent [4:05 PM]
woot - thx, take your time.

mklein [4:08 PM]
@laurent see here: https://envoyproxy.github.io/envoy/configuration/http_conn_man/route_config/route_config.html#config-http-conn-man-route-table
set “validate_clusters” to true in your response
for RDS it defaults to false
if you set to true, envoy will reject update
and will keep using current route table

laurent [4:09 PM]
oh cool ! it's already there ! fantastic
let me get a test repro'ing the user report and then try this, that's awesome

@costinm
Copy link
Contributor

costinm commented Feb 15, 2018

If it reduces / eliminates 503, without breaking routes - it's a good change.
We have eventual consistency - some delays are ok, next RDS update will take effect.

For now - this seem the best way to improve the 503 situation. It may not be perfect,
and we may need additional changes in envoy or pilot for 0.7 (for example delay the
RDS on pilot side) - but I don't see any clear harm.

Why wouldn't we validate_cluster ? If the cluster is invalid/missing, what's the point in accepting the route ?

@costinm
Copy link
Contributor

costinm commented Feb 15, 2018

Yes, Matt's advice to sequence RDS update is also possible if this one doesn't solve all 503.
We can easily delay the clearCache for routes - it's better to have correct behavior slower
than 503s.

@kyessenov
Copy link
Contributor

Again, this doesn't eliminate the problem. In lyft's world, clusters are static. You don't create new ones dynamically. What it does, it replaces with two problems 1) slower effective RDS updates 2) 404/503 prior to EDS refresh. It's a trade-off to choose which set of problems you like.

Copy link
Contributor

@costinm costinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep track of this for cherry-picking into 0.6, I think if it solves the reported 503s it's a perfect example of critical fix to cherry pick.

@kyessenov
Copy link
Contributor

How is it a fix if it does not solve the problem of inconsistent xDS config?

@costinm
Copy link
Contributor

costinm commented Feb 15, 2018

If after this change we don't see 503 errors - it is a fix for the 503 bug, which affects real users.
I don't know what other problem you have regarding xDS - it is eventually consistent, so a future
RDS update will get the new route.

@costinm
Copy link
Contributor

costinm commented Feb 15, 2018

Kuat: have you tested with this option on and noticed any failure ?

I understand you believe ADS is the only possible solution - but if this or sequencing
the RDS/CDS improves things, I think it is still a win and there is no valid reason to block this.

@kyessenov
Copy link
Contributor

Please provide some evidence this has positive effect and a flag to control the behavior. I would also want to limit this to ingress since mesh-internal traffic we have client under control.

@costinm
Copy link
Contributor

costinm commented Feb 15, 2018

Mesh internal traffic can also see invalid clusters.

Laurent has a test reproducing 503 - so we'll test with it. If the fix solves it - I don't think
we need a 'return 503' (or invalid_clusters) flag is really needed.

@kyessenov
Copy link
Contributor

I think this deserves a flag since it can have some serious consequences. I'm thinking that envoy-to-envoy there are better ways to hide these 503s.

@rshriram
Copy link
Member

rshriram commented Feb 15, 2018 via email

@ldemailly
Copy link
Member

for anyone who want to test/compare : this PR's pilot:

        image: gcr.io/istio-testing/pilot:31c44885a0825713110bc92c5b6d85235e26a767

vs

        image: gcr.io/istio-release/pilot:0.6.0-pre20180213-09-15-00

I'm about to test it with #1041

ldemailly added a commit that referenced this pull request Feb 15, 2018
Let’s see if we get the failures or not
Oliver Liu and others added 10 commits March 5, 2018 20:49
* move v1alpha2 to networking crd instead of config

* moved to v1alpha3 api version

* revert back to rdsv2 as unrelated to v1alpha3 changes

* remove group version constants from model.go

* added config test cases for new types

* lint fix

* bug fix
* Deb URL to not Depend on istioctl_url

* rename proxyURL to debURL

* fix lint
* EDS tests and refactoring

* Add missing copyright/licence

* Lint fixes

* More lint

* Use 8080 until envoy config is fixed

* More lint

* Add port for grpc

* Add port for pilot e2e

* make fmt

* Lint

* Few more debug statements, but I think the bug was in using values vs pointers in creating the lb

* More logging around resource list update, fix bug

* Format and checkstyle fix attempt

* Format and checkstyle fix attempt

* Cleanup TestEnvoyRun, was testing v1 bootstrap no longer used

* Blind attempt to fix codecov. The file is tested indirectly

* Skip coverage

* Skip coverage - add rationale

* Fix the skip code
* fix wrong url when clearing cache stats

* prefix http:// to pilot_url if the entered url doesn't start with it.
…ng (#4011)

* Rewrite generate.sh script to generate types.go in golang to use the list of schemas directly.

* Remove unintended file.

* Include new types.go from the new script to show the effect.

* Delete old generate.sh

* Lint and update comments.

* Update comments. Regenerate types.go.

* Add Copyright

* Lint.
Automatic merge from submit-queue.

Add more tests to runtime2/dispatcher.

Adds more tests to runtime2/dispatcher.

Also makes the Logger under runtime2/testing, so that we can start using it in end-to-end Mixer tests.
@andraxylia andraxylia requested review from a team March 6, 2018 23:24
@codecov
Copy link

codecov bot commented Mar 6, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@e6d05c5). Click here to learn what that means.
The diff coverage is 62%.

Impacted file tree graph

@@           Coverage Diff            @@
##             master   #3482   +/-   ##
========================================
  Coverage          ?     75%           
========================================
  Files             ?     303           
  Lines             ?   27538           
  Branches          ?       0           
========================================
  Hits              ?   20612           
  Misses            ?    5593           
  Partials          ?    1333
Impacted Files Coverage Δ
pilot/pkg/proxy/agent.go 100% <ø> (ø)
broker/pkg/controller/controller.go 39% <ø> (ø)
broker/pkg/model/config/store.go 100% <ø> (ø)
pilot/pkg/kube/inject/inject.go 80% <ø> (ø)
pilot/pkg/config/memory/monitor.go 91% <ø> (ø)
broker/pkg/model/osb/catalog.go 100% <ø> (ø)
pilot/pkg/proxy/envoy/v1/resources.go 83% <ø> (ø)
istioctl/cmd/istioctl/gendeployment/yaml.go 0% <ø> (ø)
mixer/pkg/config/crd/admit.go 59% <ø> (ø)
istioctl/cmd/istioctl/gendeployment/helm.go 100% <ø> (ø)
... and 45 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e6d05c5...65622f8. Read the comment docs.

@andraxylia
Copy link
Contributor Author

Closing in favor of #4024 , since the set of files changed got messed up by the merge.

@andraxylia andraxylia closed this Mar 7, 2018
@ldemailly
Copy link
Member

I don't see validate_cluster=true in #4024 (which seems to be about the timeouts?)

@istio-testing
Copy link
Collaborator

@andraxylia: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
prow/istio-unit-tests.sh 65622f8 link /test istio-unit-tests
prow/e2e-simpleTests.sh 65622f8 link /test e2e-simple
prow/e2e-bookInfoTests.sh 65622f8 link /test e2e-bookInfo
prow/istio-pilot-e2e.sh 65622f8 link /test istio-pilot-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@andraxylia
Copy link
Contributor Author

Rebase messed up the set of files, closing this PR and opening a new one from the same branch #4031

@ldemailly
Copy link
Member

looks like it's #4039 now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge Block automatic merging of a PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.