Add validate_cluster=true #3482

andraxylia · 2018-02-14T04:31:19Z

No description provided.

istio-merge-robot · 2018-02-14T04:31:37Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
We suggest the following additional approver: myidpt

Assign the PR to them by writing /assign @myidpt in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

pilot/OWNERS

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

ldemailly

cool, does it change the results for #1041 ?
(maybe patch the 2 together in your local branch - or a test pr)

ldemailly · 2018-02-14T04:33:26Z

pilot/pkg/proxy/envoy/v1/testdata/rds-router-gateway-weighted-server1.json.golden

@@ -46,4 +47,4 @@
    ]
   }
  ]
- }
+ }


(not important/nit) it's strange that just that one doesn't have a newline suddenly ?
I suppose it must mean the newlines are ignored for purpose of pass/fail ?

this is what gets generated.

andraxylia · 2018-02-14T04:44:18Z

@ldemailly yes, I would like to patch and test, this is why I asked you to rebase that PR

rshriram

Please dont do this.. We have considered this option and decided to avoid this as it will not solve the 404 problem fully. If anything, it will convert from 404 to 503 as the clusters will be there but won't have endpoints until the next eds call.

Let me run through this once more to see if things have changed..

cc @kyessenov ..

andraxylia · 2018-02-15T01:02:38Z

I checked with Envoy and this is what we are supposed to do, this is the correct Envoy behaviour. We should have done this already. Do you have some arguments to support why this is not a good idea?

kyessenov · 2018-02-15T01:04:39Z

Shriram just explained it, but there are at least two problems with this approach: 1. RDS needs to be blocked on EDS update, not just CDS update which is not possible 2. What happens when clusters are removed?

…

On Wed, Feb 14, 2018 at 5:02 PM Andra Cismaru ***@***.***> wrote: I checked with Envoy and this is what we are supposed to do, this is the correct Envoy behaviour. We should have done this already. Do you have some arguments to support why this is not a good idea? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3482 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AJGIxpI21W4QIDX7x5fexvCzgnFix6chks5tU4I0gaJpZM4SEx-i> .

kyessenov · 2018-02-15T01:05:23Z

see envoyproxy/envoy#1930 for cluster warming.

…

On Wed, Feb 14, 2018 at 5:04 PM Kuat Yessenov ***@***.***> wrote: Shriram just explained it, but there are at least two problems with this approach: 1. RDS needs to be blocked on EDS update, not just CDS update which is not possible 2. What happens when clusters are removed? On Wed, Feb 14, 2018 at 5:02 PM Andra Cismaru ***@***.***> wrote: > I checked with Envoy and this is what we are supposed to do, this is the > correct Envoy behaviour. We should have done this already. Do you have some > arguments to support why this is not a good idea? > > — > You are receiving this because you were mentioned. > > > Reply to this email directly, view it on GitHub > <#3482 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AJGIxpI21W4QIDX7x5fexvCzgnFix6chks5tU4I0gaJpZM4SEx-i> > . >

ldemailly · 2018-02-15T01:09:53Z

that's good points to further explore and refine, but what is the exact downside of that flag (if it works and it does convert even some classes of errors to non errors) applied incrementally ?

costinm · 2018-02-15T01:11:07Z

Let's test this - if it does indeed reduces the 503 ( I don't think we have 404s any more ) - it's great.
That is what Harvey and Piotr suggested - we may do other things to improve further.

kyessenov · 2018-02-15T01:11:40Z

The downside is more non-determinism since now RDS updates are denied _if_ they come prior to CDS. Since clusters can be both added and removed at the same time, I'm not sure the thing will converge fast enough across CDS and RDS update sequence.

…

On Wed, Feb 14, 2018 at 5:10 PM Laurent Demailly ***@***.***> wrote: that's good points to further explore and refine, but what is the exact downside of that flag (if it works and it does convert even some classes of errors to non errors) applied incrementally ? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3482 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AJGIxtIdo_UUiOnESYCOj2b7nBhwyH2Kks5tU4PogaJpZM4SEx-i> .

kyessenov · 2018-02-15T01:13:59Z

How do we qualify the testing? Do we have a reproducible set-up so that we can make a data-driven decision? This has both positive and negative sides, so we need to evaluate the trade-off.

…

On Wed, Feb 14, 2018 at 5:11 PM Kuat Yessenov ***@***.***> wrote: The downside is more non-determinism since now RDS updates are denied _if_ they come prior to CDS. Since clusters can be both added and removed at the same time, I'm not sure the thing will converge fast enough across CDS and RDS update sequence. On Wed, Feb 14, 2018 at 5:10 PM Laurent Demailly ***@***.***> wrote: > that's good points to further explore and refine, but what is the exact > downside of that flag (if it works and it does convert even some classes of > errors to non errors) applied incrementally ? > > — > You are receiving this because you were mentioned. > > > Reply to this email directly, view it on GitHub > <#3482 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AJGIxtIdo_UUiOnESYCOj2b7nBhwyH2Kks5tU4PogaJpZM4SEx-i> > . >

ldemailly · 2018-02-15T01:17:25Z

for historical context, here is the chat on envoy-dev I had with Matt on october 5th:

laurent [3:53 PM]
we kind of need a quick fix for those 404s when rds comes before cds, any creative suggestions ? (edited)
(that doesn't involve switching to grpc and ads)
maybe some way for envoy to verify routes are not dead ends and ignore the ones that are dead ends (or trigger a synchronous rds refresh when found such dead end)

mklein [3:56 PM]
put a hack in pilot to add a sleep or something
between refreshes
or sequence them
with sleeps
is that possible?

laurent [3:57 PM]
partially, it doesn't cover the case where we have multiple pilots and one of them is lagging behind in getting updates from k8s
also it 100% delays updates by 1 cycle when the race is client side

mklein [3:58 PM]
“also it 100% delays updates by 1 cycle when the race is client side” what does that mean?

laurent [3:59 PM]
if lucky, the operator makes a change, the changes propagate in the next refresh cycle (say 10s) in the right order
on average it takes 5s for changes to show up with x% of case where it's broken for another 10s
if we delay it'll always take 15s or such for all changes

mklein [4:00 PM]
I’m not sure there is any easy solution here other than switching to ADS if you need atomic/sequenced updates

laurent [4:01 PM]
do you think we can't hack (in our branch maybe if needed) a "validation" to rds that checks the cds is there and spew unhappyness in the logs about bad routes being ignored for now, instead of 404s to the user ? (edited)
I have not looked at envoy code or architecture so my apologies if what I say is complete nonsense
asking for your advice on anything creative that would be quick (as a side note someone here mentioned it should be 503 and not 404 to avoid end user app making a decision that something is truly gone)

mklein [4:03 PM]
unfortunately it doesn’t really work that way. All updates inside Envoy are atomic. There is no going back
by the time you know it’s bad, there is nothing to go back to

laurent [4:04 PM]
if the json is not formatted well; you don't apply it right ? so can't that validation do a bit more semantic-ish validation (even if the change can be the other direction, the cds could disappear, but just trying to address the most common, observed issue not cover 100% correctness) (edited)

mklein [4:05 PM]
ah
yes
there is a hack here you can do
one sec and I have you covered
brb

laurent [4:05 PM]
woot - thx, take your time.

mklein [4:08 PM]
@laurent see here: https://envoyproxy.github.io/envoy/configuration/http_conn_man/route_config/route_config.html#config-http-conn-man-route-table
set “validate_clusters” to true in your response
for RDS it defaults to false
if you set to true, envoy will reject update
and will keep using current route table

laurent [4:09 PM]
oh cool ! it's already there ! fantastic
let me get a test repro'ing the user report and then try this, that's awesome

costinm · 2018-02-15T01:17:33Z

If it reduces / eliminates 503, without breaking routes - it's a good change.
We have eventual consistency - some delays are ok, next RDS update will take effect.

For now - this seem the best way to improve the 503 situation. It may not be perfect,
and we may need additional changes in envoy or pilot for 0.7 (for example delay the
RDS on pilot side) - but I don't see any clear harm.

Why wouldn't we validate_cluster ? If the cluster is invalid/missing, what's the point in accepting the route ?

costinm · 2018-02-15T01:19:09Z

Yes, Matt's advice to sequence RDS update is also possible if this one doesn't solve all 503.
We can easily delay the clearCache for routes - it's better to have correct behavior slower
than 503s.

kyessenov · 2018-02-15T01:21:04Z

Again, this doesn't eliminate the problem. In lyft's world, clusters are static. You don't create new ones dynamically. What it does, it replaces with two problems 1) slower effective RDS updates 2) 404/503 prior to EDS refresh. It's a trade-off to choose which set of problems you like.

costinm

Please keep track of this for cherry-picking into 0.6, I think if it solves the reported 503s it's a perfect example of critical fix to cherry pick.

kyessenov · 2018-02-15T01:22:51Z

How is it a fix if it does not solve the problem of inconsistent xDS config?

costinm · 2018-02-15T01:25:01Z

If after this change we don't see 503 errors - it is a fix for the 503 bug, which affects real users.
I don't know what other problem you have regarding xDS - it is eventually consistent, so a future
RDS update will get the new route.

costinm · 2018-02-15T01:29:49Z

Kuat: have you tested with this option on and noticed any failure ?

I understand you believe ADS is the only possible solution - but if this or sequencing
the RDS/CDS improves things, I think it is still a win and there is no valid reason to block this.

kyessenov · 2018-02-15T01:30:01Z

Please provide some evidence this has positive effect and a flag to control the behavior. I would also want to limit this to ingress since mesh-internal traffic we have client under control.

costinm · 2018-02-15T01:32:05Z

Mesh internal traffic can also see invalid clusters.

Laurent has a test reproducing 503 - so we'll test with it. If the fix solves it - I don't think
we need a 'return 503' (or invalid_clusters) flag is really needed.

kyessenov · 2018-02-15T01:33:49Z

I think this deserves a flag since it can have some serious consequences. I'm thinking that envoy-to-envoy there are better ways to hide these 503s.

rshriram · 2018-02-15T02:39:16Z

Keep in mind that 503/404 needs to be tested with a 60s refresh delay. If you are testing with a 1s delay, you don’t even need this optimization.

…

On Wed, Feb 14, 2018 at 8:33 PM Kuat ***@***.***> wrote: I think this deserves a flag since it can have some serious consequences. I'm thinking that envoy-to-envoy there are better ways to hide these 503s. — You are receiving this because your review was requested. Reply to this email directly, view it on GitHub <#3482 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AH0qd1PAjcYQK9YWhfJY9bINv219CyZfks5tU4mHgaJpZM4SEx-i> .

ldemailly · 2018-02-15T02:48:29Z

for anyone who want to test/compare : this PR's pilot:

        image: gcr.io/istio-testing/pilot:31c44885a0825713110bc92c5b6d85235e26a767

vs

        image: gcr.io/istio-release/pilot:0.6.0-pre20180213-09-15-00

I'm about to test it with #1041

Let’s see if we get the failures or not

* move v1alpha2 to networking crd instead of config * moved to v1alpha3 api version * revert back to rdsv2 as unrelated to v1alpha3 changes * remove group version constants from model.go * added config test cases for new types * lint fix * bug fix

* Deb URL to not Depend on istioctl_url * rename proxyURL to debURL * fix lint

…ion (#3664)

* EDS tests and refactoring * Add missing copyright/licence * Lint fixes * More lint * Use 8080 until envoy config is fixed * More lint * Add port for grpc * Add port for pilot e2e * make fmt * Lint * Few more debug statements, but I think the bug was in using values vs pointers in creating the lb * More logging around resource list update, fix bug * Format and checkstyle fix attempt * Format and checkstyle fix attempt * Cleanup TestEnvoyRun, was testing v1 bootstrap no longer used * Blind attempt to fix codecov. The file is tested indirectly * Skip coverage * Skip coverage - add rationale * Fix the skip code

* fix wrong url when clearing cache stats * prefix http:// to pilot_url if the entered url doesn't start with it.

…ng (#4011) * Rewrite generate.sh script to generate types.go in golang to use the list of schemas directly. * Remove unintended file. * Include new types.go from the new script to show the effect. * Delete old generate.sh * Lint and update comments. * Update comments. Regenerate types.go. * Add Copyright * Lint.

Automatic merge from submit-queue. Add more tests to runtime2/dispatcher. Adds more tests to runtime2/dispatcher. Also makes the Logger under runtime2/testing, so that we can start using it in end-to-end Mixer tests.

codecov · 2018-03-06T23:32:36Z

Codecov Report

❗ No coverage uploaded for pull request base (master@e6d05c5). Click here to learn what that means.
The diff coverage is 62%.

@@           Coverage Diff            @@
##             master   #3482   +/-   ##
========================================
  Coverage          ?     75%           
========================================
  Files             ?     303           
  Lines             ?   27538           
  Branches          ?       0           
========================================
  Hits              ?   20612           
  Misses            ?    5593           
  Partials          ?    1333

Impacted Files	Coverage Δ
pilot/pkg/proxy/agent.go	`100% <ø> (ø)`
broker/pkg/controller/controller.go	`39% <ø> (ø)`
broker/pkg/model/config/store.go	`100% <ø> (ø)`
pilot/pkg/kube/inject/inject.go	`80% <ø> (ø)`
pilot/pkg/config/memory/monitor.go	`91% <ø> (ø)`
broker/pkg/model/osb/catalog.go	`100% <ø> (ø)`
pilot/pkg/proxy/envoy/v1/resources.go	`83% <ø> (ø)`
istioctl/cmd/istioctl/gendeployment/yaml.go	`0% <ø> (ø)`
mixer/pkg/config/crd/admit.go	`59% <ø> (ø)`
istioctl/cmd/istioctl/gendeployment/helm.go	`100% <ø> (ø)`
... and 45 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e6d05c5...65622f8. Read the comment docs.

andraxylia · 2018-03-06T23:57:09Z

Closing in favor of #4024 , since the set of files changed got messed up by the merge.

ldemailly · 2018-03-07T00:04:50Z

I don't see validate_cluster=true in #4024 (which seems to be about the timeouts?)

istio-testing · 2018-03-07T00:05:58Z

@andraxylia: The following tests failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
prow/istio-unit-tests.sh	`65622f8`	link	`/test istio-unit-tests`
prow/e2e-simpleTests.sh	`65622f8`	link	`/test e2e-simple`
prow/e2e-bookInfoTests.sh	`65622f8`	link	`/test e2e-bookInfo`
prow/istio-pilot-e2e.sh	`65622f8`	link	`/test istio-pilot-e2e`

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

andraxylia · 2018-03-07T00:09:53Z

Rebase messed up the set of files, closing this PR and opening a new one from the same branch #4031

ldemailly · 2018-03-07T20:13:21Z

looks like it's #4039 now

Add validate_cluster=true

31c4488

andraxylia requested a review from a team February 14, 2018 04:31

googlebot added the cla: yes label Feb 14, 2018

andraxylia requested a review from costinm February 14, 2018 04:31

ldemailly reviewed Feb 14, 2018

View reviewed changes

rshriram previously requested changes Feb 14, 2018

View reviewed changes

costinm approved these changes Feb 15, 2018

View reviewed changes

ldemailly added a commit that referenced this pull request Feb 15, 2018

Applying #3482

64d4eb5

Let’s see if we get the failures or not

ldemailly mentioned this pull request Feb 15, 2018

Tests reproing 503s (was 404s) during routerule apply - with t.Skip() until we have fix #1041

Merged

Oliver Liu and others added 10 commits March 5, 2018 20:49

Add tests for Plugged-cert CA. (#4000)

95322c3

Deb URL to not Depend on istioctl_url (#3993)

c5d9338

* Deb URL to not Depend on istioctl_url * rename proxyURL to debURL * fix lint

remove unnecessary deployment-level sidecar.istio.io/inject annotat…

09c368f

…ion (#3664)

fix wrong url when clearing cache stats (#3985)

16272d6

* fix wrong url when clearing cache stats * prefix http:// to pilot_url if the entered url doesn't start with it.

Add more tests to runtime2/dispatcher. (#3988)

1ca7ebf

Automatic merge from submit-queue. Add more tests to runtime2/dispatcher. Adds more tests to runtime2/dispatcher. Also makes the Logger under runtime2/testing, so that we can start using it in end-to-end Mixer tests.

Add validate_cluster=true

84ab1e8

Merge branch 'fix_503' of https://github.com/istio/istio into fix_503

65622f8

andraxylia requested review from a team March 6, 2018 23:24

andraxylia closed this Mar 7, 2018

andraxylia mentioned this pull request Mar 7, 2018

Fix 503s part 1: add validate_cluster=true #4039

Merged

@@ @@ -46,4 +47,4 @@ @@
                   ]
                  }
                 ]
-               }
+               }

Add validate_cluster=true #3482

Add validate_cluster=true #3482

Uh oh!

Conversation

andraxylia commented Feb 14, 2018

Uh oh!

istio-merge-robot commented Feb 14, 2018

Uh oh!

ldemailly left a comment

Choose a reason for hiding this comment

Uh oh!

ldemailly Feb 14, 2018

Choose a reason for hiding this comment

Uh oh!

andraxylia Feb 14, 2018

Choose a reason for hiding this comment

Uh oh!

andraxylia commented Feb 14, 2018

Uh oh!

rshriram left a comment

Choose a reason for hiding this comment

Uh oh!

andraxylia commented Feb 15, 2018

Uh oh!

kyessenov commented Feb 15, 2018 via email

Uh oh!

kyessenov commented Feb 15, 2018 via email

Uh oh!

ldemailly commented Feb 15, 2018

Uh oh!

costinm commented Feb 15, 2018

Uh oh!

kyessenov commented Feb 15, 2018 via email

Uh oh!

kyessenov commented Feb 15, 2018 via email

Uh oh!

ldemailly commented Feb 15, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

costinm commented Feb 15, 2018

Uh oh!

costinm commented Feb 15, 2018

Uh oh!

kyessenov commented Feb 15, 2018

Uh oh!

costinm left a comment

Choose a reason for hiding this comment

Uh oh!

kyessenov commented Feb 15, 2018

Uh oh!

costinm commented Feb 15, 2018

Uh oh!

costinm commented Feb 15, 2018

Uh oh!

kyessenov commented Feb 15, 2018

Uh oh!

costinm commented Feb 15, 2018

Uh oh!

kyessenov commented Feb 15, 2018

Uh oh!

rshriram commented Feb 15, 2018 via email

Uh oh!

ldemailly commented Feb 15, 2018

Uh oh!

codecov bot commented Mar 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

andraxylia commented Mar 6, 2018

Uh oh!

ldemailly commented Mar 7, 2018

Uh oh!

istio-testing commented Mar 7, 2018

Uh oh!

andraxylia commented Mar 7, 2018

Uh oh!

ldemailly commented Mar 7, 2018

Uh oh!

Uh oh!

ldemailly commented Feb 15, 2018 •

edited

Loading

codecov bot commented Mar 6, 2018 •

edited

Loading