-
Notifications
You must be signed in to change notification settings - Fork 8.1k
Add validate_cluster=true #3482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Assign the PR to them by writing The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool, does it change the results for #1041 ?
(maybe patch the 2 together in your local branch - or a test pr)
@@ -46,4 +47,4 @@ | |||
] | |||
} | |||
] | |||
} | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not important/nit) it's strange that just that one doesn't have a newline suddenly ?
I suppose it must mean the newlines are ignored for purpose of pass/fail ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is what gets generated.
@ldemailly yes, I would like to patch and test, this is why I asked you to rebase that PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please dont do this.. We have considered this option and decided to avoid this as it will not solve the 404 problem fully. If anything, it will convert from 404 to 503 as the clusters will be there but won't have endpoints until the next eds call.
Let me run through this once more to see if things have changed..
cc @kyessenov ..
I checked with Envoy and this is what we are supposed to do, this is the correct Envoy behaviour. We should have done this already. Do you have some arguments to support why this is not a good idea? |
Shriram just explained it, but there are at least two problems with this
approach:
1. RDS needs to be blocked on EDS update, not just CDS update which is not
possible
2. What happens when clusters are removed?
…On Wed, Feb 14, 2018 at 5:02 PM Andra Cismaru ***@***.***> wrote:
I checked with Envoy and this is what we are supposed to do, this is the
correct Envoy behaviour. We should have done this already. Do you have some
arguments to support why this is not a good idea?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3482 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AJGIxpI21W4QIDX7x5fexvCzgnFix6chks5tU4I0gaJpZM4SEx-i>
.
|
see envoyproxy/envoy#1930 for cluster warming.
…On Wed, Feb 14, 2018 at 5:04 PM Kuat Yessenov ***@***.***> wrote:
Shriram just explained it, but there are at least two problems with this
approach:
1. RDS needs to be blocked on EDS update, not just CDS update which is not
possible
2. What happens when clusters are removed?
On Wed, Feb 14, 2018 at 5:02 PM Andra Cismaru ***@***.***>
wrote:
> I checked with Envoy and this is what we are supposed to do, this is the
> correct Envoy behaviour. We should have done this already. Do you have some
> arguments to support why this is not a good idea?
>
> —
> You are receiving this because you were mentioned.
>
>
> Reply to this email directly, view it on GitHub
> <#3482 (comment)>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/AJGIxpI21W4QIDX7x5fexvCzgnFix6chks5tU4I0gaJpZM4SEx-i>
> .
>
|
that's good points to further explore and refine, but what is the exact downside of that flag (if it works and it does convert even some classes of errors to non errors) applied incrementally ? |
Let's test this - if it does indeed reduces the 503 ( I don't think we have 404s any more ) - it's great. |
The downside is more non-determinism since now RDS updates are denied _if_
they come prior to CDS. Since clusters can be both added and removed at the
same time, I'm not sure the thing will converge fast enough across CDS and
RDS update sequence.
…On Wed, Feb 14, 2018 at 5:10 PM Laurent Demailly ***@***.***> wrote:
that's good points to further explore and refine, but what is the exact
downside of that flag (if it works and it does convert even some classes of
errors to non errors) applied incrementally ?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3482 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AJGIxtIdo_UUiOnESYCOj2b7nBhwyH2Kks5tU4PogaJpZM4SEx-i>
.
|
How do we qualify the testing? Do we have a reproducible set-up so that we
can make a data-driven decision? This has both positive and negative sides,
so we need to evaluate the trade-off.
…On Wed, Feb 14, 2018 at 5:11 PM Kuat Yessenov ***@***.***> wrote:
The downside is more non-determinism since now RDS updates are denied _if_
they come prior to CDS. Since clusters can be both added and removed at the
same time, I'm not sure the thing will converge fast enough across CDS and
RDS update sequence.
On Wed, Feb 14, 2018 at 5:10 PM Laurent Demailly ***@***.***>
wrote:
> that's good points to further explore and refine, but what is the exact
> downside of that flag (if it works and it does convert even some classes of
> errors to non errors) applied incrementally ?
>
> —
> You are receiving this because you were mentioned.
>
>
> Reply to this email directly, view it on GitHub
> <#3482 (comment)>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/AJGIxtIdo_UUiOnESYCOj2b7nBhwyH2Kks5tU4PogaJpZM4SEx-i>
> .
>
|
for historical context, here is the chat on envoy-dev I had with Matt on october 5th:
|
If it reduces / eliminates 503, without breaking routes - it's a good change. For now - this seem the best way to improve the 503 situation. It may not be perfect, Why wouldn't we validate_cluster ? If the cluster is invalid/missing, what's the point in accepting the route ? |
Yes, Matt's advice to sequence RDS update is also possible if this one doesn't solve all 503. |
Again, this doesn't eliminate the problem. In lyft's world, clusters are static. You don't create new ones dynamically. What it does, it replaces with two problems 1) slower effective RDS updates 2) 404/503 prior to EDS refresh. It's a trade-off to choose which set of problems you like. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep track of this for cherry-picking into 0.6, I think if it solves the reported 503s it's a perfect example of critical fix to cherry pick.
How is it a fix if it does not solve the problem of inconsistent xDS config? |
If after this change we don't see 503 errors - it is a fix for the 503 bug, which affects real users. |
Kuat: have you tested with this option on and noticed any failure ? I understand you believe ADS is the only possible solution - but if this or sequencing |
Please provide some evidence this has positive effect and a flag to control the behavior. I would also want to limit this to ingress since mesh-internal traffic we have client under control. |
Mesh internal traffic can also see invalid clusters. Laurent has a test reproducing 503 - so we'll test with it. If the fix solves it - I don't think |
I think this deserves a flag since it can have some serious consequences. I'm thinking that envoy-to-envoy there are better ways to hide these 503s. |
Keep in mind that 503/404 needs to be tested with a 60s refresh delay. If
you are testing with a 1s delay, you don’t even need this optimization.
…On Wed, Feb 14, 2018 at 8:33 PM Kuat ***@***.***> wrote:
I think this deserves a flag since it can have some serious consequences.
I'm thinking that envoy-to-envoy there are better ways to hide these 503s.
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#3482 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AH0qd1PAjcYQK9YWhfJY9bINv219CyZfks5tU4mHgaJpZM4SEx-i>
.
|
for anyone who want to test/compare : this PR's pilot:
vs
I'm about to test it with #1041 |
* move v1alpha2 to networking crd instead of config * moved to v1alpha3 api version * revert back to rdsv2 as unrelated to v1alpha3 changes * remove group version constants from model.go * added config test cases for new types * lint fix * bug fix
* Deb URL to not Depend on istioctl_url * rename proxyURL to debURL * fix lint
* EDS tests and refactoring * Add missing copyright/licence * Lint fixes * More lint * Use 8080 until envoy config is fixed * More lint * Add port for grpc * Add port for pilot e2e * make fmt * Lint * Few more debug statements, but I think the bug was in using values vs pointers in creating the lb * More logging around resource list update, fix bug * Format and checkstyle fix attempt * Format and checkstyle fix attempt * Cleanup TestEnvoyRun, was testing v1 bootstrap no longer used * Blind attempt to fix codecov. The file is tested indirectly * Skip coverage * Skip coverage - add rationale * Fix the skip code
* fix wrong url when clearing cache stats * prefix http:// to pilot_url if the entered url doesn't start with it.
…ng (#4011) * Rewrite generate.sh script to generate types.go in golang to use the list of schemas directly. * Remove unintended file. * Include new types.go from the new script to show the effect. * Delete old generate.sh * Lint and update comments. * Update comments. Regenerate types.go. * Add Copyright * Lint.
Automatic merge from submit-queue. Add more tests to runtime2/dispatcher. Adds more tests to runtime2/dispatcher. Also makes the Logger under runtime2/testing, so that we can start using it in end-to-end Mixer tests.
Codecov Report
@@ Coverage Diff @@
## master #3482 +/- ##
========================================
Coverage ? 75%
========================================
Files ? 303
Lines ? 27538
Branches ? 0
========================================
Hits ? 20612
Misses ? 5593
Partials ? 1333
Continue to review full report at Codecov.
|
Closing in favor of #4024 , since the set of files changed got messed up by the merge. |
I don't see validate_cluster=true in #4024 (which seems to be about the timeouts?) |
@andraxylia: The following tests failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Rebase messed up the set of files, closing this PR and opening a new one from the same branch #4031 |
looks like it's #4039 now |
No description provided.