Skip to content

Conversation

eddycharly
Copy link
Member

@eddycharly eddycharly commented Apr 18, 2022

This PR removes helm mode setting.
The ha vs standalone mode can be replaced with a simpler replicas count (the ha mode only influenced replicas count, making it hard to understand).
It also adds a couple of warnings in NOTES.TXT when deployed in non-ha and/or without namespace exclusions.

@eddycharly eddycharly requested a review from treydock April 18, 2022 20:01
@codecov-commenter
Copy link

codecov-commenter commented Apr 18, 2022

Codecov Report

Merging #3628 (e37a4a2) into main (11a4884) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #3628   +/-   ##
=======================================
  Coverage   27.13%   27.13%           
=======================================
  Files         137      137           
  Lines       19265    19265           
=======================================
  Hits         5227     5227           
  Misses      13399    13399           
  Partials      639      639           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 11a4884...e37a4a2. Read the comment docs.

Signed-off-by: Charles-Edouard Brétéché <charled.breteche@gmail.com>
Copy link
Contributor

@chipzoller chipzoller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wrote documentation on the mode switch and so I strongly suggest keeping it. Reason being that mode should influence all the other resources and configurations based on either ha or standalone. For example, if mode=ha it should create the PDB, Pod anti-affinity, set replicas, etc. And for mode=standalone it would just set replicas. See this PR to get a sense for what I added. Idea was to link to that Security vs Operability section.

@eddycharly
Copy link
Member Author

This is IMHO quite confusing to have settings for both replicaCount and ha mode.

In practice ha only drives replicas count and everything else is derived from the replicas count.
Using replicaCount is a common practice while having an ha setting is a lot more rare.

@eddycharly
Copy link
Member Author

For example, if mode=ha it should create the PDB

The PDB should probably be created when there is more than 1 replica, regardless of ha or standalone.

@eddycharly
Copy link
Member Author

As an example, this is how ingress-nginx allows configuring an highly available release.

@chipzoller
Copy link
Contributor

chipzoller commented Apr 18, 2022

This is IMHO quite confusing to have settings for both replicaCount and ha mode.

In practice ha only drives replicas count and everything else is derived from the replicas count. Using replicaCount is a common practice while having an ha setting is a lot more rare.

I agree. Let's just remove replicaCount in that case. This would also make the whole two replicas a non-issue. But when you used chained triggers in the the Helm template, it makes it more difficult to understand what things are set from one value. If we keep ha as a value of mode then everything else should be driven off of ha directly, not ha -> replicas: 3 -> foo.

@chipzoller
Copy link
Contributor

For example, if mode=ha it should create the PDB

The PDB should probably be created when there is more than 1 replica, regardless of ha or standalone.

So the only instance in which the PDB should be created is when mode=ha. See previous comment.

@eddycharly
Copy link
Member Author

Let's just remove replicaCount in that case. This would also make the whole two replicas a non-issue

Why not use replicaCount directly instead of hard coding it to 3 ?

I don’t understand the advantage of the mode=ha setting against more explicit replicaCount.

@chipzoller
Copy link
Contributor

Because there are only two "valid" values of replicaCount: either 1 or 3. So rather than letting users potentially shoot themselves in the foot by configuring, for example, 2, which isn't supported, why not give them easy buttons for "standalone" installation or "ha" installation? All the other Kubernetes resources, parameters, configurations etc would then be driven off of those two values.

@eddycharly
Copy link
Member Author

eddycharly commented Apr 19, 2022

Why 2 is not a valid value ? How about 4 or 5 or using an HPA ?

@chipzoller
Copy link
Contributor

Because it breaks the leader election process. We need an odd number and right now only 3 has been tested and marked as supported.

@eddycharly
Copy link
Member Author

Are we really using leader election ?
It seems to me we are using leases that are more like locks and don’t require an odd number of clients.

@chipzoller
Copy link
Contributor

@realshuting can you comment here? What's your feeling if we force either 1 or 3 replicas? We've had problems in the past with an even number (mainly 2) to the point where we don't support 2.

@treydock
Copy link
Contributor

I don't often see "meta" parameters that modify behavior of multiple things but aren't directly used in a template but what is more common more direct values like replicaCount and basing decisions off that. If the concern is people doing the wrong thing, can add some logic to Helm chart so that the only valid values for replicaCount are 1 or 3. If a user provided some other value, the Helm chart would fail to apply.

If we do stick with mode=ha and mode=standalone, I'd recommend having some logic to use fail function to ensure those are the only two possible values.

@realshuting
Copy link
Member

realshuting commented Apr 19, 2022

@realshuting can you comment here? What's your feeling if we force either 1 or 3 replicas? We've had problems in the past with an even number (mainly 2) to the point where we don't support 2.

@eddycharly - there was an issue when running Kyverno with 2 replicas, please see #2053. We are still investigating the issue, please chime in if you've seen this issue before.

It seems to me we are using leases that are more like locks and don’t require an odd number of clients.

This is true, but note that the webhook server does not run leader election, all instances are active to serve admission requests.

For HA, Kyverno must be configured with 3 or more replicas. It is not required to configure Kyverno with an odd number of instances, but this is something recommended even for Kubernetes (cc @mritunjaysharma394):

https://discuss.kubernetes.io/t/high-availability-host-numbers/13143
https://moviecultists.com/should-be-an-odd-number-of-master-nodes-for-quorum

@chipzoller
Copy link
Contributor

Thanks for correcting me, Shuting.

So can we go with the mode parameter but still allow replicaCount?

@treydock
Copy link
Contributor

So can we go with the mode parameter but still allow replicaCount?

I think that might be confusing and also could lead to issues with Helm chart where some places use mode and some use replicaCount. Since replicaCount can be more than 1 or 3 (ie 5), I'd lean towards just using replicaCount only. We could easily produce a failure in the Helm chart if the value is 2 to avoid people doing something unsupported.

@chipzoller
Copy link
Contributor

So here's what I'd recommend:

  1. Allow a mode switch with only two possible values: standalone and ha. Anything else would result in failure. mode is not required, but we would specify in the documentation that mode SHOULD be used as the recommended path to a production deployment. Setting mode=ha results in 3 replicas, PDB, pod anti-affinity, whatever else. Setting mode=standalone results in 1 replica. See flow chart above for printed messages if there are or are not excluded Namespaces and point them to do the docs link.
  2. Allow a replicaCount switch. Value of 2 is prohibited. Allowable values are 1, 3, and 4-N. Values 1 and 3 cannot be set along with the mode switch and print a message saying, "use standalone or ha mode". Values 4-N can exist with or without the mode switch and result in basically mode=ha plus whatever number of replicas are specified.
  3. Other switches and values allowed as current.

Is this too complex? The whole goal is to provide a simpler path for users to deploy Kyverno based on two most common profiles while also letting them know about the risks associated with webhooks. If there's a better way to control the experience, let's put it on the table.

@eddycharly
Copy link
Member Author

Again, the mode value sounds confusing to me and does not seem to bring anything more than using replicaCount directly.

Saying in the docs that for production usage replicas count should be at least 3 and that pods should be spread across hosts/azs is clear and easy to understand.

IMHO talking about quorum is misleading as it’s not related to kyverno.

Talking about leases and how we use them could be useful though.

@chipzoller
Copy link
Contributor

chipzoller commented Apr 19, 2022

Ok, see this flow chart. How does this look? Ignore the "Optional" box on the far left for now.

image

Saying in the docs that for production usage replicas count should be at least 3 and that pods should be spread across hosts/azs is clear and easy to understand.

The problem is most users won't read the documentation and we should try and provide some sane defaults according to our best practices based upon things like replicas.

IMHO talking about quorum is misleading as it’s not related to kyverno.

Yes, it was my mistake. There is no leader election in Kyverno like there is for the control plane (because of RAFT used by etcd).

@chipzoller
Copy link
Contributor

@treydock @eddycharly ?

@eddycharly
Copy link
Member Author

Still not convinced that having a mode setting is the best approach ;-)

I understand the desire to simplify the configuration but I feel like it makes things more confusing, less customisable and introduces unnecessary complexity.

Our discussions here make me think that we have something too complex and hard to understand (for us and for end users), using replicaCount directly could help getting something simpler to use in the end.

@chipzoller
Copy link
Contributor

Look at the flow chart. There is no mode setting any longer. Everything is driven off replicaCount.

@treydock
Copy link
Contributor

That new flow chart looks fine.

@eddycharly
Copy link
Member Author

@chipzoller it looks good.
Shall we merge this pr that removes mode ?

Copy link
Contributor

@chipzoller chipzoller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some formatting comments. We will also need to go back to NOTES.txt based on the agreed upon messages.

@eddycharly
Copy link
Member Author

I can update the warning messages in a follow up PR. This one is more about removing mode.

@chipzoller
Copy link
Contributor

Yes, that's fine, but since you have these files "open" might as well polish what's here.

@vyankyGH vyankyGH self-assigned this Apr 20, 2022
@chipzoller
Copy link
Contributor

I think it's fine to merge, I just want to make sure I understand:

"So this would force a user to specify replicaCount in order for the chart to deploy correctly?"

Signed-off-by: Charles-Edouard Brétéché <charled.breteche@gmail.com>
@eddycharly
Copy link
Member Author

I think it's fine to merge, I just want to make sure I understand:

"So this would force a user to specify replicaCount in order for the chart to deploy correctly?"

When replicas is not set (or null), a Deployment will default the number of replicas to 1 so no need to specify it.

@chipzoller chipzoller enabled auto-merge (squash) April 20, 2022 14:19
@chipzoller chipzoller merged commit 12bbca2 into kyverno:main Apr 20, 2022
realshuting pushed a commit that referenced this pull request Apr 21, 2022
Signed-off-by: Charles-Edouard Brétéché <charled.breteche@gmail.com>
Signed-off-by: ShutingZhao <shuting@nirmata.com>
realshuting added a commit that referenced this pull request Apr 25, 2022
* feat: mutate existing, replace GR by UR in webhook server (#3601)

* add attributes for post mutation

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* add UR informer to webhook server

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* - replace gr with ur in the webhook server; - create ur for mutateExsiting policies

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* replace gr by ur across entire packages

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* add YAMLs

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* update api docs & fix unit tests

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* add UR deletion handler

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* add api docs for v1beta1

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* fix clientset method

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* fix v1beta1 client registration

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* feat: mutate existing - generates UR for admission requests (#3623)

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* replace with UR in policy controller generate rules (#3635)

Signed-off-by: prateekpandey14 <prateek.pandey@nirmata.com>

* - enable mutate engine to process mutateExisting rules; - add unit tests

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* implemented ur background reconciliation for mutateExisting policies

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* fix webhook update error

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* temporary comment out new unit tests

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* feat: mutate existing, replace GR by UR in webhook server (#3601)

* add attributes for post mutation

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* add UR informer to webhook server

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* - replace gr with ur in the webhook server; - create ur for mutateExsiting policies

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* replace gr by ur across entire packages

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* fix missing policy.kyverno.io/policy-name label (#3599)

Signed-off-by: prateekpandey14 <prateek.pandey@nirmata.com>

* refactor cli code from pkg to cmd (#3591)

* refactor cli code from pkg to cmd

Signed-off-by: Mritunjay Sharma <mritunjaysharma394@gmail.com>

* fixes in imports

Signed-off-by: Mritunjay Sharma <mritunjaysharma394@gmail.com>

* fixes tests

Signed-off-by: Mritunjay Sharma <mritunjaysharma394@gmail.com>

* fixed conflicts

Signed-off-by: Mritunjay Sharma <mritunjaysharma394@gmail.com>

* moved non-commands to utils

Signed-off-by: Mritunjay Sharma <mritunjaysharma394@gmail.com>

Co-authored-by: Vyankatesh Kudtarkar <vyankateshkd@gmail.com>

* add YAMLs

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* update api docs & fix unit tests

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* add UR deletion handler

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* add api docs for v1beta1

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* fix clientset method

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* add-kms-libraries for cosign (#3603)

* add-kms-libraries

Signed-off-by: anushkamittal20 <anumittal4641@gmail.com>

* Shifted providers to cosign package

Signed-off-by: anushkamittal20 <anumittal4641@gmail.com>
Signed-off-by: ShutingZhao <shuting@nirmata.com>

* Add support for custom image extractors (#3596)

Signed-off-by: Sambhav Kothari <skothari44@bloomberg.net>

* Update vulnerable dependencies (#3577)

Signed-off-by: Shubham Gupta <shubham.gupta2956@gmail.com>

Co-authored-by: Jim Bugwadia <jim@nirmata.com>
Signed-off-by: ShutingZhao <shuting@nirmata.com>

* fix v1beta1 client registration

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* feat: mutate existing - generates UR for admission requests (#3623)

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* updating version in Chart.yaml (#3618)

* updatimg version in Chart.yaml

Signed-off-by: Prateeknandle <prateeknandle@gmail.com>

* changes from, make gen-helm

Signed-off-by: Prateeknandle <prateeknandle@gmail.com>

Co-authored-by: Vyankatesh Kudtarkar <vyankateshkd@gmail.com>
Signed-off-by: ShutingZhao <shuting@nirmata.com>

* Allow kyverno-policies to have preconditions defined (#3606)

* Allow kyverno-policies to have preconditions defined

Signed-off-by: Trey Dockendorf <tdockendorf@osc.edu>

* Fix docs

Signed-off-by: Trey Dockendorf <tdockendorf@osc.edu>
Signed-off-by: ShutingZhao <shuting@nirmata.com>

* replace with UR in policy controller generate rules (#3635)

Signed-off-by: prateekpandey14 <prateek.pandey@nirmata.com>
Signed-off-by: ShutingZhao <shuting@nirmata.com>

* - enable mutate engine to process mutateExisting rules; - add unit tests

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* implemented ur background reconciliation for mutateExisting policies

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* fix webhook update error

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* temporary comment out new unit tests

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* Image verify attestors (#3614)

* fix logs

Signed-off-by: Jim Bugwadia <jim@nirmata.com>

* fix logs

Signed-off-by: Jim Bugwadia <jim@nirmata.com>

* support multiple attestors

Signed-off-by: Jim Bugwadia <jim@nirmata.com>

* rm CLI tests (not currently supported)

Signed-off-by: Jim Bugwadia <jim@nirmata.com>

* apply attestor repo

Signed-off-by: Jim Bugwadia <jim@nirmata.com>

* fix linter issues

Signed-off-by: Jim Bugwadia <jim@nirmata.com>

* fix entryError assignment

Signed-off-by: Jim Bugwadia <jim@nirmata.com>

* fix tests

Signed-off-by: Jim Bugwadia <jim@nirmata.com>

* format

Signed-off-by: Jim Bugwadia <jim@nirmata.com>

* add intermediary certs

Signed-off-by: Jim Bugwadia <jim@nirmata.com>

* Allow defining imagePullSecrets (#3633)

* Allow defining imagePullSecrets

Signed-off-by: Trey Dockendorf <tdockendorf@osc.edu>

* Use dict for imagePullSecrets

Signed-off-by: Trey Dockendorf <tdockendorf@osc.edu>

* Simplify how imagePullSecrets is defined

Signed-off-by: Trey Dockendorf <tdockendorf@osc.edu>
Signed-off-by: ShutingZhao <shuting@nirmata.com>

* Fix race condition in pCache (#3632)

* fix race condition in pCache

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* refact: remove unused Run function from generate (#3638)

Signed-off-by: prateekpandey14 <prateek.pandey@nirmata.com>

* Remove helm mode setting (#3628)

Signed-off-by: Charles-Edouard Brétéché <charled.breteche@gmail.com>
Signed-off-by: ShutingZhao <shuting@nirmata.com>

* refactor: image utils (#3630)

Signed-off-by: Charles-Edouard Brétéché <charled.breteche@gmail.com>
Signed-off-by: ShutingZhao <shuting@nirmata.com>

* -resolve lift comments; -fix informer sync issue

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* refact the update request cleanup controller

Signed-off-by: prateekpandey14 <prateek.pandey@nirmata.com>

* - fix delete request for mutateExisting; - fix context variable substitution; - improve logging

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* - enable events; - add last applied annotation

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* enable mutate existing on policy creation

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* update autogen code

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* merge main

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* add unit tests

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* address list comments

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* update api docs

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* fix "Implicit memory aliasing in for loop"

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* remove unused definitions

Signed-off-by: ShutingZhao <shuting@nirmata.com>

* update api docs

Signed-off-by: ShutingZhao <shuting@nirmata.com>

Co-authored-by: Prateek Pandey <prateek.pandey@nirmata.com>
Co-authored-by: Mritunjay Kumar Sharma <mritunjaysharma394@gmail.com>
Co-authored-by: Vyankatesh Kudtarkar <vyankateshkd@gmail.com>
Co-authored-by: Anushka Mittal <55237170+anushkamittal20@users.noreply.github.com>
Co-authored-by: Sambhav Kothari <sambhavs.email@gmail.com>
Co-authored-by: Shubham Gupta <shubham.gupta2956@gmail.com>
Co-authored-by: Jim Bugwadia <jim@nirmata.com>
Co-authored-by: Prateek Nandle <56027872+Prateeknandle@users.noreply.github.com>
Co-authored-by: treydock <tdockendorf@osc.edu>
Co-authored-by: Charles-Edouard Brétéché <charled.breteche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants