-
Notifications
You must be signed in to change notification settings - Fork 858
Description
Note
Milestone of #3766, which we are seeking feedback on. We will move forward with pieces that seem non-contentious, though.
Storage Compatibility
Defaulting
Right now, Agones defaults values in the webhook, e.g. GameServer defaulting is the only real thing the GameServer mutation webhook does. Defaulting serves a couple of purposes:
- Code can rely on default values, e.g. hook for eviction.safe. It’s complicated to constantly have to verify the object has been properly defaulted.
- Users see the rendered default value in
kubectl describe
. This increases discoverability for new API elements, asdescribe
is a common UI element.
The problem is that defaulting in the webhook alone is not safe across configuration changes. Example using eviction.safe
:
- t0: Agones deployed at 1.29 with
SafeToEvict
disabled. - t1: GameServer
foo
created.eviction.safe
is not defaulted because the feature gate was not enabled. - t2: Agones upgraded to 1.30,
SafeToEvict
enabled.eviction.safe
is defaulted on new objects. - t3: controller fails silently for foo, logging unknown eviction safe value “”.
The reason for the failure is that the hook blindly assumes the value was defaulted by the webhook, but the webhook never had a chance to run. Note that for this particular case, the gap here is very narrow in time - in particular t1 and t2 need to occur such that defaulting of the GameServer occurs on a 1.29 agones-extensions
container, but a 1.30 agones-controller
container attempts to create the Pod. That said, this condition could easily occur during rollout of 1.30, though, and cause a multi-second hiccup.
To solve this problem, I propose we:
- Change our defaulting code to be idempotent. Right now there is a comment ApplyDefaults applies default values to the GameServer if they are not already populated which suggests that this should be the case, but is immediately followed by two non-idempotent operations (at least non-idempotent over an upgrade).
- If defaulting is idempotent, then the event handlers for controller can run the defaulting before the GameServer enters the queue, e.g. we can call ApplyDefaults here (and in similar places we
Get
in the controller - probably via helper function). This effectively ensures defaulting on creation and on subsequent update, but it relies on idempotent defaulting.
Unknown or Disabled Fields
A similar problem exists for API fields that should not be present but have already been set, which typically only occurs if a feature gate has been disabled (either due to downgrade or explicit disablement). This takes a couple of forms:
- An old-version controller reads a new-version field that it has no knowledge of. Again using the example of
eviction.safe
, imagine downgrading from 1.30 (where the feature gate is Beta and therefore the API is defaulted) to 1.28 (prior toeviction.safe
even existing). - A controller has knowledge of a field but the feature gate is disabled. This path results in validation errors on create, but in the case the user actively disabled a feature gate, we need to cover the latent/existing objects as well.
In either of these cases, we need to ensure that on "first touch", the controller drops the unknown fields, rather than preserving them. In general, this is a safer handling of latent unknown fields - otherwise when the feature gate is reenabled, a preserved field could surprise the user. (Note that the first case, where the field is not present at all in the CRD, is generally covered by field pruning, so mostly this is figuring out logic for the latter case.)
Update vs Patch for controllers
Note that controllers have a similar problem to the SDK of using Update
vs Patch
, mentioned here - different controller versions may drop fields. However:
- skew between controller versions is not expected to last long (minutes at most), so the problem is short-lived. This is a little different than the
sdk-server
, which requires a full rollout. - we call out in the constraints that some features may flap until rolled out: RFC: In-place Agones Upgrades #3766 (comment)