Skip to content

In-place Agones Upgrades: Storage Compatibility #3771

@zmerlynn

Description

@zmerlynn

Note

Milestone of #3766, which we are seeking feedback on. We will move forward with pieces that seem non-contentious, though.

Storage Compatibility

Defaulting

Right now, Agones defaults values in the webhook, e.g. GameServer defaulting is the only real thing the GameServer mutation webhook does. Defaulting serves a couple of purposes:

  • Code can rely on default values, e.g. hook for eviction.safe. It’s complicated to constantly have to verify the object has been properly defaulted.
  • Users see the rendered default value in kubectl describe. This increases discoverability for new API elements, as describe is a common UI element.

The problem is that defaulting in the webhook alone is not safe across configuration changes. Example using eviction.safe:

  • t0: Agones deployed at 1.29 with SafeToEvict disabled.
  • t1: GameServer foo created. eviction.safe is not defaulted because the feature gate was not enabled.
  • t2: Agones upgraded to 1.30, SafeToEvict enabled. eviction.safe is defaulted on new objects.
  • t3: controller fails silently for foo, logging unknown eviction safe value “”.

The reason for the failure is that the hook blindly assumes the value was defaulted by the webhook, but the webhook never had a chance to run. Note that for this particular case, the gap here is very narrow in time - in particular t1 and t2 need to occur such that defaulting of the GameServer occurs on a 1.29 agones-extensions container, but a 1.30 agones-controller container attempts to create the Pod. That said, this condition could easily occur during rollout of 1.30, though, and cause a multi-second hiccup.

To solve this problem, I propose we:

  • Change our defaulting code to be idempotent. Right now there is a comment ApplyDefaults applies default values to the GameServer if they are not already populated which suggests that this should be the case, but is immediately followed by two non-idempotent operations (at least non-idempotent over an upgrade).
  • If defaulting is idempotent, then the event handlers for controller can run the defaulting before the GameServer enters the queue, e.g. we can call ApplyDefaults here (and in similar places we Get in the controller - probably via helper function). This effectively ensures defaulting on creation and on subsequent update, but it relies on idempotent defaulting.

Unknown or Disabled Fields

A similar problem exists for API fields that should not be present but have already been set, which typically only occurs if a feature gate has been disabled (either due to downgrade or explicit disablement). This takes a couple of forms:

  • An old-version controller reads a new-version field that it has no knowledge of. Again using the example of eviction.safe, imagine downgrading from 1.30 (where the feature gate is Beta and therefore the API is defaulted) to 1.28 (prior to eviction.safe even existing).
  • A controller has knowledge of a field but the feature gate is disabled. This path results in validation errors on create, but in the case the user actively disabled a feature gate, we need to cover the latent/existing objects as well.

In either of these cases, we need to ensure that on "first touch", the controller drops the unknown fields, rather than preserving them. In general, this is a safer handling of latent unknown fields - otherwise when the feature gate is reenabled, a preserved field could surprise the user. (Note that the first case, where the field is not present at all in the CRD, is generally covered by field pruning, so mostly this is figuring out logic for the latter case.)

Update vs Patch for controllers

Note that controllers have a similar problem to the SDK of using Update vs Patch, mentioned here - different controller versions may drop fields. However:

  • skew between controller versions is not expected to last long (minutes at most), so the problem is short-lived. This is a little different than the sdk-server, which requires a full rollout.
  • we call out in the constraints that some features may flap until rolled out: RFC: In-place Agones Upgrades #3766 (comment)

Metadata

Metadata

Assignees

Labels

kind/featureNew features for AgonesstalePending closure unless there is a strong objection.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions