Error in DeployMachineClasses deadlocks Shoot create/delete #12611

@matthias-horne

How to categorize this issue?

/area control-plane
/kind bug

What happened:

Creating a shoot cluster failed because the Worker could not be reconciled: the provider extension could not create the MachineClasses in DeployMachineClasses. The cause was a missing configuration that should have been caught during Shoot validation, but that specific validation was missing.

The error returned during Worker reconciliation was not propagated to the Shoot level, so the user was not informed about their misconfiguration. Instead, the shoot creation failed with a generic worker status machineDeployments has not been updated.

The user then tried to delete the cluster without fixing the configuration. Deletion fails as well, because DeployMachineClasses is also called when deleting the Worker:

// Redeploy generated machine classes to update credentials machine-controller-manager used.
log.Info("Deploying the machine classes")
if err := workerDelegate.DeployMachineClasses(ctx); err != nil {
	return fmt.Errorf("failed to deploy the machine classes: %w", err)
}

The shoot is now deadlocked: it can neither finish creation nor be deleted.
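The shape of the deadlock can be modelled in a few lines of standalone Go. This is a toy model, not the extension code: `shootSpec`, `deployMachineClasses`, `reconcileWorker`, and `deleteWorker` are hypothetical stand-ins. It only shows that when both paths gate on the same failing step, an invalid configuration blocks both of them.

```go
package main

import (
	"errors"
	"fmt"
)

// shootSpec is a hypothetical, heavily simplified stand-in for the user's
// Shoot configuration. valid is false when required configuration is missing.
type shootSpec struct {
	valid bool
}

// deployMachineClasses stands in for the extension's DeployMachineClasses and
// fails whenever the configuration is invalid.
func deployMachineClasses(spec shootSpec) error {
	if !spec.valid {
		return errors.New("missing provider configuration")
	}
	return nil
}

// reconcileWorker models the create/reconcile path.
func reconcileWorker(spec shootSpec) error {
	if err := deployMachineClasses(spec); err != nil {
		return fmt.Errorf("failed to deploy the machine classes: %w", err)
	}
	return nil
}

// deleteWorker models the delete path, which runs the same step first.
func deleteWorker(spec shootSpec) error {
	if err := deployMachineClasses(spec); err != nil {
		return fmt.Errorf("failed to deploy the machine classes: %w", err)
	}
	return nil
}

func main() {
	broken := shootSpec{valid: false}
	fmt.Println("reconcile:", reconcileWorker(broken))
	fmt.Println("delete:", deleteWorker(broken))
}
```

With `valid: false`, both calls return an error, so the only way out in this model is to fix the configuration.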

What you expected to happen:

  1. Errors during Worker reconciliation should be propagated up to the Shoot level to inform the user why it failed.
  2. A shoot should not end up in a deadlock when DeployMachineClasses fails. Deleting the cluster should still succeed.

How to reproduce it (as minimally and precisely as possible):

Since this depends on errors in a closed-source extension, there are no easy steps to reproduce. Feel free to reach out to me for details.

Anything else we need to know?:

I have already started working on a fix for propagating the error to the Shoot, so we can focus on the deadlock issue here.

Environment:

  • Gardener version: 1.118.3
  • Kubernetes version (use kubectl version): 1.32.5
  • Cloud provider or hardware configuration:
  • Others:
