Skip to content

[Flaky Test] E2E tests for extensions can fail due to unavailability of clusterrole-machine-controller-manager.local.extensions.gardener.cloud webhook #9020

@plkokanov

Description

@plkokanov

How to categorize this issue?

/area testing
/kind flake

Which test(s)/suite(s) are flaking:
E2E tests for extensions which use the KinD setup can sometimes flake during the step which deploys the extensions' charts in the local KinD cluster.

CI link:

Reason for failure:
This can happen if the extensions' charts contain a clusterrole resource. E.g. the shoot-rsyslog-relp extension deploys a ClusterRole as part of the skaffold deployment for the shoot-rsyslog-relp-admission used for the e2e tests.

This skaffold deployment can fail with the following error:

Error: INSTALLATION FAILED: 1 error occurred:
	* Internal error occurred: failed calling webhook "clusterrole-machine-controller-manager.local.extensions.gardener.cloud": failed to call webhook: Post "[https://gardener-extension-provider-local.extension-provider-local-5mf8n.svc:443/clusterrole-machine-controller-manager?timeout=5s](https://gardener-extension-provider-local.extension-provider-local-5mf8n.svc/clusterrole-machine-controller-manager?timeout=5s)": dial tcp 10.2.126.230:443: connect: connection refused

The reason for the failure is that the gardener-extension-provider-local pods could get evicted by VPA during the deployment of the extension charts, meaning that the gardener-extension-provider-locals webhook server will be temporarily unavailable.

The clusterrole-machine-controller-manager.local.extensions.gardener.cloud webhook does not have any selectors:

return &extensionswebhook.Webhook{
Name: name,
Provider: provider,
Types: types,
Target: target,
Path: name,
Webhook: &admission.Webhook{Handler: handler, RecoverPanic: true},
FailurePolicy: &failurePolicy,
TimeoutSeconds: pointer.Int32(5),
}, nil

However, it is only responsible for the system:machine-controller-manager-runtime ClusterRole:
if newObj.GetName() != "system:machine-controller-manager-runtime" {
return nil
}
.

Therefore, anything that tries to deploy a ClusterRole while the gardener-extension-provider-local pods are down will fail.

Anything else we need to know:

Metadata

Metadata

Assignees

No one assigned

    Labels

    area/auto-scalingAuto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) relatedarea/scalabilityScalability relatedarea/testingTesting relatedkind/flakeTracking or fixing a flaky test

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions