Description
How to categorize this issue?
/area testing
/kind flake
Which test(s)/suite(s) are flaking:
E2E tests for extensions which use the KinD setup can sometimes flake during the step which deploys the extensions' charts in the local KinD cluster.
CI link:
- https://prow.gardener.cloud/view/gs/gardener-prow/pr-logs/pull/gardener_gardener-extension-shoot-rsyslog-relp/34/pull-gardener-extension-shoot-rsyslog-relp-e2e-kind/1723972989701066752
- https://prow.gardener.cloud/view/gs/gardener-prow/pr-logs/pull/gardener_gardener-extension-shoot-rsyslog-relp/38/pull-gardener-extension-shoot-rsyslog-relp-e2e-kind/1731974321494036480
Reason for failure:
This can happen if the extensions' charts contain a ClusterRole resource. For example, the shoot-rsyslog-relp extension deploys a ClusterRole as part of the skaffold deployment of shoot-rsyslog-relp-admission, which is used for the e2e tests.
This skaffold deployment can fail with the following error:
Error: INSTALLATION FAILED: 1 error occurred:
* Internal error occurred: failed calling webhook "clusterrole-machine-controller-manager.local.extensions.gardener.cloud": failed to call webhook: Post "https://gardener-extension-provider-local.extension-provider-local-5mf8n.svc:443/clusterrole-machine-controller-manager?timeout=5s": dial tcp 10.2.126.230:443: connect: connection refused
The failure occurs because the gardener-extension-provider-local pods can get evicted by VPA during the deployment of the extension charts, which leaves gardener-extension-provider-local's webhook server temporarily unavailable.
The clusterrole-machine-controller-manager.local.extensions.gardener.cloud webhook does not have any selectors:
gardener/pkg/provider-local/webhook/machinecontrollermanager/add.go
Lines 71 to 80 in bcaed6d
	return &extensionswebhook.Webhook{
		Name:           name,
		Provider:       provider,
		Types:          types,
		Target:         target,
		Path:           name,
		Webhook:        &admission.Webhook{Handler: handler, RecoverPanic: true},
		FailurePolicy:  &failurePolicy,
		TimeoutSeconds: pointer.Int32(5),
	}, nil
However, it is only responsible for the system:machine-controller-manager-runtime ClusterRole:
	if newObj.GetName() != "system:machine-controller-manager-runtime" {
		return nil
	}
Therefore, any attempt to deploy a ClusterRole while the gardener-extension-provider-local pods are down will fail, even though the webhook would ignore that ClusterRole anyway.
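To make the failure mode concrete, here is a minimal, self-contained Go sketch of the behaviour described above. It is a simplified model, not the actual Gardener or kube-apiserver code: the `webhook` type, its `serverUp` field, the `admit` and `handle` functions, and the ClusterRole name `some-extension-clusterrole` are all hypothetical. It also assumes that the failure policy rejects requests when the webhook is unreachable, which is what the error above suggests.

```go
package main

import (
	"errors"
	"fmt"
)

// errConnRefused stands in for the "connection refused" error from the CI logs.
var errConnRefused = errors.New("failed calling webhook: connection refused")

// webhook is a hypothetical, simplified model of a webhook that is registered
// for all ClusterRoles (no selectors) and rejects requests when unreachable.
type webhook struct {
	serverUp bool // false while the webhook server pods are evicted
}

// handle mirrors the name check quoted above: only one ClusterRole is handled,
// every other ClusterRole is ignored.
func (w webhook) handle(name string) string {
	if name != "system:machine-controller-manager-runtime" {
		return "ignored"
	}
	return "handled"
}

// admit models the API server side: without selectors, the webhook is called
// for every ClusterRole, so the name check in handle only runs if the server
// is reachable.
func (w webhook) admit(name string) (string, error) {
	if !w.serverUp {
		return "", errConnRefused
	}
	return w.handle(name), nil
}

func main() {
	// Hypothetical ClusterRole names: one unrelated, one the webhook cares about.
	names := []string{"some-extension-clusterrole", "system:machine-controller-manager-runtime"}

	for _, up := range []bool{true, false} {
		w := webhook{serverUp: up}
		for _, name := range names {
			result, err := w.admit(name)
			fmt.Printf("serverUp=%v name=%q result=%q err=%v\n", up, name, result, err)
		}
	}
}
```

In this model, the unrelated ClusterRole is rejected while the server is down, although the handler would have ignored it once reachable. That matches the flake: the name filter lives inside the handler, so it cannot protect requests that never reach it.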
Anything else we need to know: