How to categorize this issue?
/area networking testing
/kind bug
What happened:
In the provider-local HA setup (tested with single-zone but should also apply to multi-zone), kube-apiserver talks directly to the kubelet API instead of using the VPN connection.
With this, operations like `kubectl logs` and `kubectl port-forward` (for which the kubelet API is called by kube-apiserver) work even if the VPN connection is broken.
As the VPN tunnel check performed by gardenlet uses a port-forward operation (code), the shoot can be reconciled successfully and be marked as healthy even if the VPN connection is broken.
This problem might cause bugs and regressions in the VPN setup to go unnoticed.
E.g., in #9597 there was a problem in the HA VPN configuration (fixed in a later commit).
Nevertheless, most test cases of `pull-gardener-e2e-kind-ha-{single,multi}-zone` succeeded, i.e., shoot creations were successful although the VPN connection was never working.
The problem was only discovered by chance in the credentials rotation test case (ref).
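For context, such a tunnel check essentially boils down to a port-forward request issued through the shoot's kube-apiserver. The following is a minimal, hypothetical sketch of such a check using client-go (it is not the actual gardenlet code; package, function, and parameter names are illustrative). It shows why the check succeeds as long as kube-apiserver can reach the kubelet at all, no matter whether that path goes through the VPN:

```go
package tunnelcheck

import (
	"fmt"
	"net/http"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/portforward"
	"k8s.io/client-go/transport/spdy"
)

// CheckTunnelViaPortForward opens a port-forward to a shoot pod through the
// shoot's kube-apiserver. It only proves that kube-apiserver can reach the
// kubelet hosting the pod, not that this path goes through the VPN.
func CheckTunnelViaPortForward(config *rest.Config, namespace, podName string, port int, timeout time.Duration) error {
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}

	// URL of the pod's "portforward" subresource served by kube-apiserver.
	url := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(namespace).
		Name(podName).
		SubResource("portforward").
		URL()

	transport, upgrader, err := spdy.RoundTripperFor(config)
	if err != nil {
		return err
	}
	dialer := spdy.NewDialer(upgrader, &http.Client{Transport: transport}, http.MethodPost, url)

	stopCh := make(chan struct{})
	readyCh := make(chan struct{})
	// "0:<port>" lets client-go pick a random local port for the forward.
	fw, err := portforward.New(dialer, []string{fmt.Sprintf("0:%d", port)}, stopCh, readyCh, os.Stdout, os.Stderr)
	if err != nil {
		return err
	}

	errCh := make(chan error, 1)
	go func() { errCh <- fw.ForwardPorts() }()

	select {
	case err := <-errCh:
		return fmt.Errorf("port-forward failed: %w", err)
	case <-readyCh:
		// Forwarding is established, i.e. kube-apiserver reached the kubelet.
		close(stopCh)
		return nil
	case <-time.After(timeout):
		close(stopCh)
		return fmt.Errorf("timed out waiting for port-forward to become ready")
	}
}
```

Once the port-forward stream becomes ready, the check passes; whether the underlying connection went through the VPN or directly over the seed pod network is invisible to the caller.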
What you expected to happen:
If the VPN connection cannot be established successfully:
- shoot reconciliations should fail
- shoot status should be set to unhealthy
- e2e tests should fail accordingly
How to reproduce it (as minimally and precisely as possible):
1. `make kind-ha-single-zone-up gardener-ha-single-zone-up`
2. Apply the following patch to `example/provider-local/shoot.yaml`:

   ```diff
   --- a/example/provider-local/shoot.yaml
   +++ b/example/provider-local/shoot.yaml
   @@ -8,6 +8,10 @@ metadata:
        shoot.gardener.cloud/cloud-config-execution-max-delay-seconds: "0"
        authentication.gardener.cloud/issuer: "managed"
    spec:
   +  controlPlane:
   +    highAvailability:
   +      failureTolerance:
   +        type: node
      cloudProfileName: local
      secretBindingName: local # dummy, doesn't contain any credentials
      region: local
   ```

3. `kubectl apply -f example/provider-local/shoot.yaml`
4. Wait for the shoot to be reconciled successfully and healthy.
5. Verify manually that the VPN connection works:

   ```bash
   k -n kube-system logs deploy/metrics-server --request-timeout 2s
   k -n kube-system port-forward svc/metrics-server 8443:443 --request-timeout 2s
   k top no
   ```

6. Break the VPN connection:

   ```bash
   k -n shoot--local--local scale sts vpn-seed-server --replicas 0
   ```

7. Ensure there are no more open TCP connections from kube-apiserver to kubelet:

   ```bash
   k -n shoot--local--local delete po -l role=apiserver
   ```

8. Repeat the VPN verification from step 5. `logs` and `port-forward` work, while connection to the metrics-server (`k top no`) doesn't work.
9. Observe that the shoot status is healthy.
Anything else we need to know?:
This only applies to HA clusters, where routes to the shoot networks are configured explicitly in the kube-apiserver pods.
For non-HA clusters, there is an `EgressSelectorConfiguration` that connects to the `envoy-proxy` container in the `vpn-seed-server` using `HTTPConnect` instead of using explicitly configured IP routes.
E.g.:
```
$ k -n shoot--local--local exec -it deploy/kube-apiserver -c vpn-path-controller -- sh
~ # ip r
default via 169.254.1.1 dev eth0
10.3.0.0/16 via 192.168.123.195 dev bond0 # shoot pod network
10.4.0.0/16 via 192.168.123.195 dev bond0 # shoot service network
192.168.123.0/26 dev tap0 proto kernel scope link src 192.168.123.9 # VPN network
192.168.123.64/26 dev tap1 proto kernel scope link src 192.168.123.72 # VPN network
192.168.123.192/26 dev bond0 proto kernel scope link src 192.168.123.237 # VPN network
169.254.1.1 dev eth0 scope link
~ # ip r get 10.1.54.75 # node IP
10.1.54.75 via 169.254.1.1 dev eth0 src 10.1.178.85 uid 0
    cache
```
Note that there is no route for the shoot node network. This is because `Shoot.spec.networking.nodes` is empty, as it overlaps with `Seed.spec.networks.pods` (provider-local starts pods in the seed as shoot nodes).
Hence, kube-apiserver can talk directly to the kubelet API via the seed pod network.
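For comparison, here is a rough sketch of what the non-HA mechanism mentioned above looks like: an `EgressSelectorConfiguration` with the `HTTPConnect` protocol pointing at the `envoy-proxy` in `vpn-seed-server`. It is built with the upstream `k8s.io/apiserver` configuration types purely for illustration; the URL and certificate paths are placeholders, not the values gardener actually renders:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	apiserverv1beta1 "k8s.io/apiserver/pkg/apis/apiserver/v1beta1"
	"sigs.k8s.io/yaml"
)

// egressSelectorConfig sketches an EgressSelectorConfiguration using the
// HTTPConnect protocol. URL and file paths are illustrative placeholders.
func egressSelectorConfig() *apiserverv1beta1.EgressSelectorConfiguration {
	return &apiserverv1beta1.EgressSelectorConfiguration{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "apiserver.k8s.io/v1beta1",
			Kind:       "EgressSelectorConfiguration",
		},
		EgressSelections: []apiserverv1beta1.EgressSelection{{
			// "cluster" selects kube-apiserver traffic towards nodes/pods/services,
			// i.e. the requests that would otherwise need the IP routes shown above.
			Name: "cluster",
			Connection: apiserverv1beta1.Connection{
				ProxyProtocol: apiserverv1beta1.ProtocolHTTPConnect,
				Transport: &apiserverv1beta1.Transport{
					TCP: &apiserverv1beta1.TCPTransport{
						URL: "https://vpn-seed-server:9443", // envoy-proxy endpoint (placeholder)
						TLSConfig: &apiserverv1beta1.TLSConfig{
							CABundle:   "/etc/kubernetes/egress/ca.crt",  // placeholder
							ClientCert: "/etc/kubernetes/egress/tls.crt", // placeholder
							ClientKey:  "/etc/kubernetes/egress/tls.key", // placeholder
						},
					},
				},
			},
		}},
	}
}

func main() {
	out, err := yaml.Marshal(egressSelectorConfig())
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```

With such a configuration, kube-apiserver tunnels cluster-directed requests (logs, port-forward, etc.) through the HTTP CONNECT proxy, so it needs no IP routes to the shoot networks at all.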
There are even multiple mechanisms for allowing this direct communication path from kube-apiserver to kubelet:
- `allow-machine-pods` `NetworkPolicy` (gardener/pkg/provider-local/controller/infrastructure/actuator.go, lines 61 to 65 in 70fe495):

  ```go
  Ingress: []networkingv1.NetworkPolicyIngressRule{{
  	From: []networkingv1.NetworkPolicyPeer{
  		{PodSelector: &metav1.LabelSelector{MatchLabels: map[string]string{v1beta1constants.LabelNetworkPolicyToShootNetworks: v1beta1constants.LabelNetworkPolicyAllowed}}},
  	},
  }},
  ```

- `machines` `Service`: https://github.com/gardener/machine-controller-manager-provider-local/blob/aa28b3aede72b45440183187c23db89ea76840d5/pkg/local/create_machine.go#L67-L86
- webhook for adding the `networking.resources.gardener.cloud/to-machines-tcp-10250=allowed` label to `kube-apiserver` (see the sketch below): `metav1.SetMetaDataLabel(&new.Spec.Template.ObjectMeta, gardenerutils.NetworkPolicyLabel("machines", 10250), v1beta1constants.LabelNetworkPolicyAllowed)`
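To make the third mechanism a bit more concrete, here is a hypothetical, self-contained version of what the webhook effectively does to the kube-apiserver Deployment (only the `SetMetaDataLabel` statement quoted above is taken from the actual code; package and function names are made up):

```go
package localwebhook

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	v1beta1constants "github.com/gardener/gardener/pkg/apis/core/v1beta1/constants"
	gardenerutils "github.com/gardener/gardener/pkg/utils/gardener"
)

// addMachinesNetworkPolicyLabel labels the kube-apiserver pod template with
// networking.resources.gardener.cloud/to-machines-tcp-10250=allowed so that
// the generated NetworkPolicy allows traffic from kube-apiserver to the
// machine pods' kubelet port (10250).
func addMachinesNetworkPolicyLabel(deployment *appsv1.Deployment) {
	metav1.SetMetaDataLabel(&deployment.Spec.Template.ObjectMeta,
		gardenerutils.NetworkPolicyLabel("machines", 10250),
		v1beta1constants.LabelNetworkPolicyAllowed)
}
```

The resulting label is what opens TCP 10250 from kube-apiserver to the machine pods, i.e. the direct path to the kubelet API.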
To verify that kube-apiserver of local HA shoots talks directly to the kubelet API, use the following steps:
1. Create an HA shoot. Wait for the shoot to be reconciled successfully and healthy.
2. `k -n shoot--local--local delete netpol allow-machine-pods`
3. `k -n shoot--local--local delete svc machines`
4. Ensure there are no more open TCP connections from kube-apiserver to kubelet:

   ```bash
   k -n shoot--local--local delete po -l role=apiserver
   ```

5. Repeat the VPN verification from step 5 above. `logs` and `port-forward` don't work (they don't use the VPN connection), while connection to the metrics-server (`k top no`) works (it uses the working VPN connection).
6. Observe that the shoot status is unhealthy because the `port-forward` operation doesn't work.
Environment:
- Gardener version: `v1.93.0-dev`