[provider-local] VPN tunnel check succeeds even if VPN is broken #9604

@timebertt

Description

How to categorize this issue?

/area networking testing
/kind bug

What happened:

In the provider-local HA setup (tested with single-zone, but this should also apply to multi-zone), kube-apiserver talks directly to the kubelet API instead of using the VPN connection.
As a consequence, operations like kubectl logs and kubectl port-forward (for which kube-apiserver calls the kubelet API) work even if the VPN connection is broken.

As the VPN tunnel check performed by gardenlet uses a port-forward operation (code), the shoot can be reconciled successfully and be marked as healthy even if the VPN connection is broken.
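To make the difference concrete: both of the following requests are served by kube-apiserver, but in the provider-local HA setup only the second one actually has to traverse the VPN to reach the shoot's service network. This is why a port-forward-based check keeps succeeding while e.g. k top no fails (the same commands are used for verification in the reproduction steps below):

$ kubectl -n kube-system port-forward svc/metrics-server 8443:443 --request-timeout 2s # kube-apiserver -> kubelet API, works here without the VPN
$ kubectl top node # kube-apiserver -> metrics-server via the shoot service network, requires the VPN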

This problem might cause bugs and regressions in the VPN setup to go unnoticed.
E.g., in #9597 there was a problem in the HA VPN configuration (fixed in a later commit).
Nevertheless, most test cases of pull-gardener-e2e-kind-ha-{single,multi}-zone succeeded. I.e., shoot creations were successful although the VPN connection was never working.
The problem was only discovered by chance in the credentials rotation test case (ref).

What you expected to happen:

If the VPN connection cannot be established successfully:

  • shoot reconciliations should fail
  • shoot status should be set to unhealthy
  • e2e tests should fail accordingly

How to reproduce it (as minimally and precisely as possible):

  1. make kind-ha-single-zone-up gardener-ha-single-zone-up
  2. Apply the following patch to example/provider-local/shoot.yaml
--- a/example/provider-local/shoot.yaml
+++ b/example/provider-local/shoot.yaml
@@ -8,6 +8,10 @@ metadata:
     shoot.gardener.cloud/cloud-config-execution-max-delay-seconds: "0"
     authentication.gardener.cloud/issuer: "managed"
 spec:
+  controlPlane:
+    highAvailability:
+      failureTolerance:
+        type: node
   cloudProfileName: local
   secretBindingName: local # dummy, doesn't contain any credentials
   region: local
  3. kubectl apply -f example/provider-local/shoot.yaml
  4. Wait for the shoot to be reconciled successfully and healthy.
  5. Verify manually that the VPN connection works:
    • k -n kube-system logs deploy/metrics-server --request-timeout 2s
    • k -n kube-system port-forward svc/metrics-server 8443:443 --request-timeout 2s
    • k top no
  6. Break the VPN connection: k -n shoot--local--local scale sts vpn-seed-server --replicas 0
  7. Ensure there are no more open TCP connections from kube-apiserver to kubelet: k -n shoot--local--local delete po -l role=apiserver
  8. Repeat the VPN verification from step 5: logs and port-forward still work, while the connection to metrics-server (k top no) doesn't.
  9. Observe that the shoot status is healthy (one way to check this from the garden cluster is sketched below this list).
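One way to inspect the shoot's conditions from the garden cluster (this assumes the default provider-local example, i.e. a shoot named local in namespace garden-local; adjust the names if your setup differs):

$ kubectl -n garden-local get shoot local -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

Even with the broken VPN, the conditions still report a healthy shoot, which is exactly the problem described above.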

Anything else we need to know?:

This only applies to HA clusters, where routes to the shoot networks are configured explicitly in the kube-apiserver pods.
For non-HA clusters, there is an EgressSelectorConfiguration that connects to the envoy-proxy container in the vpn-seed-server using HTTPConnect instead of using explicitly configured IP routes.
E.g.:

$ k -n shoot--local--local exec -it deploy/kube-apiserver -c vpn-path-controller -- sh
~ # ip r
default via 169.254.1.1 dev eth0
10.3.0.0/16 via 192.168.123.195 dev bond0 # shoot pod network
10.4.0.0/16 via 192.168.123.195 dev bond0 # shoot service network
192.168.123.0/26 dev tap0 proto kernel scope link src 192.168.123.9 # VPN network
192.168.123.64/26 dev tap1 proto kernel scope link src 192.168.123.72 # VPN network
192.168.123.192/26 dev bond0 proto kernel scope link src 192.168.123.237 # VPN network
169.254.1.1 dev eth0 scope link
~ # ip r get 10.1.54.75 # node IP
10.1.54.75 via 169.254.1.1 dev eth0 src 10.1.178.85 uid 0
    cache

Note that there is no route for the shoot node network. This is because Shoot.spec.networking.nodes is empty, as it overlaps with Seed.spec.networks.pods (provider-local starts pods in the seed as shoot nodes).
Hence, kube-apiserver can talk directly to the kubelet API via the seed pod network.
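For comparison, running the same kind of lookup for an address in the shoot service network shows the VPN path (10.4.0.10 is just an arbitrary address inside the 10.4.0.0/16 service range from the routing table above; the output will look roughly like this):

~ # ip r get 10.4.0.10 # arbitrary shoot service IP
10.4.0.10 via 192.168.123.195 dev bond0 src 192.168.123.237 uid 0
    cache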

There are even multiple mechanisms that allow this direct communication path from kube-apiserver to kubelet (e.g., the allow-machine-pods NetworkPolicy and the machines Service that are removed in the verification steps below).

To verify that kube-apiserver of local HA shoots talks directly to the kubelet API, use the following steps:

  1. Create an HA shoot. Wait for the shoot to be reconciled successfully and healthy.
  2. k -n shoot--local--local delete netpol allow-machine-pods
  3. k -n shoot--local--local delete svc machines (both objects can be restored afterwards, see the sketch after this list)
  4. Ensure there are no more open TCP connections from kube-apiserver to kubelet: k -n shoot--local--local delete po -l role=apiserver
  5. Repeat the VPN verification from step 5 above: logs and port-forward don't work (they don't use the VPN connection), while the connection to metrics-server (k top no) works (it uses the working VPN connection).
  6. Observe that the shoot status is unhealthy because the port-forward operation doesn't work.
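To restore the objects deleted for this verification, triggering a shoot reconciliation should be sufficient (assuming gardenlet recreates them during its regular reconciliation; shoot name and namespace are those of the provider-local example again):

$ kubectl -n garden-local annotate shoot local gardener.cloud/operation=reconcile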

Environment:

  • Gardener version: v1.93.0-dev
