tklauser
Member

See individual commits.

@tklauser tklauser requested review from a team as code owners June 23, 2021 08:52
@tklauser tklauser requested review from ldelossa and rolinh June 23, 2021 08:52
@tklauser tklauser temporarily deployed to ci June 23, 2021 08:52 Inactive
@tklauser
Member Author

tklauser commented Jun 23, 2021

The new EKS (ENI) failure is #355.

@tklauser
Member Author

The External Workloads failure looks like a real issue with 1.10. It looks like the Cilium endpoints in the cluster are unreachable for the Cilium agent running in the VM.

 Cluster health:                                                                        1/3 reachable   (2021-06-23T09:00:33Z)
  Name                                                                                 IP              Node        Endpoints
  cilium-cilium-cli-963673456-vm/gke-cilium-cilium-cli-96-default-pool-138a09de-66px   10.168.0.97     reachable   unreachable
  cilium-cilium-cli-963673456-vm/gke-cilium-cilium-cli-96-default-pool-138a09de-n7d7   10.168.0.98     reachable   unreachable

I'm trying to replicate this on my own GKE cluster.

/cc @jrajahalme for visibility

@jrajahalme
Member

> The External Workloads failure looks like a real issue with 1.10. It looks like the Cilium endpoints in the cluster are unreachable for the Cilium agent running in the VM.
>
>  Cluster health:                                                                        1/3 reachable   (2021-06-23T09:00:33Z)
>   Name                                                                                 IP              Node        Endpoints
>   cilium-cilium-cli-963673456-vm/gke-cilium-cilium-cli-96-default-pool-138a09de-66px   10.168.0.97     reachable   unreachable
>   cilium-cilium-cli-963673456-vm/gke-cilium-cilium-cli-96-default-pool-138a09de-n7d7   10.168.0.98     reachable   unreachable

This is most likely a red herring, as the VM has successfully connected to the etcd container in the clustermesh-apiserver pod, so the datapath itself appears to be working. Could this be something more specific to the DNS service?

@tklauser tklauser force-pushed the pr/tklauser/bump-cilium-1.10.1 branch from 68aded4 to fb2aaef Compare June 23, 2021 21:29
@tklauser tklauser temporarily deployed to ci June 23, 2021 21:29 Inactive
@tklauser tklauser force-pushed the pr/tklauser/bump-cilium-1.10.1 branch from fb2aaef to 44ccd55 Compare June 24, 2021 16:14
@tklauser tklauser temporarily deployed to ci June 24, 2021 16:14 Inactive
@jrajahalme
Member

jrajahalme commented Jun 25, 2021

Looked into the external workloads failure. It works if the VM is installed with Cilium 1.9.8. With Cilium 1.10.1 in the VM, no services are returned by cilium service list in the VM. Comparing the Cilium agent logs in the VM between these versions, I see this common part:

level=debug msg="Added new service frontend:10.35.240.10/ports=[dns-tcp dns]/selector=map[k8s-app:kube-dns]" serviceName=jarno-test/kube-system/kube-dns subsys=k8s
level=debug msg="Updating backends to map[10.32.0.168:map[dns:0xc0003ed840 dns-tcp:0xc0003ed880] 10.32.1.20:map[dns:0xc0003ed900 dns-tcp:0xc0003ed960]]" serviceName=jarno-test/kube-system/kube-dns subsys=k8s
level=debug msg="Kubernetes service definition changed" action=service-updated endpoints="10.32.0.168:53/TCP,10.32.0.168:53/UDP,10.32.1.20:53/TCP,10.32.1.20:53/UDP" k8sNamespace=kube-system k8sSvcName=kube-dns old-service=nil service="frontend:10.35.240.10/ports=[dns-tcp dns]/selector=map[k8s-app:kube-dns]" subsys=k8s-watcher

There is a small difference for v1.10.1: the last line has no ports.

> level=debug msg="Kubernetes service definition changed" action=service-updated endpoints="10.32.0.168:53/TCP,10.32.0.168:53/UDP,10.32.1.20:53/TCP,10.32.1.20:53/UDP" k8sNamespace=kube-system k8sSvcName=kube-dns old-service=nil service="frontends:[10.35.240.10]/ports=[]/selector=map[k8s-app:kube-dns]" subsys=k8s-watcher

But the following only appears (after the line above) on Cilium 1.9.8:

level=debug msg="Upserting service" backends="[{0  {10.32.0.168 {TCP 53} 0}} {0  {10.32.1.20 {TCP 53} 0}}]" loadBalancerSourceRanges="[]" serviceIP="{10.35.240.10 {TCP 53} 0}" serviceName=kube-dns serviceNamespace=kube-system sessionAffinity=false sessionAffinityTimeout=0 subsys=service svcHealthCheckNodePort=0 svcTrafficPolicy=Cluster svcType=ClusterIP
level=debug msg="Acquired service ID" backends="[{0  {10.32.0.168 {TCP 53} 0}} {0  {10.32.1.20 {TCP 53} 0}}]" loadBalancerSourceRanges="[]" serviceID=4 serviceIP="{10.35.240.10 {TCP 53} 0}" serviceName=kube-dns serviceNamespace=kube-system sessionAffinity=false sessionAffinityTimeout=0 subsys=service svcHealthCheckNodePort=0 svcTrafficPolicy=Cluster svcType=ClusterIP
level=debug msg="Adding new backend" backendID=4 backends="[{0  {10.32.0.168 {TCP 53} 0}} {0  {10.32.1.20 {TCP 53} 0}}]" l3n4Addr="{10.32.0.168 {TCP 53} 0}" loadBalancerSourceRanges="[]" serviceID=4 serviceIP="{10.35.240.10 {TCP 53} 0}" serviceName=kube-dns serviceNamespace=kube-system sessionAffinity=false sessionAffinityTimeout=0 subsys=service svcHealthCheckNodePort=0 svcTrafficPolicy=Cluster svcType=ClusterIP
level=debug msg="Adding new backend" backendID=5 backends="[{0  {10.32.0.168 {TCP 53} 0}} {0  {10.32.1.20 {TCP 53} 0}}]" l3n4Addr="{10.32.1.20 {TCP 53} 0}" loadBalancerSourceRanges="[]" serviceID=4 serviceIP="{10.35.240.10 {TCP 53} 0}" serviceName=kube-dns serviceNamespace=kube-system sessionAffinity=false sessionAffinityTimeout=0 subsys=service svcHealthCheckNodePort=0 svcTrafficPolicy=Cluster svcType=ClusterIP

So something has changed in Cilium 1.10 that causes the service backends not to be populated when the agent runs in a VM.
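The ports difference between the two "Kubernetes service definition changed" lines can be checked mechanically. Below is a minimal Go sketch that extracts the ports=[...] list from the service="..." field of such a line. The helper name servicePorts and the regular expression are assumptions made for this illustration; they are not part of Cilium.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// portsRe captures the text between "ports=[" and the next "]" in the
// service="..." field of a "Kubernetes service definition changed" line.
var portsRe = regexp.MustCompile(`ports=\[([^\]]*)\]`)

// servicePorts (hypothetical helper) returns the port names logged for a
// service. The boolean is false if the line has no ports=[...] field at all.
func servicePorts(line string) ([]string, bool) {
	m := portsRe.FindStringSubmatch(line)
	if m == nil {
		return nil, false
	}
	// Port names are space-separated in the log output; an empty list
	// yields a slice of length zero.
	return strings.Fields(m[1]), true
}

func main() {
	// Shortened copies of the service="..." fields quoted in this comment.
	lines := []struct{ version, line string }{
		{"1.9.8", `service="frontend:10.35.240.10/ports=[dns-tcp dns]/selector=map[k8s-app:kube-dns]"`},
		{"1.10.1", `service="frontends:[10.35.240.10]/ports=[]/selector=map[k8s-app:kube-dns]"`},
	}
	for _, l := range lines {
		ports, ok := servicePorts(l.line)
		fmt.Printf("%s: ports field present=%t, ports=%q\n", l.version, ok, ports)
	}
}
```

Running this over a full agent log would flag every service whose definition was logged with an empty port list, which is one way to confirm that the problem is not limited to kube-dns.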

@jrajahalme
Member

The first missing log line, "Upserting service", is logged unconditionally by service.UpsertService(). That function is called by watchers.addK8sSVCs(), which in turn is called unconditionally for the UpdateService event type, right after the "Kubernetes service definition changed" log. Assuming the event type is consistent between v1.9.8 and v1.10.1, it must be that watchers.addK8sSVCs() is not calling service.UpsertService() in this case.
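To illustrate how a caller can run unconditionally while the inner upsert never happens, here is a toy Go model. It is not Cilium's code: the types service and upserter and the function addSvc are hypothetical, and the assumption that one upsert is issued per frontend port is made purely for this sketch. The actual cause is addressed in cilium/cilium#16662.

```go
package main

import "fmt"

// service is a hypothetical, heavily simplified stand-in for a parsed
// Kubernetes service. It is not Cilium's service type.
type service struct {
	frontendIP string
	ports      []string
}

// upserter records calls and stands in for service.UpsertService().
type upserter struct {
	calls []string
}

// upsert would be the place where "Upserting service" is logged.
func (u *upserter) upsert(frontend string) {
	u.calls = append(u.calls, frontend)
}

// addSvc models a watcher that (by assumption) issues one upsert per
// frontend port. With an empty port list the loop body never runs, so
// no upsert is issued even though addSvc itself was called.
func addSvc(u *upserter, svc service) {
	for _, p := range svc.ports {
		u.upsert(svc.frontendIP + "/" + p)
	}
}

func main() {
	withPorts := &upserter{}
	addSvc(withPorts, service{frontendIP: "10.35.240.10", ports: []string{"dns-tcp", "dns"}})

	withoutPorts := &upserter{}
	addSvc(withoutPorts, service{frontendIP: "10.35.240.10"})

	fmt.Println("upserts with ports:", len(withPorts.calls))
	fmt.Println("upserts without ports:", len(withoutPorts.calls))
}
```

Under that assumption, a service parsed with ports=[] would produce exactly the observed symptom: the "definition changed" line is logged, but nothing is upserted afterwards.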

@jrajahalme
Member

Fails the same way with Cilium v1.10.0 & v1.10.1.

@jrajahalme
Member

Fix in cilium/cilium#16662

@tklauser
Member Author

Converting to draft until cilium/cilium#16662 is merged, backported and shows up in the next 1.10.x release.

@tklauser tklauser marked this pull request as draft June 28, 2021 07:57
@tklauser tklauser changed the title Bump Cilium version to 1.10.1, Hubble to v0.8.0 Bump Cilium version to 1.10.2, Hubble to v0.8.0 Jul 2, 2021
@tklauser tklauser force-pushed the pr/tklauser/bump-cilium-1.10.1 branch from 44ccd55 to 0e7800a Compare July 2, 2021 15:47
@tklauser tklauser marked this pull request as ready for review July 2, 2021 15:47
@tklauser tklauser temporarily deployed to ci July 2, 2021 15:47 Inactive
@tklauser
Member Author

tklauser commented Jul 2, 2021

Rebased and updated to Cilium 1.10.2, which should fix the external workloads failure.

@tklauser tklauser force-pushed the pr/tklauser/bump-cilium-1.10.1 branch from 0e7800a to 3e7fe7f Compare July 5, 2021 09:29
@tklauser tklauser temporarily deployed to ci July 5, 2021 09:29 Inactive
@tklauser tklauser force-pushed the pr/tklauser/bump-cilium-1.10.1 branch from 3e7fe7f to 3c86261 Compare July 5, 2021 10:17
@tklauser tklauser temporarily deployed to ci July 5, 2021 10:17 Inactive
@tklauser tklauser temporarily deployed to ci July 5, 2021 10:36 Inactive
@tklauser tklauser force-pushed the pr/tklauser/bump-cilium-1.10.1 branch from 593661a to 099d4d4 Compare July 5, 2021 10:43
tklauser added 3 commits July 5, 2021 12:43
This fix is needed in addition to commit 4c77144 ("install/aks: set
'enable-ipv4-masquerade' as false"), otherwise installation on AKS fails
with:

    level=fatal msg="Failed to populate masquerading settings" error="--enable-ipv4-masquerade and --masquerade (deprecated) are mutually exclusive"

Signed-off-by: Tobias Klauser <tobias@cilium.io>
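
The fatal error quoted in the commit message above is a mutual-exclusion check between a new option and its deprecated predecessor. A minimal Go sketch of that kind of check follows, using only the standard flag package. The function parseMasquerade, the default values and the "explicitly set" semantics are assumptions for illustration; this is not Cilium's option handling.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// parseMasquerade (hypothetical) parses the two options and rejects a
// command line that sets both explicitly. Defaults of true are assumed.
func parseMasquerade(args []string) (bool, error) {
	fs := flag.NewFlagSet("agent", flag.ContinueOnError)
	ipv4 := fs.Bool("enable-ipv4-masquerade", true, "masquerade IPv4 traffic")
	legacy := fs.Bool("masquerade", true, "deprecated alias")
	if err := fs.Parse(args); err != nil {
		return false, err
	}

	// Visit only walks flags that were set on the command line, so this
	// distinguishes "left at default" from "explicitly set".
	set := map[string]bool{}
	fs.Visit(func(f *flag.Flag) { set[f.Name] = true })

	if set["enable-ipv4-masquerade"] && set["masquerade"] {
		return false, errors.New("--enable-ipv4-masquerade and --masquerade (deprecated) are mutually exclusive")
	}
	if set["masquerade"] {
		return *legacy, nil
	}
	return *ipv4, nil
}

func main() {
	// Setting only the new option is accepted.
	v, err := parseMasquerade([]string{"--enable-ipv4-masquerade=false"})
	fmt.Println(v, err)

	// Setting both options is rejected, even if the values agree.
	_, err = parseMasquerade([]string{"--enable-ipv4-masquerade=false", "--masquerade=false"})
	fmt.Println(err)
}
```

In this model an installer that still passes the deprecated option alongside the new one fails at startup, which is why both settings have to be adjusted together.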
The cilium and hubble dependencies need to be updated at the same time
since hubble v0.8.0 depends on cilium v1.10.x and hubble v0.7.1 won't
compile with cilium v1.10.x.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser tklauser force-pushed the pr/tklauser/bump-cilium-1.10.1 branch from 099d4d4 to 65cbc02 Compare July 5, 2021 10:44
@tklauser tklauser temporarily deployed to ci July 5, 2021 10:44 Inactive
@tklauser tklauser merged commit e24a31a into master Jul 5, 2021
@tklauser tklauser deleted the pr/tklauser/bump-cilium-1.10.1 branch July 5, 2021 11:47
5 participants