Description
As pointed out by @aanm, the Upgrade test is failing on the k8s 1.1{3,4} jobs, which run with kube-proxy and the 4.9 kernel:
/home/jenkins/workspace/cilium-master-K8s-all/1.13-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:514
migrate-svc restart count values do not match
Expected
<int>: 0
to be identical to
<int>: 9
/home/jenkins/workspace/cilium-master-K8s-all/1.13-gopath/src/github.com/cilium/cilium/test/k8sT/Updates.go:520
Currently, the Upgrade test is disabled on the k8s 1.19 job, which runs the 4.9 kernel. Therefore, we didn't see the failure in PR #12628.
First of all, why didn't the test fail on the net-next job? The explanation is the following: the job runs w/o kube-proxy, and the E-W loadbalancing is done via bpf_sock (migrate-svc is a ClusterIP svc, thus it's handled by bpf_sock). This means that the service xlation happens only once for TCP and connected UDP. Therefore, the xlation happened before the upgrade, and none of the changes to the LB maps and the datapath programs affected the established connections to the svc.
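To illustrate why connect-time xlation shields established connections, here is a toy Python model. It is only a sketch: `SocketLB`, `connect()`, `send()` and the addresses are hypothetical and do not mirror the actual bpf_sock code.

```python
# Toy model only: all names and addresses are hypothetical,
# not taken from Cilium's BPF sources.

class SocketLB:
    """Models socket-level load balancing: translate once, at connect()."""

    def __init__(self, services):
        # frontend (ip, port) -> backend (ip, port)
        self.services = services

    def connect(self, frontend):
        # The service lookup happens here, exactly once per connection.
        backend = self.services[frontend]
        return {"peer": backend}


def send(conn):
    # Later packets use the already-translated peer address;
    # the service map is not consulted again.
    return conn["peer"]


services = {("10.96.0.10", 80): ("10.0.1.5", 8080)}
lb = SocketLB(services)

conn = lb.connect(("10.96.0.10", 80))  # established before the upgrade
services.clear()                       # simulate the LB map changing during the upgrade
result = send(conn)                    # still reaches the original backend
```

In this model the established connection keeps working no matter what happens to the map afterwards, which matches the behaviour seen on net-next.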
On 4.9 the ClusterIP svc xlation for non-host netns is handled by bpf_lxc. What happens there is that we first call lb4_lookup_service(), and only then lb4_local(), which does the lookup in the CT map. During the upgrade we first update the LB maps, and only then reload the datapath programs. This means that the old programs won't be able to find the svc when calling lb4_lookup_service() (because they do not set the proto field).
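The ordering problem can be sketched with a toy Python model. It assumes, for illustration only, that the proto field is part of the service lookup key; the map layout, `handle_packet()` and the addresses are hypothetical, and only the order of the two lookups is taken from the description above.

```python
# Toy model only: map layout, names and addresses are hypothetical.
TCP = 6


def handle_packet(svc_map, ct_map, conn_id, frontend, proto):
    # Step 1: the service lookup comes first (stands in for lb4_lookup_service()).
    svc = svc_map.get((*frontend, proto))
    if svc is None:
        return None  # lookup failed; the CT map is never consulted
    # Step 2: the CT lookup (stands in for lb4_local()).
    return ct_map.get(conn_id, svc)


frontend = ("10.96.0.10", 80)
backend = ("10.0.1.5", 8080)
ct_map = {"conn-1": backend}  # established connection

# Before the upgrade: old map entries, old program leaves proto unset (0).
old_map = {(*frontend, 0): backend}
before = handle_packet(old_map, ct_map, "conn-1", frontend, 0)

# During the upgrade: maps already updated, old program still loaded.
new_map = {(*frontend, TCP): backend}
during = handle_packet(new_map, ct_map, "conn-1", frontend, 0)

# After the datapath reload: the new program sets proto.
after = handle_packet(new_map, ct_map, "conn-1", frontend, TCP)
```

Here `during` is `None` even though the CT entry for the connection still exists, because the service lookup fails before the CT map is reached.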
The issue would have been spotted on net-next if we had changed migrate-svc to NodePort and run the requests from a node which is not managed by cilium (aka k8s3). Also, when doing the review, we overlooked the fact that the CT lookup happens after lb4_lookup_service().