-
Notifications
You must be signed in to change notification settings - Fork 3.4k
v1.15 Backports 2024-05-24 #32691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.15 Backports 2024-05-24 #32691
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! My commit looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ty~!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Yutaro! My commits look good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving for backport of my PR. I also glanced through my codeowners and nothing obvious seemed out of place (though I would delegate to the original authors if they have specific concerns)
42f0011
to
8368e1e
Compare
e6114df
to
93fab9b
Compare
[ upstream commit 715ed4c ] This script would check for CHARTS_PATH and CHARTS_TOOL, but it never actually invoked them. We've renamed the script now so it's out of date. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit a37fd43 ] Signed-off-by: Michael Mykhaylov <32168861+mikemykhaylov@users.noreply.github.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 57db22b ] Syft appears to offer higher quality results, including being able to identify the versions of individual libraries in golang binaries, which is not something that we were able to do efficiently with `bom`. Syft can also generate the SBOMs in spdx-json format, which was a pending TODO from our initial implementation of SBOM generation. Signed-off-by: Feroz Salam <feroz.salam@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 104a302 ] The clustermesh logic is currently affected by a possible, although rare, race condition occurring if the cluster configuration is being retrieved while the connection to the remote cluster is stopped. Indeed, this operation stops two controllers -- the one handling the connection to the remote cluster and the one responsible for the retrieval of the cluster config. However, this causes the getRemoteCluster function to possibly terminate before the termination of the second controller, in turn leading to a panic due to send on closed channel. Let's fix this issue by explicitly removing only the first controller, and letting the other terminate normally due to the parent context having been terminated. Hence, ensuring that the controller has always terminated before closing the cfgch channel. Fixes: #32179 Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 4aa6a7f ] [ backporter's note: lb_service.go and lb_service_test.go are renamed in the upstream. Move the logic from service.go and service_test.go. ] Filtering out backends which are terminating when creating local endpoint state. This will result in quicker route withdrawal if local backends go into terminating state, without waiting for graceful shutdown period of the pods. Signed-off-by: harsimran pabla <hpabla@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 3989f02 ] Adds a note to the docs that `kube-apiserver` entity is not available for ingress traffic in AKS, EKS and GKE. Signed-off-by: darox <maderdario@gmail.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 580d31b ] It's causing unnecessary flakes. Money is not a concern so let's not bother using spot instances. Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 83e5a9e ] Similar to EKS, money is not a concern here so let's not bother with preemptible instances. Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit d15410d ] [ backporter's note: Adjust the minor cilium-cli invocation conflict ] it's fine to ignore the "No egress gateway found" drop reason as this may be caused by the kind=echo pods sending traffic while the egressgw policy map is still being populated. The actual connectivity test will ensure that the map is in sync with the policy and that egressgw traffic always go through the correct gateway Signed-off-by: Gilberto Bertin <jibi@cilium.io> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 9392745 ] Sometimes, the L4LB tests timeout waiting for the docker (in docker) instance to be ready. ``` +[06:55:18] docker run --privileged --name lb-node -d --network cilium-l4lb -v /lib/modules:/lib/modules docker:dind ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 +[06:55:18] docker exec -t lb-node mount bpffs /sys/fs/bpf -t bpf +[06:55:18] docker run --name nginx -d --network cilium-l4lb nginx 544abbd0503584c582da50480d5f96eae5ecadb426d48d6fee078558b45451b2 +[06:55:18] docker exec -t lb-node docker ps +[06:55:18] sleep 1 +[06:55:19] docker exec -t lb-node docker ps +[06:55:19] sleep 1 +[06:55:20] docker exec -t lb-node docker ps Error response from daemon: Container ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 is not running +[06:55:20] sleep 1 +[06:55:21] docker exec -t lb-node docker ps Error response from daemon: Container ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 is not running +[06:55:21] sleep 1 +[06:55:22] docker exec -t lb-node docker ps Error response from daemon: Container ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 is not running +[06:55:22] sleep 1 +[06:55:23] docker exec -t lb-node docker ps Error response from daemon: Container ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 is not running +[06:55:20] docker exec -t lb-node docker ps Error response from daemon: Container ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 is not running ``` Unfortunately, fetching the LB logs after the failed test doesn't help either, as this fails with the same error. ``` Run docker exec -t lb-node docker logs cilium-lb docker exec -t lb-node docker logs cilium-lb ... Error response from daemon: Container ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 is not running ``` Therefore, this commit adds an additional job step that fetches the status and logs of the docker instance itself. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 2b31c34 ] The index used for ipv4PrivateIndex/ipv4PublicIndex was wrong in cases where the address was skipped. Use the proper index to which the address is appended. Fixes: 7da6514 ("tables: Sort node addresses also by public vs private IP") Signed-off-by: Jussi Maki <jussi@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 485b405 ] No functional change, just to make further revision easier. Signed-off-by: gray <gray.liang@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 9a68ec2 ] This patch introduces a workaround to avoid kernel issue when deleting xfrm states. Let's start from the kernel issue. After installing two xfrm states on the same host using below commands (please note the differences on the mark, mask, src): ``` ip x s a src 10.244.1.43 dst 10.244.3.114 proto esp spi 0x00000003 reqid 1 mode tunnel replay-window 0 mark 0x2a450d00 mask 0xffff0f00 output-mark 0xd00 mask 0xffffff00 aead 'rfc4106(gcm(aes))' 0x42a89014074c49243219a20cc87cadd7b9c0d7d1 128 sel src 0.0.0.0/0 dst 0.0.0.0/0 ip x s a src 0.0.0.0 dst 10.244.3.114 proto esp spi 0x00000003 reqid 1 mode tunnel replay-window 0 mark 0xd00 mask 0xf00 output-mark 0xd00 mask 0xffffff00 aead 'rfc4106(gcm(aes))' 0x42a89014074c49243219a20cc87cadd7b9c0d7d1 128 sel src 0.0.0.0/0 dst 0.0.0.0/0 ``` When trying to delete the first xfrm state using the following command, Linux kernel will instead remove the second xfrm state but keep the first: ``` ip x s d src 10.244.1.43 dst 10.244.3.114 proto esp spi 0x00000003 mark 0x2a450d00 mask 0xffff0f00 ``` This causes troubles for cilium upgrade. A real world scenario for cilium upgrade from 1.13.12 to 1.13.14 could be like: 1. Before upgrade, the node has "old-style" xfrm state to catch mark "0xd00/0xf00" for ingress traffic; old bpf programs also set "0xd00" mark to ingress skbs; 2. Upgrade begins, bpf programs are reloaded to new version, thereafter ingress skbs are marked with "0xXXXX0d00"; 3. After a short while, cilium-agent installs new xfrm states to catch traffic with specific mark "0xXXXX0d00"; During window between step 2 and 3, cilium relies on "old-style" xfrm states "0xd00/0xf00" to catch traffic with specific mark "0xXXXX0d00". So far so good. However, in a large scale cluster it's inevitable to receive NodeDeletion events during upgrade due to node churn. Once seeing a NodeDeletion event, cilium-agent will remove the xfrm state for that gone-away remote node. Now we hit the aforementioned kernel issue: cilium-agent tries to delete the xfrm state catching more specific mark, but kernel wrongly removes the one catching general mark. This causing traffic disruption until upgrade completes with all new xfrm states installed. This patch provides an elegant solution at low cost: if cilium-agent wants to remove a xfrm state catching specific mark, it has to temporarily remove the xfrm state catching general mark first and add it back after: 1. Temporarily remove the xfrm states catching the general mark; 2. Remove the xfrm state we really care abot; 3. Add back the temporaily removed one on step 1; Indeed there will be a small window between temporary removing and adding back, but our past test shows the window lasts 200-900µs only, so short that we shoudn't see many drops. Suggested-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: gray <gray.liang@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit a9a5ca7 ] [ backporter's note: The code structure of test/nat46x64/test.sh are different between main and v1.15. Take the v1.15's structure and apply restarting parameter change on top of it. ] Docker in Docker container used within L4LB tests occasionally fails to start due to a `sed: write error`. log output: ``` Certificate request self-signature ok /certs/server/cert.pem: OK subject=CN = docker:dind server Certificate request self-signature ok subject=CN = docker:dind client cat: can't open '/proc/net/arp_tables_names': No such file or directory sed: write error /certs/client/cert.pem: OK iptables v1.8.10 (nf_tables) ``` To prevent this error from causing the entire test to fail, this commit tries to fix this by restarting the container in case of a failure up to 10 times. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 3f995c4 ] This commit enhances the "fetch dind information" GH action step from the L4LB test to output all containers (including stopped ones) and details about the lb-node container. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit dd947b3 ] Whenever GKE stopped supporting a particular version of GKE, we had to manually remove it from all stable branches. Now instead of that, we will dynamically check if it's supported and only then run the test. Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit ae31ee9 ] [ backporter's comment: Scale tests don't exist in the v1.15 branch. Exclude the changes for them. ] Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 87119e9 ] Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit f8afceb ] The result of running ``` images/scripts/update-cni-version.sh 1.5.0 ``` Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
[ upstream commit 62117e7 ] [ backporter's note: Executed scripts/update-cilium-runtime-image.sh ] Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
5f74d58
to
b17ef43
Compare
Rebased on top of the latest v1.15 to pull-in #32707. |
/test-backport-1.15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like GKE tests work now, thanks!
Once this PR is merged, a GitHub action will update the labels of these PRs: