v1.15 Backports 2024-05-24 #32691

YutaroHayakawa · 2024-05-23T16:30:48Z

Once this PR is merged, a GitHub action will update the labels of these PRs:

 32084 32328 32017 32307 32513 32536 32464 32553 32564 32548 32570 32506 32450 32600 32302 32629

giorio94

Thanks! My commit looks good

michi-covalent

ty~!

mhofstetter

Thanks Yutaro! My commits look good.

joestringer

Approving for backport of my PR. I also glanced through my codeowners and nothing obvious seemed out of place (though I would delegate to the original authors if they have specific concerns)

[ upstream commit 715ed4c ] This script would check for CHARTS_PATH and CHARTS_TOOL, but it never actually invoked them. We've renamed the script now so it's out of date. Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit a37fd43 ] Signed-off-by: Michael Mykhaylov <32168861+mikemykhaylov@users.noreply.github.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 57db22b ] Syft appears to offer higher quality results, including being able to identify the versions of individual libraries in golang binaries, which is not something that we were able to do efficiently with `bom`. Syft can also generate the SBOMs in spdx-json format, which was a pending TODO from our initial implementation of SBOM generation. Signed-off-by: Feroz Salam <feroz.salam@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 104a302 ] The clustermesh logic is currently affected by a possible, although rare, race condition occurring if the cluster configuration is being retrieved while the connection to the remote cluster is stopped. Indeed, this operation stops two controllers -- the one handling the connection to the remote cluster and the one responsible for the retrieval of the cluster config. However, this causes the getRemoteCluster function to possibly terminate before the termination of the second controller, in turn leading to a panic due to send on closed channel. Let's fix this issue by explicitly removing only the first controller, and letting the other terminate normally due to the parent context having been terminated. Hence, ensuring that the controller has always terminated before closing the cfgch channel. Fixes: #32179 Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 4aa6a7f ] [ backporter's note: lb_service.go and lb_service_test.go are renamed in the upstream. Move the logic from service.go and service_test.go. ] Filtering out backends which are terminating when creating local endpoint state. This will result in quicker route withdrawal if local backends go into terminating state, without waiting for graceful shutdown period of the pods. Signed-off-by: harsimran pabla <hpabla@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 3989f02 ] Adds a note to the docs that `kube-apiserver` entity is not available for ingress traffic in AKS, EKS and GKE. Signed-off-by: darox <maderdario@gmail.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 580d31b ] It's causing unnecessary flakes. Money is not a concern so let's not bother using spot instances. Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 83e5a9e ] Similar to EKS, money is not a concern here so let's not bother with preemptible instances. Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit d15410d ] [ backporter's note: Adjust the minor cilium-cli invocation conflict ] it's fine to ignore the "No egress gateway found" drop reason as this may be caused by the kind=echo pods sending traffic while the egressgw policy map is still being populated. The actual connectivity test will ensure that the map is in sync with the policy and that egressgw traffic always go through the correct gateway Signed-off-by: Gilberto Bertin <jibi@cilium.io> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit a1e5295 ] We had the value, but forgot to plumb it in. Fixes: #32497 Signed-off-by: Casey Callendrello <cdc@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 9392745 ] Sometimes, the L4LB tests timeout waiting for the docker (in docker) instance to be ready. ``` +[06:55:18] docker run --privileged --name lb-node -d --network cilium-l4lb -v /lib/modules:/lib/modules docker:dind ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 +[06:55:18] docker exec -t lb-node mount bpffs /sys/fs/bpf -t bpf +[06:55:18] docker run --name nginx -d --network cilium-l4lb nginx 544abbd0503584c582da50480d5f96eae5ecadb426d48d6fee078558b45451b2 +[06:55:18] docker exec -t lb-node docker ps +[06:55:18] sleep 1 +[06:55:19] docker exec -t lb-node docker ps +[06:55:19] sleep 1 +[06:55:20] docker exec -t lb-node docker ps Error response from daemon: Container ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 is not running +[06:55:20] sleep 1 +[06:55:21] docker exec -t lb-node docker ps Error response from daemon: Container ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 is not running +[06:55:21] sleep 1 +[06:55:22] docker exec -t lb-node docker ps Error response from daemon: Container ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 is not running +[06:55:22] sleep 1 +[06:55:23] docker exec -t lb-node docker ps Error response from daemon: Container ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 is not running +[06:55:20] docker exec -t lb-node docker ps Error response from daemon: Container ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 is not running ``` Unfortunately, fetching the LB logs after the failed test doesn't help either, as this fails with the same error. ``` Run docker exec -t lb-node docker logs cilium-lb docker exec -t lb-node docker logs cilium-lb ... Error response from daemon: Container ca31f2a72a098bf612d569fee976e7e17779a1546337748dc62b63c99da13271 is not running ``` Therefore, this commit adds an additional job step that fetches the status and logs of the docker instance itself. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 2b31c34 ] The index used for ipv4PrivateIndex/ipv4PublicIndex was wrong in cases where the address was skipped. Use the proper index to which the address is appended. Fixes: 7da6514 ("tables: Sort node addresses also by public vs private IP") Signed-off-by: Jussi Maki <jussi@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 485b405 ] No functional change, just to make further revision easier. Signed-off-by: gray <gray.liang@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 9a68ec2 ] This patch introduces a workaround to avoid kernel issue when deleting xfrm states. Let's start from the kernel issue. After installing two xfrm states on the same host using below commands (please note the differences on the mark, mask, src): ``` ip x s a src 10.244.1.43 dst 10.244.3.114 proto esp spi 0x00000003 reqid 1 mode tunnel replay-window 0 mark 0x2a450d00 mask 0xffff0f00 output-mark 0xd00 mask 0xffffff00 aead 'rfc4106(gcm(aes))' 0x42a89014074c49243219a20cc87cadd7b9c0d7d1 128 sel src 0.0.0.0/0 dst 0.0.0.0/0 ip x s a src 0.0.0.0 dst 10.244.3.114 proto esp spi 0x00000003 reqid 1 mode tunnel replay-window 0 mark 0xd00 mask 0xf00 output-mark 0xd00 mask 0xffffff00 aead 'rfc4106(gcm(aes))' 0x42a89014074c49243219a20cc87cadd7b9c0d7d1 128 sel src 0.0.0.0/0 dst 0.0.0.0/0 ``` When trying to delete the first xfrm state using the following command, Linux kernel will instead remove the second xfrm state but keep the first: ``` ip x s d src 10.244.1.43 dst 10.244.3.114 proto esp spi 0x00000003 mark 0x2a450d00 mask 0xffff0f00 ``` This causes troubles for cilium upgrade. A real world scenario for cilium upgrade from 1.13.12 to 1.13.14 could be like: 1. Before upgrade, the node has "old-style" xfrm state to catch mark "0xd00/0xf00" for ingress traffic; old bpf programs also set "0xd00" mark to ingress skbs; 2. Upgrade begins, bpf programs are reloaded to new version, thereafter ingress skbs are marked with "0xXXXX0d00"; 3. After a short while, cilium-agent installs new xfrm states to catch traffic with specific mark "0xXXXX0d00"; During window between step 2 and 3, cilium relies on "old-style" xfrm states "0xd00/0xf00" to catch traffic with specific mark "0xXXXX0d00". So far so good. However, in a large scale cluster it's inevitable to receive NodeDeletion events during upgrade due to node churn. Once seeing a NodeDeletion event, cilium-agent will remove the xfrm state for that gone-away remote node. Now we hit the aforementioned kernel issue: cilium-agent tries to delete the xfrm state catching more specific mark, but kernel wrongly removes the one catching general mark. This causing traffic disruption until upgrade completes with all new xfrm states installed. This patch provides an elegant solution at low cost: if cilium-agent wants to remove a xfrm state catching specific mark, it has to temporarily remove the xfrm state catching general mark first and add it back after: 1. Temporarily remove the xfrm states catching the general mark; 2. Remove the xfrm state we really care abot; 3. Add back the temporaily removed one on step 1; Indeed there will be a small window between temporary removing and adding back, but our past test shows the window lasts 200-900µs only, so short that we shoudn't see many drops. Suggested-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: gray <gray.liang@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit a9a5ca7 ] [ backporter's note: The code structure of test/nat46x64/test.sh are different between main and v1.15. Take the v1.15's structure and apply restarting parameter change on top of it. ] Docker in Docker container used within L4LB tests occasionally fails to start due to a `sed: write error`. log output: ``` Certificate request self-signature ok /certs/server/cert.pem: OK subject=CN = docker:dind server Certificate request self-signature ok subject=CN = docker:dind client cat: can't open '/proc/net/arp_tables_names': No such file or directory sed: write error /certs/client/cert.pem: OK iptables v1.8.10 (nf_tables) ``` To prevent this error from causing the entire test to fail, this commit tries to fix this by restarting the container in case of a failure up to 10 times. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 3f995c4 ] This commit enhances the "fetch dind information" GH action step from the L4LB test to output all containers (including stopped ones) and details about the lb-node container. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit dd947b3 ] Whenever GKE stopped supporting a particular version of GKE, we had to manually remove it from all stable branches. Now instead of that, we will dynamically check if it's supported and only then run the test. Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit ae31ee9 ] [ backporter's comment: Scale tests don't exist in the v1.15 branch. Exclude the changes for them. ] Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 87119e9 ] Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit f8afceb ] The result of running ``` images/scripts/update-cni-version.sh 1.5.0 ``` Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

[ upstream commit 62117e7 ] [ backporter's note: Executed scripts/update-cilium-runtime-image.sh ] Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

YutaroHayakawa · 2024-05-24T15:44:54Z

Rebased on top of the latest v1.15 to pull-in #32707.

YutaroHayakawa · 2024-05-24T16:06:09Z

/test-backport-1.15

marseel

Looks like GKE tests work now, thanks!

YutaroHayakawa added kind/backports This PR provides functionality previously merged into master. backport/1.15 This PR represents a backport for Cilium 1.15.x of a PR that was merged to main. labels May 23, 2024

YutaroHayakawa requested review from joestringer, squeed, jschwinger233, ferozsalam, giorio94, michi-covalent, joamaki, pchaigno, jibi, marseel, harsimran-pabla, darox and mhofstetter May 23, 2024 16:30

YutaroHayakawa had a problem deploying to release-base-images May 23, 2024 16:30 — with GitHub Actions Error

giorio94 approved these changes May 23, 2024

View reviewed changes

michi-covalent approved these changes May 23, 2024

View reviewed changes

mhofstetter approved these changes May 23, 2024

View reviewed changes

ferozsalam approved these changes May 23, 2024

View reviewed changes

joestringer approved these changes May 23, 2024

View reviewed changes

harsimran-pabla approved these changes May 23, 2024

View reviewed changes

auto-committer bot temporarily deployed to release-base-images May 23, 2024 16:48 Inactive

marseel approved these changes May 23, 2024

View reviewed changes

YutaroHayakawa force-pushed the pr/v1.15-backport-2024-05-24-12-27 branch from 42f0011 to 8368e1e Compare May 24, 2024 01:02

YutaroHayakawa had a problem deploying to release-base-images May 24, 2024 01:02 — with GitHub Actions Error

auto-committer bot had a problem deploying to release-base-images May 24, 2024 01:18 Error

YutaroHayakawa force-pushed the pr/v1.15-backport-2024-05-24-12-27 branch from e6114df to 93fab9b Compare May 24, 2024 01:53

YutaroHayakawa temporarily deployed to release-base-images May 24, 2024 01:53 — with GitHub Actions Inactive

jschwinger233 approved these changes May 24, 2024

View reviewed changes

joestringer and others added 21 commits May 25, 2024 00:43

Fix incorrect PromQL query

4efa879

[ upstream commit a37fd43 ] Signed-off-by: Michael Mykhaylov <32168861+mikemykhaylov@users.noreply.github.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

eks: Don't use spot instances

517ded0

[ upstream commit 580d31b ] It's causing unnecessary flakes. Money is not a concern so let's not bother using spot instances. Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

gke: Don't use preemptible instances

27f0ac6

[ upstream commit 83e5a9e ] Similar to EKS, money is not a concern here so let's not bother with preemptible instances. Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

helm: add nodeSelector to preflight DaemonSet

6da2e12

[ upstream commit a1e5295 ] We had the value, but forgot to plumb it in. Fixes: #32497 Signed-off-by: Casey Callendrello <cdc@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

ipsec: Minor refactor for ipsecDeleteXfrmState

00f4877

[ upstream commit 485b405 ] No functional change, just to make further revision easier. Signed-off-by: gray <gray.liang@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

ci: add renovate for gcloud cli

1bdf29e

[ upstream commit ae31ee9 ] [ backporter's comment: Scale tests don't exist in the v1.15 branch. Exclude the changes for them. ] Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

ci: add notice message to skipped k8s versions

347c492

[ upstream commit 87119e9 ] Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

images: bump cni plugins to v1.5.0

d59e596

[ upstream commit f8afceb ] The result of running ``` images/scripts/update-cni-version.sh 1.5.0 ``` Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

images: update cilium-{runtime,builder}

b17ef43

[ upstream commit 62117e7 ] [ backporter's note: Executed scripts/update-cilium-runtime-image.sh ] Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com> Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

YutaroHayakawa force-pushed the pr/v1.15-backport-2024-05-24-12-27 branch from 5f74d58 to b17ef43 Compare May 24, 2024 15:43

YutaroHayakawa temporarily deployed to release-base-images May 24, 2024 15:43 — with GitHub Actions Inactive

marseel approved these changes May 24, 2024

View reviewed changes

YutaroHayakawa merged commit 240163b into v1.15 May 25, 2024

YutaroHayakawa deleted the pr/v1.15-backport-2024-05-24-12-27 branch May 25, 2024 01:54

YutaroHayakawa mentioned this pull request May 27, 2024

v1.15: Revert the change to the golang image version of hubble-relay #32732

Merged

michi-covalent mentioned this pull request Jun 8, 2024

Prepare for release v1.15.6 #32976

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v1.15 Backports 2024-05-24 #32691

v1.15 Backports 2024-05-24 #32691

Uh oh!

YutaroHayakawa commented May 23, 2024 •

edited

Loading

Uh oh!

giorio94 left a comment

Uh oh!

michi-covalent left a comment

Uh oh!

mhofstetter left a comment

Uh oh!

joestringer left a comment

Uh oh!

YutaroHayakawa commented May 24, 2024

Uh oh!

YutaroHayakawa commented May 24, 2024

Uh oh!

marseel left a comment

Uh oh!

Uh oh!

v1.15 Backports 2024-05-24 #32691

v1.15 Backports 2024-05-24 #32691

Uh oh!

Conversation

YutaroHayakawa commented May 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

giorio94 left a comment

Choose a reason for hiding this comment

Uh oh!

michi-covalent left a comment

Choose a reason for hiding this comment

Uh oh!

mhofstetter left a comment

Choose a reason for hiding this comment

Uh oh!

joestringer left a comment

Choose a reason for hiding this comment

Uh oh!

YutaroHayakawa commented May 24, 2024

Uh oh!

YutaroHayakawa commented May 24, 2024

Uh oh!

marseel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

YutaroHayakawa commented May 23, 2024 •

edited

Loading