v1.18 Backports 2025-08-19 #41267

joamaki · 2025-08-19T11:32:28Z

Once this PR is merged, a GitHub action will update the labels of these PRs:

 41007 41085 41098 41102 41079 40568 41092 41122 41147 41120 40340 41088 41171 41010 41110 41148 41039 41107 41240

[ upstream commit ebfc395 ] As of today, we identify privileged tests with the "TestPrivileged" prefix in their name. However, that doesn't hold true for benchmarks requiring privileges, which are always skipped given their prefix doesn't match "TestPrivileged". This commit patches our current logic to introduce a new "BenchmarkPrivileged" prefix to identify such benchmarks requiring privileged access. Signed-off-by: Simone Magnani <simone.magnani@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 7275a86 ] This commit adjusts the prefix of all benchmarks requiring privileged access to the new "BenchmarkPrivileged" prefix. Signed-off-by: Simone Magnani <simone.magnani@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
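The prefix-based identification described in the two commits above can be pictured with a minimal sketch. It is not the code from the PR: the `privilegedPrefixes` slice and the `isPrivilegedName` helper are hypothetical names chosen for illustration, and only the two prefixes themselves come from the commit messages.

```go
package main

import (
	"fmt"
	"strings"
)

// privilegedPrefixes lists the name prefixes that mark a test or benchmark
// as requiring privileges. The slice itself is a hypothetical illustration.
var privilegedPrefixes = []string{"TestPrivileged", "BenchmarkPrivileged"}

// isPrivilegedName reports whether name starts with one of the prefixes.
func isPrivilegedName(name string) bool {
	for _, p := range privilegedPrefixes {
		if strings.HasPrefix(name, p) {
			return true
		}
	}
	return false
}

func main() {
	names := []string{"TestPrivilegedFoo", "BenchmarkPrivilegedBar", "BenchmarkBaz"}
	for _, n := range names {
		fmt.Printf("%s privileged=%v\n", n, isPrivilegedName(n))
	}
}
```

With only the "TestPrivileged" prefix in the list, a benchmark could never match, which is the gap the first commit describes.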

[ upstream commit 0e03851 ] Wait for the prune to actually happen in the lb/prune command to make tests that e.g. do BPF state restoration more reliable as then we won't have a prune racing in the background. Update migrate-any-proto.txtar to call lb/prune before restoration to avoid a race. Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit c123118 ] While the StateDB reconciler never calls the Update/Delete/Prune concurrently, we do want to be able to do BPFOps.ResetAndRestore from a test script to clear out the state. Since [sync.Mutex.Lock] is very cheap on an unlocked mutex, add a mutex around the BPFOps state so that we can inspect and manipulate it safely from tests and avoid very odd failures. Signed-off-by: Jussi Maki <jussi@isovalent.com>
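As a rough illustration of the pattern this commit describes, the sketch below guards a state map with a `sync.Mutex` so that a test can reset it while other callers update it. The `ops` type and its fields are hypothetical stand-ins, not the real BPFOps structure; only the method name `ResetAndRestore` appears in the commit message, and its signature here is assumed.

```go
package main

import (
	"fmt"
	"sync"
)

// ops is a hypothetical stand-in for a reconciler target holding state.
type ops struct {
	mu    sync.Mutex
	state map[string]int
}

func newOps() *ops { return &ops{state: map[string]int{}} }

// Update records a value while holding the lock.
func (o *ops) Update(key string, v int) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.state[key] = v
}

// ResetAndRestore replaces the whole state with a copy of restored.
// The signature is an assumption made for this sketch.
func (o *ops) ResetAndRestore(restored map[string]int) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.state = make(map[string]int, len(restored))
	for k, v := range restored {
		o.state[k] = v
	}
}

// Len returns the number of tracked entries.
func (o *ops) Len() int {
	o.mu.Lock()
	defer o.mu.Unlock()
	return len(o.state)
}

func main() {
	o := newOps()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			o.Update(fmt.Sprintf("k%d", i), i)
		}(i)
	}
	wg.Wait()
	o.ResetAndRestore(map[string]int{"restored": 1})
	fmt.Println(o.Len())
}
```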

[ upstream commit d8d0b98 ] This had changed when client-go was updated and this was causing false positive goroutine leak failures. Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 5b8127a ] The backends table wasn't checked after service and endpoint slice removal, so the endpoints were sometimes added back before the deletions were processed, which led to re-use of old IDs. Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 765ee79 ] When Kubelet gets started with --cloud-provider=external, new Nodes get a taint. CCM picks up these new nodes and then sets the ProviderID. But before CCM can start on the first control-plane of a new cluster, the CNI must be running. This means Cilium Operator needs a toleration for that taint. Related: https://app.slack.com/client/T1MATJ4SZ/C53TG4J4R In Cilium v1.17 the Cilium Operator had a toleration for all taints. This was changed in PR #40475. This PR extends the list of tolerations. Fixes: aa9a24c (Change the default taints that Cilium tolerates to avoid deploying to a drained node) Signed-off-by: Thomas Guettler <thomas.guettler@syself.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit bbf32d5 ] Since c27d53f we use 30 GB disk size for LVH images in ci-runtime, so we can re-enable go caches for privileged tests. Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit b3ba248 ] The /tmp filesystem, which gets 50% of the RAM size, is running out of space in some test runs. Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit a32a6c8 ]

This commit fixes a nil pointer dereference in cilium-agent. The segfault results from a race condition while processing host endpoint identity label updates. When host endpoint identity labels are updated, we first delete the existing entry from the policy cache and then trigger an async endpoint regeneration to populate the cache again with the resolved selector policy from the repository. During endpoint policy regeneration, policy resolution happens in two steps:

1. Look up or create the cached selector policy entry in the policy cache.
2. Resolve the selector policy for the endpoint and set it in the corresponding cached selector policy.

Currently, when a cachedSelectorPolicy entry is deleted from the policy cache, we assume that the underlying selector policy is set. If two host endpoint identity label updates are close together, the policy cache delete operation might happen between the two steps above, leading to a nil pointer dereference on the underlying selector policy. This commit makes sure that the underlying selector policy is not nil before attempting to detach. This behavior is now consistent with how `policyCache.lookupOrCreate` creates the entry in the cache, where the underlying selector policy is nil unless set explicitly.

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
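The shape of this fix can be sketched in a few lines. Apart from `lookupOrCreate` and `cachedSelectorPolicy`, which the commit message mentions, every name below (`selectorPolicy`, `detach`, `remove`, the string key) is a simplified assumption and not the actual Cilium code.

```go
package main

import "fmt"

// selectorPolicy stands in for the resolved policy (hypothetical type).
type selectorPolicy struct{ detached bool }

func (p *selectorPolicy) detach() { p.detached = true }

// cachedSelectorPolicy holds a policy that stays nil until it is resolved.
type cachedSelectorPolicy struct {
	policy *selectorPolicy
}

type policyCache struct {
	policies map[string]*cachedSelectorPolicy
}

// lookupOrCreate returns the entry for id, creating it with a nil policy.
func (c *policyCache) lookupOrCreate(id string) *cachedSelectorPolicy {
	if e, ok := c.policies[id]; ok {
		return e
	}
	e := &cachedSelectorPolicy{}
	c.policies[id] = e
	return e
}

// remove drops the entry and detaches its policy only if one was set.
func (c *policyCache) remove(id string) bool {
	e, ok := c.policies[id]
	if !ok {
		return false
	}
	delete(c.policies, id)
	if e.policy != nil { // the entry may not have been resolved yet
		e.policy.detach()
	}
	return true
}

func main() {
	c := &policyCache{policies: map[string]*cachedSelectorPolicy{}}
	c.lookupOrCreate("host")      // step 1 only; the policy is still nil
	fmt.Println(c.remove("host")) // safe even though step 2 never ran
}
```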

[ upstream commit d2626ae ] This commit updates the Cilium OpenAPI spec to include HTTP response code 503 (Service Unavailable) in the `responses` list for mutating APIs in the endpoint subsystem. Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 1c193c0 ] This commit adds CNI endpoint delete handling for scenarios where the cilium-agent server is up but unavailable to serve the request (e.g. during state restore after restart). With this patch, cilium-cni will persist endpoint delete requests to the offline queue if the API server responds with the ServiceUnavailable (503) HTTP code. Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 130d0e9 ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit e50b821 ]

This commit simplifies the DeletionFallbackClient usage and handling of endpoint deletion requests. With this patch the constructor now returns the client object without performing any initialization. The `EndpointDeleteMany` method is directly used by callers to request endpoint deletion which is processed according to below flow:

```mermaid
flowchart TD
    A[DeletionFallbackClient] -->|EndpointDeleteMany| B[Get Or Connect Cilium API]
    B -->|Success| D[Request Endpoint Delete]
    B -->|Failure| E{DeletionQueue Lock}
    D -->|Success| OK{Return OK}
    D -->|Failure| E
    E -->|Acquired| F[Persist Deletion Request]
    E -->|NotAcquired| L[AcquireLock]
    L -->|Success| B
    L -->|Failed| NotOK
    F -->|Success| OK{Return OK}
    F -->|Failed| NotOK{Return Error}
```

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 3feec70 ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 56abeb9 ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit c4870be ]

This commit adds a readiness precheck to certain endpoint subsystem APIs like Put, Delete, Patch. This check ensures that endpoint APIs are not exposed to external components like CNI until all dependencies are ready. Currently this includes:

1. Endpoint Delete
   * Endpoint Delete APIs are only exposed once the state restore is complete. This ensures that all delete operations always see the full state of the endpoint manager, so as to avoid any missed deletes.
2. Endpoint Put/Patch
   * Endpoint Update APIs are guarded by the DeletionQueue fence. These APIs are only exposed once the offline deletion queue is processed by the agent. Since the processing of endpoint Delete and Add operations is async, this check ensures that we don't delete an active endpoint if there is a delay in replaying the offline delete queue.

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 3377927 ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 7eca61f ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit e286198 ] This commit changes how other components (daemon) wait on initial policy computation for an endpoint. Instead of directly relying on the `InitialEnvoyPolicyComputed` channel in an endpoint object, we now expose a method for existing callers to block on initial policy computation. The `WaitForInitialPolicy` method on the endpoint object waits for either the initial policy to be computed or the endpoint to be deleted. This makes sure that callers don't wait on policy computation indefinitely in cases where the endpoint is deleted before initial regeneration is completed. This fixes an issue in Cilium during endpoint restore where a restored endpoint is deleted before regeneration when processing the offline deletion queue. Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 114cacd ] [ backporter's note: minor fixes due to 0edfec1 not present in v1.18 ] The current loadbalancer controlplane always initializes new `BackendParams` with the default value `Unhealthy=false`. As a result, backend selection for a given service does not account for service health check implementations that should report the backend health state before that service backend is exposed to other modules via the statedb `frontend`. Therefore, this commit introduces a new hook `SetIsServiceHealthCheckedFunc` on the loadbalancer `Writer`. Service health checking modules can use this hook to mark a given service as health checked. Backend selection for a service will use that hook and only include a backend in a health-checked service frontend if the health state for the backend has already been reported once. This prevents prematurely exposing unhealthy backends (or their incorrect health status) to other modules. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
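The selection rule can be illustrated with a small filter. This is a sketch under assumptions, not the loadbalancer code: the `backend` type, its `healthReported` field, and the `selectBackends` function are hypothetical, and the real hook `SetIsServiceHealthCheckedFunc` is reduced here to a plain boolean argument.

```go
package main

import "fmt"

// backend is a simplified, hypothetical view of a service backend.
type backend struct {
	addr           string
	healthReported bool // set once a health checker has reported any state
}

// selectBackends returns the addresses that may be exposed on a frontend.
// For a health-checked service, backends without a reported health state
// are held back; for other services every backend is included.
func selectBackends(backends []backend, serviceHealthChecked bool) []string {
	var out []string
	for _, b := range backends {
		if serviceHealthChecked && !b.healthReported {
			continue // wait for the first health report
		}
		out = append(out, b.addr)
	}
	return out
}

func main() {
	bes := []backend{
		{addr: "10.0.0.1:80", healthReported: true},
		{addr: "10.0.0.2:80", healthReported: false},
	}
	fmt.Println(selectBackends(bes, true))
	fmt.Println(selectBackends(bes, false))
}
```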

[ upstream commit 49b33da ] Previously, whenever a CIDR range was extended, we removed all allocations from the existing ranges, created new ranges, and tried reassigning IPs to the new ranges. We already tried to reuse the same IP; however, because we first tried to assign IPs to "unsatisfied" services, they could steal existing IPs from other services, resulting in reallocation of the IP of an already "satisfied" service and in a temporary state with two different services having the same IP. The same issue could happen when the selector of a pool was modified to select new unsatisfied services. Note that this does not solve the case where a CIDR range shrinks. In that case, IPs that would still be valid within the new range might get reallocated. Related: #40358 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 5e4e4bf ] Additionally, in case of pool spec changes, log previous and new spec. Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit b904b9f ] If there was a pool that was filled and had unsatisfied Services, on operator restart there was a high chance that we would reshuffle the assignment of IPs for that pool. This resulted in previously satisfied services either becoming unsatisfied or getting a new IP. The issue is fixed by not performing any operation on services until the full sync happens. After that, we first try to reuse IPs for already satisfied services, and only then try to assign additional IPs to unsatisfied services. Additionally, add a test that covers this case, simulating a restart of the operator. Related: #40358 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
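The ordering described in these LB-IPAM commits (reuse first, then allocate) can be shown with a toy two-pass allocator. The `service` type and the `allocate` function are hypothetical and far simpler than the operator's real data model; the sketch only demonstrates why the order of the two passes matters.

```go
package main

import "fmt"

// service is a hypothetical, simplified view of a LoadBalancer Service.
type service struct {
	name string
	ip   string // previously assigned IP; empty if the service is unsatisfied
}

// allocate assigns pool IPs in two passes: satisfied services keep their
// IP first, and only then do unsatisfied services receive what is left.
func allocate(pool []string, services []service) map[string]string {
	free := make(map[string]bool, len(pool))
	for _, ip := range pool {
		free[ip] = true
	}
	assigned := make(map[string]string)

	// Pass 1: re-reserve the existing IPs of already satisfied services.
	for _, s := range services {
		if s.ip != "" && free[s.ip] {
			assigned[s.name] = s.ip
			delete(free, s.ip)
		}
	}

	// Pass 2: hand out the remaining IPs, in pool order.
	for _, s := range services {
		if _, ok := assigned[s.name]; ok {
			continue
		}
		for _, ip := range pool {
			if free[ip] {
				assigned[s.name] = ip
				delete(free, ip)
				break
			}
		}
	}
	return assigned
}

func main() {
	pool := []string{"10.0.0.1", "10.0.0.2"}
	// The unsatisfied service is listed first on purpose.
	svcs := []service{{name: "new"}, {name: "old", ip: "10.0.0.1"}}
	fmt.Println(allocate(pool, svcs))
}
```

If both kinds of service were handled in a single pass in list order, "new" would take 10.0.0.1 and "old" would be reassigned, which is the reshuffling the commits describe.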

[ upstream commit 4b8cec7 ] Without a timeout, this step could get stuck until the workflow timeout is reached. Thus, we add a sane timeout of 2 minutes for the step to be completed. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit f2cc2e1 ] This is the follow-up PR to #38300, where I didn't update the docs after implementing the feature. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 57b5825 ] This commit fixes the incorrect use of return with EventuallyWithT in my previous unit test, along with other errors in the earlier test code. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 41a303a ] Enhance the router ID override logic for RouterIDIPPool mode so it can update the existing allocation accordingly. Move the restore logic to initializeJobs so the overall logic is clear. Update the unit test for override to cover the new function `handleRouterIDOverride`. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 8759e5f ] Modifying an object in the resource store is forbidden and causes a panic. Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

[ upstream commit 1f4145c ]

This commit fixes a log message that still contains substitutions in the log message after the log migration.

```
2025-08-14T18:19:16.429606429Z time=2025-08-14T18:19:16.427860732Z level=warn source=/go/src/github.com/cilium/cilium/pkg/envoy/xds/server.go:397 msg="NACK received for versions after %s and up to %s; waiting for a version update before sending again" module=enterprise-agent.agent.controlplane.envoy-proxy [...] version=19 responseNonce=20
```

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>

mhofstetter

Thanks

marseel

Thanks!

rastislavs

Thanks!

giorio94

Thanks!

joamaki · 2025-08-19T13:25:35Z

/test

0xch4z

Thank you!

fristonio

Thanks <3

This commit adds the TestPrivileged prefix to the node and ipsec linux tests. In main this is not needed as it is already running from within a TestPrivileged suite. While PRs related to similar fixes being backported from main #41078 and #41267 do sync up some tests, in v1.18 there were others that needed the TestPrivileged prefix to be added. Here's the fix. The last remaining bits will be the backport of #41279, which have not been added to this PR for consistency with history. Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>

smagnani96

Many Thanks Jussi for fixing the conflicts. LGTM.
I've opened #41281 to apply similar changes to the remaining unexecuted unparallel tests (which we do already execute in main), which should not go straight into this specific backport.

liyihuang

thanks.

smagnani96 and others added 30 commits August 19, 2025 13:06

loadbalancer: Update goleak ignores for client-go workqueue

dc8dc9f

[ upstream commit d8d0b98 ] This had changed when client-go was updated and this was causing false positive goroutine leak failures. Signed-off-by: Jussi Maki <jussi@isovalent.com>

ci: Bump conformance-runtime VM memory to 14GB

19a0af8

[ upstream commit b3ba248 ] The /tmp filesystem, which gets 50% of the RAM size, is running out of space in some test runs. Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

cni: remove dead code in cni lib and improve logging

866d33e

[ upstream commit 130d0e9 ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

cni: refactor deletion queue lib for unit testing

7ac8b1e

[ upstream commit 3feec70 ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

cni: add unit tests for deletion fallback client

7e8fbab

[ upstream commit 56abeb9 ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

daemon: extract restored endpoints regen handling to a separate method

6ac307c

[ upstream commit 3377927 ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

daemon: wait for endpoint api fence before restored endpoint regeneration

e35d2ab

[ upstream commit 7eca61f ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

lbipam: improve logging when we strip Ingress IPs

001e3b5

[ upstream commit 5e4e4bf ] Additionally, in case of pool spec changes, log previous and new spec. Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

docs: add bgp IP pool router-id allocation mode

d52665c

[ upstream commit f2cc2e1 ] This is the follow-up PR to #38300, where I didn't update the docs after implementing the feature. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

bgpv2: Avoid modifying CiliumBGPPeerConfig in resource store

9156b3d

[ upstream commit 8759e5f ] Modifying an object in the resource store is forbidden and causes a panic. Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>

joamaki requested review from smagnani96, liyihuang and mhofstetter August 19, 2025 11:32

github-actions bot added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label Aug 19, 2025

mhofstetter approved these changes Aug 19, 2025

marseel approved these changes Aug 19, 2025

rastislavs approved these changes Aug 19, 2025

devodev approved these changes Aug 19, 2025

giorio94 approved these changes Aug 19, 2025

0xch4z approved these changes Aug 19, 2025

aanm approved these changes Aug 19, 2025

fristonio approved these changes Aug 19, 2025

smagnani96 mentioned this pull request Aug 19, 2025

node:tests: fix privileged #41281

Merged

smagnani96 approved these changes Aug 19, 2025

liyihuang approved these changes Aug 20, 2025

joamaki marked this pull request as ready for review August 21, 2025 08:52

joamaki requested review from a team as code owners August 21, 2025 08:52

joamaki requested a review from Artyop August 21, 2025 08:52

Artyop approved these changes Aug 21, 2025

joamaki added this pull request to the merge queue Aug 21, 2025

Merged via the queue into v1.18 with commit 93df5b6 Aug 21, 2025
403 of 405 checks passed

joamaki deleted the pr/v1.18-backport-2025-08-19-01-06 branch August 21, 2025 12:04
