v1.18 Backports 2025-08-19 #41267
[ upstream commit ebfc395 ] As of today, we identify privileged tests with the "TestPrivileged" prefix in their name. However, that doesn't hold true for benchmarks requiring privileges, which are always skipped given their prefix doesn't match "TestPrivileged". This commit patches our current logic to introduce a new "BenchmarkPrivileged" prefix to identify such benchmarks requiring privileged access. Signed-off-by: Simone Magnani <simone.magnani@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
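A minimal sketch of the prefix check described above. The helper name `requiresPrivileges` and the slice `privilegedPrefixes` are hypothetical and only illustrate the idea; they are not taken from the Cilium test tooling.

```go
package main

import (
	"fmt"
	"strings"
)

// privilegedPrefixes holds the two name prefixes mentioned in the commit message.
var privilegedPrefixes = []string{"TestPrivileged", "BenchmarkPrivileged"}

// requiresPrivileges is a hypothetical helper: it reports whether a test or
// benchmark name carries one of the privileged prefixes.
func requiresPrivileges(name string) bool {
	for _, p := range privilegedPrefixes {
		if strings.HasPrefix(name, p) {
			return true
		}
	}
	return false
}

func main() {
	for _, n := range []string{"TestPrivilegedFoo", "BenchmarkPrivilegedBar", "BenchmarkBaz"} {
		fmt.Printf("%s privileged=%v\n", n, requiresPrivileges(n))
	}
}
```

With only the "TestPrivileged" prefix in the list, a name such as `BenchmarkPrivilegedBar` would never match, which is the skipped-benchmark situation the commit describes.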
[ upstream commit 7275a86 ] This commit adjusts the prefix of all benchmarks requiring privileged access to the new "BenchmarkPrivileged" prefix. Signed-off-by: Simone Magnani <simone.magnani@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 0e03851 ] Wait for the prune to actually happen in the lb/prune command to make tests that e.g. do BPF state restoration more reliable as then we won't have a prune racing in the background. Update migrate-any-proto.txtar to call lb/prune before restoration to avoid a race. Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit c123118 ] While the StateDB reconciler never calls the Update/Delete/Prune concurrently, we do want to be able to do BPFOps.ResetAndRestore from a test script to clear out the state. Since [sync.Mutex.Lock] is very cheap on an unlocked mutex, add a mutex around the BPFOps state so that we can inspect and manipulate it safely from tests and avoid very odd failures. Signed-off-by: Jussi Maki <jussi@isovalent.com>
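As a rough illustration of the mutex approach, the sketch below guards a small piece of state so that a reset triggered from a test cannot race with updates. The type `opsState` and its fields are hypothetical stand-ins, not the real BPFOps structure.

```go
package main

import (
	"fmt"
	"sync"
)

// opsState is a hypothetical stand-in for the state kept by BPFOps.
type opsState struct {
	mu       sync.Mutex
	backends map[string]int // illustrative only: backend key -> reference count
}

func newOpsState() *opsState {
	return &opsState{backends: map[string]int{}}
}

// Update mimics a reconciler operation touching the state.
func (s *opsState) Update(key string, refs int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.backends[key] = refs
}

// ResetAndRestore clears the state; in this sketch it may be called from a test.
func (s *opsState) ResetAndRestore() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.backends = map[string]int{}
}

// Len lets a test inspect the state safely.
func (s *opsState) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.backends)
}

func main() {
	s := newOpsState()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Update(fmt.Sprintf("backend-%d", i), 1)
		}(i)
	}
	wg.Wait()
	fmt.Println("entries before reset:", s.Len())
	s.ResetAndRestore()
	fmt.Println("entries after reset:", s.Len())
}
```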
[ upstream commit d8d0b98 ] This had changed when client-go was updated and this was causing false positive goroutine leak failures. Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 5b8127a ] The backends table wasn't checked after service and endpoint slice removal, so the endpoints were sometimes added back before the deletions were processed, which led to re-use of old IDs. Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 765ee79 ] When Kubelet gets started with --cloud-provider=external, then new Nodes get that annotation. CCM picks these new nodes and sets the ProviderID. But before CCM can start on the first control-plane of a new cluster, the CNI must be running. This means Cilium Operator needs a toleration for that taint. Related: https://app.slack.com/client/T1MATJ4SZ/C53TG4J4R In Cilium v1.17 the Cilium Operator had a toleration for all taints. This was changed in that PR: #40475 This PR extends the list of tolerations. Fixes: aa9a24c (Change the default taints that Cilium tolerates to avoid deploying to a drained node) Signed-off-by: Thomas Guettler <thomas.guettler@syself.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit b3ba248 ] The /tmp filesystem, which gets 50% of the RAM size, runs out of space in some test runs. Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit a32a6c8 ] This commit fixes a nil pointer dereference in cilium-agent. The segfault results from a race condition during processing of host endpoint identity label updates. When host endpoint identity labels are updated, we first delete the existing entry from the policy cache and then trigger an async endpoint regeneration to populate the cache again with the resolved selector policy from the repository. During endpoint policy regeneration, policy resolution happens in two steps:
1. Look up or create the cached selector policy entry in the policy cache.
2. Resolve the selector policy for the endpoint and set it in the corresponding cached selector policy.

Currently, when a cachedSelectorPolicy entry is deleted from the policy cache, we assume that the underlying selector policy is set. If two host endpoint identity label updates are close together, the policy cache delete operation might happen between the two steps above, leading to a nil pointer dereference on the underlying selector policy. This commit makes sure that the underlying selector policy is not nil before attempting to detach. This behavior is now consistent with how `policyCache.lookupOrCreate` creates the entry in the cache, where the underlying selector policy is nil unless set explicitly. Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
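The race can be pictured with a small sketch: an entry created by a lookup-or-create step has no policy attached yet, so a delete that lands before the resolve step must check for nil. All types and method names below (`selectorPolicy`, `cachedEntry`, `cache`, `remove`) are simplified, hypothetical stand-ins and do not mirror the real policy cache.

```go
package main

import "fmt"

// selectorPolicy is a hypothetical stand-in for a resolved policy.
type selectorPolicy struct{ detached bool }

func (p *selectorPolicy) detach() { p.detached = true }

// cachedEntry holds a policy that stays nil until it has been resolved.
type cachedEntry struct {
	policy *selectorPolicy
}

type cache struct {
	entries map[string]*cachedEntry
}

// lookupOrCreate is step 1: the new entry has a nil policy.
func (c *cache) lookupOrCreate(id string) *cachedEntry {
	if e, ok := c.entries[id]; ok {
		return e
	}
	e := &cachedEntry{}
	c.entries[id] = e
	return e
}

// remove deletes an entry and detaches its policy only if one was set.
func (c *cache) remove(id string) bool {
	e, ok := c.entries[id]
	if !ok {
		return false
	}
	delete(c.entries, id)
	if e.policy != nil { // guard against a delete between step 1 and step 2
		e.policy.detach()
	}
	return true
}

func main() {
	c := &cache{entries: map[string]*cachedEntry{}}
	c.lookupOrCreate("host") // step 1 done, step 2 (resolve) not yet
	fmt.Println("removed without panic:", c.remove("host"))
}
```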
[ upstream commit d2626ae ] This commit updates the Cilium OpenAPI spec to include HTTP response code 503 (Service Unavailable) in the `responses` list for mutating APIs in the endpoint subsystem. Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 1c193c0 ] This commit adds CNI endpoint delete handling for scenarios where the cilium-agent server is up but unable to serve the request (e.g. during state restore after restart). With this patch, cilium-cni persists endpoint delete requests to the offline queue if the API server responds with the ServiceUnavailable (503) HTTP code. Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 130d0e9 ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit e50b821 ] This commit simplifies the DeletionFallbackClient usage and handling of endpoint deletion requests. With this patch the constructor now returns the client object without performing any initialization. The `EndpointDeleteMany` method is directly used by callers to request endpoint deletion which is processed according to below flow:

```mermaid
flowchart TD
    A[DeletionFallbackClient] -->|EndpointDeleteMany| B[Get Or Connect Cilium API]
    B -->|Success| D[Request Endpoint Delete]
    B -->|Failure| E{DeletionQueue Lock}
    D -->|Success| OK{Return OK}
    D -->|Failure| E
    E -->|Acquired| F[Persist Deletion Request]
    E -->|NotAcquired| L[AcquireLock]
    L -->|Success| B
    L -->|Failed| NotOK
    F -->|Success| OK{Return OK}
    F -->|Failed| NotOK{Return Error}
```

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 3feec70 ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 56abeb9 ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit c4870be ] This commit adds a readiness precheck to certain endpoint subsystem APIs such as Put, Delete, and Patch. The check ensures that endpoint APIs are not exposed to external components like CNI until all dependencies are ready. Currently this includes:
1. Endpoint Delete
   * Endpoint Delete APIs are only exposed once the state restore is complete. This ensures that all delete operations always see the full state of the endpoint manager, so as to avoid any missed deletes.
2. Endpoint Put/Patch
   * Endpoint Update APIs are guarded by the DeletionQueue fence. These APIs are only exposed once the offline deletion queue has been processed by the agent. Since the processing of endpoint Delete and Add operations is async, this check ensures that we don't delete an active endpoint if there is a delay in replaying the offline delete queue.

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 3377927 ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
…tion [ upstream commit 7eca61f ] Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit e286198 ] This commit changes how other components (daemon) wait on the initial policy computation for an endpoint. Instead of directly relying on the `InitialEnvoyPolicyComputed` channel in an endpoint object, we now expose a method for existing callers to block on the initial policy computation. The `WaitForInitialPolicy` method on the endpoint object waits for either the initial policy to be computed or the endpoint to be deleted. This makes sure that callers don't wait on policy computation indefinitely in cases where the endpoint is deleted before the initial regeneration is completed. This fixes an issue in Cilium during endpoint restore where a restored endpoint is deleted before regeneration when processing the offline deletion queue. Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 114cacd ] [ backporter's note: minor fixes due to 0edfec1 not present in v1.18 ] The current loadbalancer controlplane always initializes new `BackendParams` with its default value `Unhealthy=false`. As a result, backend selection for a given service cannot account for service health check implementations that should report the backend health state before that service backend is exposed via statedb `frontend` to other modules. Therefore, this commit introduces a new hook `SetIsServiceHealthCheckedFunc` to the loadbalancer `Writer`. Service health checking modules can use this hook to mark a given service as health checked. Backend selection for a service will use that hook and only include a backend in a health-checked service frontend if the health state for the backend has already been reported once. This prevents prematurely exposing unhealthy backends (or their incorrect health status) to other modules. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
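A simplified sketch of the selection rule. The `backend` struct, the `isHealthCheckedFunc` type, and `selectBackends` are hypothetical; they only model the idea that backends of a health-checked service are held back until a health state has been reported, and they are not the loadbalancer `Writer` API.

```go
package main

import "fmt"

// backend is a hypothetical stand-in for a service backend.
type backend struct {
	addr           string
	healthReported bool // true once a health checker reported any state
}

// isHealthCheckedFunc models the hook: it tells whether a service is health checked.
type isHealthCheckedFunc func(service string) bool

// selectBackends returns the backends to expose for the service. For a
// health-checked service, backends without a reported health state are skipped.
func selectBackends(service string, hook isHealthCheckedFunc, all []backend) []backend {
	checked := hook != nil && hook(service)
	var out []backend
	for _, b := range all {
		if checked && !b.healthReported {
			continue // not yet reported: do not expose prematurely
		}
		out = append(out, b)
	}
	return out
}

func main() {
	all := []backend{
		{addr: "10.0.0.1:80", healthReported: true},
		{addr: "10.0.0.2:80", healthReported: false},
	}
	hook := func(service string) bool { return service == "checked-svc" }
	fmt.Println(len(selectBackends("checked-svc", hook, all)), "backend(s) for the health-checked service")
	fmt.Println(len(selectBackends("plain-svc", hook, all)), "backend(s) for the unchecked service")
}
```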
[ upstream commit 49b33da ] Previously, whenever a CIDR range was extended, we removed all allocations from the existing ranges, created new ranges, and tried reassigning IPs to the new ranges. We already tried to reuse the same IP; however, because we first assigned IPs to "unsatisfied" services, they could steal existing IPs from other services. This resulted in reallocation of the IP for an already "satisfied" service, and in a temporary state with two different services having the same IP. The same issue could happen when the selector of a pool was modified to select new unsatisfied services. Note that this does not solve the case where a CIDR range shrinks: when a CIDR shrinks, IPs that would still be valid within the new range might get reallocated. Related: #40358 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 5e4e4bf ] Additionally, in case of pool spec changes, log previous and new spec. Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit b904b9f ] If a pool was full and had unsatisfied Services, there was a high chance on operator restart that we would reshuffle the assignment of IPs for that pool. This resulted in previously satisfied services either becoming unsatisfied or getting a new IP. The issue is fixed by not performing any operation on services until a full sync happens. After that, we first try to reuse IPs for already satisfied services, and only then try to assign additional IPs to unsatisfied services. Additionally, add a test that covers this case by simulating a restart of the operator. Related: #40358 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
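The ordering described here (reuse IPs of satisfied services first, then serve unsatisfied ones) can be sketched as a two-pass assignment. The `service` struct and `assign` function are hypothetical and ignore everything else the LB IPAM code has to handle, such as selectors, multiple pools, and shared IPs.

```go
package main

import "fmt"

// service is a hypothetical stand-in; ip is empty for an unsatisfied service.
type service struct {
	name string
	ip   string
}

// assign gives every service an IP from the pool where possible. Pass 1 lets
// satisfied services keep their IP; pass 2 hands the remaining IPs to the rest.
func assign(pool []string, svcs []service) map[string]string {
	free := make(map[string]bool, len(pool))
	for _, ip := range pool {
		free[ip] = true
	}
	out := make(map[string]string)

	// Pass 1: satisfied services reuse their current IP if it is still in the pool.
	for _, s := range svcs {
		if s.ip != "" && free[s.ip] {
			out[s.name] = s.ip
			free[s.ip] = false
		}
	}

	// Pass 2: unsatisfied services take whatever is left, in pool order.
	for _, s := range svcs {
		if _, done := out[s.name]; done {
			continue
		}
		for _, ip := range pool {
			if free[ip] {
				out[s.name] = ip
				free[ip] = false
				break
			}
		}
	}
	return out
}

func main() {
	pool := []string{"10.0.0.1", "10.0.0.2"}
	svcs := []service{
		{name: "new-svc"},                      // unsatisfied, listed first
		{name: "old-svc", ip: "10.0.0.1"},      // satisfied before the restart
		{name: "late-svc"},                     // unsatisfied, pool will be full
	}
	fmt.Println(assign(pool, svcs))
}
```

If the unsatisfied services were handled first, `new-svc` would take `10.0.0.1` and `old-svc` would be moved or left without an IP, which is the reshuffling the commit avoids.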
[ upstream commit 4b8cec7 ] Without a timeout, this step could get stuck until the workflow timeout is reached. Thus, we add a sane timeout of 2 minutes for the step to be completed. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 57b5825 ] This commit fixes my previous unit test, which used an incorrect return with EventuallyWithT, along with other errors in the earlier test code. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 41a303a ] Enhance the router ID override logic for RouterIDIPPool mode so it can update the existing allocation accordingly. Move the restore logic to initializeJobs so the overall logic is clear. Update the unit test for override to cover the new function `handleRouterIDOverride`. Signed-off-by: Liyi Huang <liyi.huang@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 8759e5f ] Modifying an object in the resource store is forbidden and causes a panic. Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 1f4145c ] This commit fixes a log message that still contains substitutions in the log message after the log migration.

```
2025-08-14T18:19:16.429606429Z time=2025-08-14T18:19:16.427860732Z level=warn source=/go/src/github.com/cilium/cilium/pkg/envoy/xds/server.go:397 msg="NACK received for versions after %s and up to %s; waiting for a version update before sending again" module=enterprise-agent.agent.controlplane.envoy-proxy [...] version=19 responseNonce=20
```

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
Thanks
Thanks!
Thanks!
Thanks!
/test
Thank you!
Thanks <3
This commit adds the TestPrivileged prefix to the node and ipsec linux tests. In main this is not needed, as they already run from within a TestPrivileged suite. While the PRs backporting similar fixes from main (#41078 and #41267) sync up some tests, other tests in v1.18 still needed the TestPrivileged prefix, and this commit adds it. The last remaining bits will be the backport of #41279, which has not been added to this PR for consistency with history. Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
Many thanks, Jussi, for fixing the conflicts. LGTM.
I've opened #41281 to apply similar changes to the remaining unexecuted unparallel tests (which we do already execute in main), which should not go straight into this specific backport.
thanks.
Once this PR is merged, a GitHub action will update the labels of these PRs: