Contributor

@joamaki joamaki commented Aug 19, 2025

Once this PR is merged, a GitHub action will update the labels of these PRs:

 41007 41085 41098 41102 41079 40568 41092 41122 41147 41120 40340 41088 41171 41010 41110 41148 41039 41107 41240

smagnani96 and others added 30 commits August 19, 2025 13:06
[ upstream commit ebfc395 ]

As of today, we identify privileged tests with the "TestPrivileged" prefix
in their name. However, that doesn't hold true for benchmarks requiring
privileges, which are always skipped given their prefix doesn't match
"TestPrivileged".

This commit patches our current logic to introduce a new "BenchmarkPrivileged"
prefix to identify such benchmarks requiring privileged access.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
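To illustrate the prefix check described in ebfc395, here is a minimal sketch. The two prefixes come from the commit message; the helper name `isPrivilegedName` and the variable `privilegedPrefixes` are hypothetical, and the actual helper in the tree may be shaped differently.

```go
package main

import (
	"fmt"
	"strings"
)

// privilegedPrefixes lists the name prefixes that mark a test or benchmark
// as requiring privileges. The names here are hypothetical.
var privilegedPrefixes = []string{"TestPrivileged", "BenchmarkPrivileged"}

// isPrivilegedName reports whether name starts with one of the prefixes.
func isPrivilegedName(name string) bool {
	for _, p := range privilegedPrefixes {
		if strings.HasPrefix(name, p) {
			return true
		}
	}
	return false
}

func main() {
	names := []string{"TestPrivilegedFoo", "BenchmarkPrivilegedBar", "BenchmarkBaz"}
	for _, n := range names {
		fmt.Printf("%s privileged=%v\n", n, isPrivilegedName(n))
	}
}
```

With only the "TestPrivileged" prefix in the list, a benchmark could never match, which is the skip behaviour the commit message describes.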
[ upstream commit 7275a86 ]

This commit renames all benchmarks requiring privileged access so that
they use the new "BenchmarkPrivileged" prefix.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 0e03851 ]

Wait for the prune to actually happen in the lb/prune command
to make tests that e.g. do BPF state restoration more reliable
as then we won't have a prune racing in the background.

Update migrate-any-proto.txtar to call lb/prune before restoration
to avoid a race.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit c123118 ]

While the StateDB reconciler never calls the Update/Delete/Prune
concurrently, we do want to be able to do BPFOps.ResetAndRestore
from a test script to clear out the state.

Since [sync.Mutex.Lock] is very cheap on an unlocked mutex, add
a mutex around the BPFOps state so that we can inspect and manipulate
it safely from tests and avoid very odd failures.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
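A rough sketch of the locking pattern from c123118, assuming a hypothetical `ops` type with a map as its state. It is not the real BPFOps type; it only shows every method taking the same mutex so that a test-triggered reset cannot interleave with an update.

```go
package main

import (
	"fmt"
	"sync"
)

// ops is a hypothetical stand-in for an operations type with mutable state.
type ops struct {
	mu    sync.Mutex
	state map[string]int
}

func newOps() *ops { return &ops{state: map[string]int{}} }

// Update stores a value while holding the mutex.
func (o *ops) Update(key string, v int) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.state[key] = v
}

// ResetAndRestore clears the state while holding the same mutex.
func (o *ops) ResetAndRestore() {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.state = map[string]int{}
}

// Len returns the number of entries, also under the mutex.
func (o *ops) Len() int {
	o.mu.Lock()
	defer o.mu.Unlock()
	return len(o.state)
}

func main() {
	o := newOps()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			o.Update(fmt.Sprintf("key-%d", i), i)
		}(i)
	}
	wg.Wait()
	fmt.Println("entries before reset:", o.Len())
	o.ResetAndRestore()
	fmt.Println("entries after reset:", o.Len())
}
```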
[ upstream commit d8d0b98 ]

This had changed when client-go was updated and this was causing
false positive goroutine leak failures.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 5b8127a ]

The backends table wasn't checked after service and endpoint slice removal,
so the endpoints were sometimes added back before the deletions were
processed, which led to re-use of old IDs.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 765ee79 ]

When Kubelet gets started with --cloud-provider=external, new Nodes get that taint. CCM picks up these new nodes and sets the ProviderID. But before CCM can start on the first control-plane of a new cluster, the CNI must be running. This means the Cilium Operator needs a toleration for that taint.

Related: https://app.slack.com/client/T1MATJ4SZ/C53TG4J4R

In Cilium v1.17 the Cilium Operator had a toleration for all taints. This was changed in that PR: #40475
This PR extends the list of tolerations.

Fixes: aa9a24c (Change the default taints that Cilium tolerates to avoid deploying to a drained node)

Signed-off-by: Thomas Guettler <thomas.guettler@syself.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit bbf32d5 ]

Since c27d53f
we use 30 GB disk size for LVH images in ci-runtime,
so we can re-enable go caches for privileged tests.

Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit b3ba248 ]

The /tmp filesystem, which gets 50% of the RAM size, is
running out of space in some test runs.

Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit a32a6c8 ]

This commit fixes a nil pointer dereference in cilium-agent. The
segfault results from a race condition during processing of host endpoint
identity label updates.

When host endpoint identity labels are updated we first delete the
existing entry from policy cache and then trigger async endpoint
regeneration to populate the cache again with resolved selector policy from
repository.
During endpoint policy regeneration the policy resolution happens in two
steps:
1. Lookup or create the cached selector policy entry in policy cache.
2. Resolve the selector policy for endpoint and set it in the corresponding
   cached selector policy.

Currently when cachedSelectorPolicy entry is deleted from the
policycache we assume that the underlying selector policy is set.
If two host endpoint identity labels updates are close together then the
policy cache delete operation might happen between the above 2 steps
leading to a nil pointer deref for underlying selector policy.

This commit makes sure that the underlying selector policy is not nil
before attempting to detach. This behavior is now consistent with how
`policyCache.lookupOrCreate` creates the entry in cache where the
underlying selector policy is nil unless set explicitly.

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
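The guard described in a32a6c8 can be sketched as follows. `lookupOrCreate` and `cachedSelectorPolicy` are named in the commit message; the other types, fields and method bodies are simplified assumptions, and the locking of the real cache is left out.

```go
package main

import "fmt"

// selectorPolicy is a hypothetical stand-in for the resolved policy.
type selectorPolicy struct{ detached bool }

func (p *selectorPolicy) detach() { p.detached = true }

// cachedSelectorPolicy holds a policy that stays nil until it is resolved.
type cachedSelectorPolicy struct{ policy *selectorPolicy }

type policyCache struct{ entries map[int]*cachedSelectorPolicy }

func newPolicyCache() *policyCache {
	return &policyCache{entries: map[int]*cachedSelectorPolicy{}}
}

// lookupOrCreate returns the entry for id, creating it with a nil policy.
func (c *policyCache) lookupOrCreate(id int) *cachedSelectorPolicy {
	e, ok := c.entries[id]
	if !ok {
		e = &cachedSelectorPolicy{}
		c.entries[id] = e
	}
	return e
}

// delete removes the entry and detaches its policy only if one was set.
func (c *policyCache) delete(id int) bool {
	e, ok := c.entries[id]
	if !ok {
		return false
	}
	delete(c.entries, id)
	if e.policy != nil { // the entry may have been deleted before resolution
		e.policy.detach()
	}
	return true
}

func main() {
	c := newPolicyCache()
	c.lookupOrCreate(1) // step 1 done, step 2 (resolve) has not happened yet
	fmt.Println("deleted unresolved entry:", c.delete(1))
}
```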
[ upstream commit d2626ae ]

This commit updates the Cilium OpenAPI spec to include the HTTP response
code 503 (Service Unavailable) in the `responses` list for mutating APIs in
the endpoint subsystem.

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 1c193c0 ]

This commit adds CNI endpoint delete handling for scenarios where
the cilium-agent server is up but unable to serve the request (e.g.
during state restore after restart).
With this patch, cilium-cni will persist endpoint delete requests to the
offline queue if the API server responds with HTTP code 503
(Service Unavailable).

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
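As an illustration of the decision made in 1c193c0, the sketch below classifies an API error and reports whether the delete request should go to the offline queue. The `apiError` type and `shouldQueue` function are hypothetical; the generated Cilium client exposes its own error types.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError is a hypothetical error type carrying the HTTP status code.
type apiError struct{ code int }

func (e *apiError) Error() string {
	return fmt.Sprintf("api error: %d %s", e.code, http.StatusText(e.code))
}

// shouldQueue reports whether a failed delete should be persisted to the
// offline queue, which in this sketch means the agent answered with 503.
func shouldQueue(err error) bool {
	var ae *apiError
	return errors.As(err, &ae) && ae.code == http.StatusServiceUnavailable
}

func main() {
	for _, code := range []int{http.StatusServiceUnavailable, http.StatusNotFound} {
		err := &apiError{code: code}
		fmt.Printf("%v -> queue=%v\n", err, shouldQueue(err))
	}
}
```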
[ upstream commit 130d0e9 ]

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit e50b821 ]

This commit simplifies the DeletionFallbackClient usage and handling of
endpoint deletion requests. With this patch the constructor now returns
the client object without performing any initialization.
The `EndpointDeleteMany` method is directly used by callers to request
endpoint deletion which is processed according to below flow:

```mermaid
flowchart TD
    A[DeletionFallbackClient] -->|EndpointDeleteMany| B[Get Or Connect Cilium API]
    B -->|Success| D[Request Endpoint Delete]
    B -->|Failure| E{DeletionQueue Lock}
    D -->|Success| OK{Return OK}
    D -->|Failure| E
    E -->|Acquired| F[Persist Deletion Request]
    E -->|NotAcquired| L[AcquireLock]
    L -->|Success| B
    L -->|Failed| NotOK
    F -->|Success| OK{Return OK}
    F -->|Failed| NotOK{Return Error}
```

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
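One possible reading of the flowchart in e50b821, written as a sketch. All field names, the single-ID `EndpointDelete` method and the retry bound are assumptions made for illustration; releasing the queue lock is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

var errDown = errors.New("agent unavailable")

// fallbackClient is a hypothetical model of the deletion flow. Each step of
// the flowchart is a function field so that it can be stubbed.
type fallbackClient struct {
	connect  func() error          // get or connect to the agent API
	remove   func(id string) error // request endpoint deletion via the API
	tryLock  func() bool           // non-blocking attempt on the queue lock
	waitLock func() error          // blocking acquire of the queue lock
	persist  func(id string) error // write the request to the offline queue
}

// EndpointDelete tries the API first and falls back to the offline queue.
func (c *fallbackClient) EndpointDelete(id string) error {
	const maxAttempts = 2 // assumption: bound the retry loop
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := c.connect(); err == nil {
			if err := c.remove(id); err == nil {
				return nil
			}
		}
		if c.tryLock() {
			return c.persist(id)
		}
		if err := c.waitLock(); err != nil {
			return err
		}
		// Lock obtained after waiting: go back and retry the API.
	}
	return errors.New("endpoint delete: retries exhausted")
}

func main() {
	var queued []string
	c := &fallbackClient{
		connect:  func() error { return errDown },
		remove:   func(string) error { return nil },
		tryLock:  func() bool { return true },
		waitLock: func() error { return nil },
		persist:  func(id string) error { queued = append(queued, id); return nil },
	}
	fmt.Println("err:", c.EndpointDelete("ep-1"), "queued:", queued)
}
```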
[ upstream commit 3feec70 ]

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 56abeb9 ]

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit c4870be ]

This commit adds a readiness precheck to certain endpoint subsystem APIs
like Put, Delete, Patch. This check ensures that endpoint APIs are not
exposed to external components like CNI till all dependencies are ready.
Currently this includes:

1. Endpoint Delete
  * Endpoint Delete APIs are only exposed once the state restore is complete.
  This ensures that all delete operations always see full state of endpoint
  manager so as to avoid any missed deletes.

2. Endpoint Put/Patch
  * Endpoint Update APIs are guarded by the DeletionQueue fence. These APIs are
  only exposed once the offline deletion queue is processed by the agent.
  Since the processing of endpoint Delete and Add operations is async,
  this check ensures that we don't delete an active endpoint if there is a
  delay in replaying the offline delete queue.

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
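A minimal sketch of such a readiness precheck, assuming plain `net/http` handlers and a channel that is closed once the dependency (state restore or the deletion queue fence) is ready. Cilium's real endpoint handlers are wired differently; `requireReady`, `okHandler` and `statusOf` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireReady wraps next so that it answers 503 until ready is closed.
func requireReady(ready <-chan struct{}, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-ready:
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "endpoint API not ready", http.StatusServiceUnavailable)
		}
	})
}

// okHandler stands in for the real endpoint handler.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// statusOf sends one request through h and returns the response code.
func statusOf(h http.Handler, method string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, "/endpoint/1", nil))
	return rec.Code
}

func main() {
	restored := make(chan struct{})
	h := requireReady(restored, okHandler)
	fmt.Println("before restore:", statusOf(h, http.MethodDelete))
	close(restored) // state restore finished
	fmt.Println("after restore:", statusOf(h, http.MethodDelete))
}
```

Returning 503 rather than blocking lets the caller decide what to do, for example persisting a delete request to the offline queue.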
[ upstream commit 3377927 ]

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
…tion

[ upstream commit 7eca61f ]

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit e286198 ]

This commit changes how other components (e.g. the daemon) wait on the initial
policy computation for an endpoint. Instead of directly relying on the
`InitialEnvoyPolicyComputed` channel in an endpoint object, we now expose
a method for existing callers to block on initial policy computation.
The `WaitForInitialPolicy` method on the endpoint object waits for either the
initial policy to be computed or the endpoint to be deleted. This makes sure
that callers don't wait on policy computation indefinitely in cases where
the endpoint is deleted before the initial regeneration is completed.

This fixes an issue in cilium during endpoint restore where a restored
endpoint is deleted before regeneration when processing the offline
deletion queue.

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
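The waiting behaviour described in e286198 amounts to a `select` over two channels. The sketch below assumes a hypothetical `endpoint` type with two channels and an error return; the real method signature is not shown in this conversation and may differ.

```go
package main

import (
	"errors"
	"fmt"
)

var errEndpointDeleted = errors.New("endpoint deleted before initial policy was computed")

// endpoint is a hypothetical stand-in with only the two relevant channels.
type endpoint struct {
	initialPolicyComputed chan struct{} // closed after the first policy computation
	deleted               chan struct{} // closed when the endpoint is deleted
}

func newEndpoint() *endpoint {
	return &endpoint{
		initialPolicyComputed: make(chan struct{}),
		deleted:               make(chan struct{}),
	}
}

// WaitForInitialPolicy blocks until the initial policy is computed or the
// endpoint is deleted, whichever happens first.
func (e *endpoint) WaitForInitialPolicy() error {
	select {
	case <-e.initialPolicyComputed:
		return nil
	case <-e.deleted:
		return errEndpointDeleted
	}
}

func main() {
	e := newEndpoint()
	close(e.deleted) // e.g. removed while the offline deletion queue is replayed
	fmt.Println("wait result:", e.WaitForInitialPolicy())
}
```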
[ upstream commit 114cacd ]

[ backporter's note: minor fixes due to 0edfec1 not
  present in v1.18 ]

The current loadbalancer control plane always initializes new `BackendParams`
with the default value `Unhealthy=false`. As a result, backend selection for a
given service doesn't account for service health check implementations that
should report the backend health state before that service backend is exposed
to other modules via the statedb `frontend`.

Therefore, this commit introduces a new hook `SetIsServiceHealthCheckedFunc` to the
loadbalancer `Writer`. Service health checking modules can use this hook to
mark that a given service is health checked. Backend selection for a service will use
that hook and only include a backend in a health-checked service frontend if the
health state for the backend has already been reported once. This prevents prematurely
exposing unhealthy backends (or their incorrect health status) to other modules.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
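A simplified sketch of the selection rule from 114cacd. The hook is modelled as a plain function argument, and the `backend` type with its `healthReported` field is an assumption; the real `Writer` hook and `BackendParams` are shaped differently.

```go
package main

import "fmt"

// backend is a hypothetical, reduced view of a load-balancer backend.
type backend struct {
	addr           string
	unhealthy      bool
	healthReported bool // true once a health checker reported state at least once
}

// selectBackends returns the backends to expose for svc. For health-checked
// services, backends without a reported health state are held back.
func selectBackends(svc string, backends []backend, isHealthChecked func(string) bool) []backend {
	checked := isHealthChecked != nil && isHealthChecked(svc)
	var out []backend
	for _, b := range backends {
		if b.unhealthy {
			continue
		}
		if checked && !b.healthReported {
			continue // health state unknown, do not expose yet
		}
		out = append(out, b)
	}
	return out
}

func main() {
	bs := []backend{
		{addr: "10.0.0.1:80", healthReported: true},
		{addr: "10.0.0.2:80"}, // no health report yet
	}
	always := func(string) bool { return true }
	fmt.Println("health-checked service:", len(selectBackends("svc", bs, always)))
	fmt.Println("service without checks:", len(selectBackends("svc", bs, nil)))
}
```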
[ upstream commit 49b33da ]

Previously, whenever a CIDR range was extended, we removed all allocations
from the existing ranges, created new ranges and tried reassigning IPs to
the new ranges. We already tried to reuse the same IP; however, because we
first tried to assign IPs to "unsatisfied" services, they could steal
existing IPs from other services. This resulted in reallocation of the IP
for an already "satisfied" service, and in a temporary state with two
different services having the same IP.

The same issue could happen when the selector of a pool was modified so
that it selected new unsatisfied services.

Note that this does not solve the case where the CIDR range shrinks. When a
CIDR range shrinks, IPs that would still be valid within the new range might
get reallocated.

Related: #40358

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 5e4e4bf ]

Additionally, in case of pool spec changes, log previous and new spec.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit b904b9f ]

If there was a pool that was filled and had unsatisfied Services, on
operator restart there was a high chance that we would reshuffle the
assignment of IPs for that pool. This resulted in previously satisfied
services either becoming unsatisfied or getting a new IP.

The issue is fixed by not performing any operation on services until the
full sync happens. After that, we first try to reuse IPs for already
satisfied services and only then try to assign additional IPs
to unsatisfied services.

Additionally, add a test that covers this case, simulating a restart of
the operator.

Related: #40358

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
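The ordering described in b904b9f (keep IPs of satisfied services first, then serve unsatisfied ones) can be sketched with a toy allocator. The `service` type, the string-based pool and the `allocate` function are assumptions for illustration and do not reflect the operator's real data structures.

```go
package main

import "fmt"

// service is a hypothetical view of a Service; ip is empty when unsatisfied.
type service struct {
	name string
	ip   string
}

// allocate returns a name-to-IP assignment. Pass 1 keeps existing IPs that
// are still in the pool; pass 2 hands the remaining IPs to everyone else.
func allocate(pool []string, svcs []service) map[string]string {
	free := make(map[string]bool, len(pool))
	for _, ip := range pool {
		free[ip] = true
	}
	out := map[string]string{}
	for _, s := range svcs { // pass 1: already satisfied services
		if s.ip != "" && free[s.ip] {
			out[s.name] = s.ip
			free[s.ip] = false
		}
	}
	next := 0
	for _, s := range svcs { // pass 2: unsatisfied services
		if _, ok := out[s.name]; ok {
			continue
		}
		for next < len(pool) && !free[pool[next]] {
			next++
		}
		if next == len(pool) {
			break // pool exhausted, remaining services stay unsatisfied
		}
		out[s.name] = pool[next]
		free[pool[next]] = false
	}
	return out
}

func main() {
	pool := []string{"10.0.0.1", "10.0.0.2"}
	svcs := []service{{name: "new"}, {name: "old", ip: "10.0.0.1"}}
	fmt.Println(allocate(pool, svcs))
}
```

Even though "new" comes first in the input, "old" keeps 10.0.0.1 because existing assignments are processed before any new ones.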
[ upstream commit 4b8cec7 ]

Without a timeout, this step could get stuck until the workflow timeout
is reached. Thus, we add a sane timeout of 2 minutes for the step to be
completed.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit f2cc2e1 ]

This is the follow-up PR to #38300,
where I didn't update the docs after implementing the feature.

Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 57b5825 ]

This commit fixes my previous unit test, which used an incorrect return with EventuallyWithT, along with other errors in the earlier test code.

Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 41a303a ]

Enhance the router ID override logic for RouterIDIPPool mode so it can update the existing allocation accordingly.
Move the restore logic to initializeJobs so the overall logic is clear.

Update the unit test for override to cover the new function `handleRouterIDOverride`.

Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 8759e5f ]

Modifying an object in the resource store is forbidden and causes a panic.

Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 1f4145c ]

This commit fixes a log message that still contains format substitutions
(`%s`) after the logging migration.

```
2025-08-14T18:19:16.429606429Z time=2025-08-14T18:19:16.427860732Z level=warn source=/go/src/github.com/cilium/cilium/pkg/envoy/xds/server.go:397 msg="NACK received for versions after %s and up to %s; waiting for a version update before sending again" module=enterprise-agent.agent.controlplane.envoy-proxy [...] version=19 responseNonce=20
```

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
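The symptom in the log line above can be reproduced with Go's `log/slog`, which treats the message as a literal string and takes values as key/value attributes. The sketch below is illustrative only: the message wording in `logFixed` and the helper names are assumptions, not the text of the actual fix.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"strings"
)

// render runs logFn against a text handler and returns what was written.
func render(logFn func(l *slog.Logger)) string {
	var buf bytes.Buffer
	logFn(slog.New(slog.NewTextHandler(&buf, nil)))
	return buf.String()
}

// logBroken keeps printf-style verbs in the message; slog does not expand them.
func logBroken(l *slog.Logger) {
	l.Warn("NACK received for versions after %s and up to %s", "version", 19)
}

// logFixed moves the values into attributes and drops the verbs.
func logFixed(l *slog.Logger) {
	l.Warn("NACK received; waiting for a version update before sending again",
		"version", 19, "responseNonce", 20)
}

// containsVerb reports whether s still contains a literal %s.
func containsVerb(s string) bool { return strings.Contains(s, "%s") }

func main() {
	fmt.Print(render(logBroken))
	fmt.Print(render(logFixed))
}
```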
@github-actions github-actions bot added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label Aug 19, 2025
Member

@mhofstetter mhofstetter left a comment

Thanks

Contributor

@marseel marseel left a comment

Thanks!

Contributor

@rastislavs rastislavs left a comment

Thanks!

Member

@giorio94 giorio94 left a comment

Thanks!

Contributor Author

joamaki commented Aug 19, 2025

/test

Contributor

@0xch4z 0xch4z left a comment

Thank you!

Member

@fristonio fristonio left a comment

Thanks <3

smagnani96 added a commit that referenced this pull request Aug 19, 2025
This commit adds the TestPrivileged prefix to the node and ipsec linux tests.
In main this is not needed as it is already running from within a TestPrivileged
suite. While PRs related to similar fixes being backported from main #41078
and #41267 do sync up some tests, in v1.18 there were others that
needed the TestPrivileged prefix to be added. Here's the fix.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
smagnani96 added a commit that referenced this pull request Aug 19, 2025
This commit adds the TestPrivileged prefix to the node and ipsec linux tests.
In main this is not needed as it is already running from within a TestPrivileged
suite. While PRs related to similar fixes being backported from main #41078
and #41267 do sync up some tests, in v1.18 there were others that
needed the TestPrivileged prefix to be added. Here's the fix.

The last remaining bits will be the backport of #41279, which have not
been added to this PR for consistency with history.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
Contributor

@smagnani96 smagnani96 left a comment

Many Thanks Jussi for fixing the conflicts. LGTM.
I've opened #41281 to apply similar changes to the remaining unexecuted unparallel tests (which we do already execute in main), which should not go straight into this specific backport.

github-merge-queue bot pushed a commit that referenced this pull request Aug 19, 2025
This commit adds the TestPrivileged prefix to the node and ipsec linux tests.
In main this is not needed as it is already running from within a TestPrivileged
suite. While PRs related to similar fixes being backported from main #41078
and #41267 do sync up some tests, in v1.18 there were others that
needed the TestPrivileged prefix to be added. Here's the fix.

The last remaining bits will be the backport of #41279, which have not
been added to this PR for consistency with history.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
Contributor

@liyihuang liyihuang left a comment

thanks.

@joamaki joamaki marked this pull request as ready for review August 21, 2025 08:52
@joamaki joamaki requested review from a team as code owners August 21, 2025 08:52
@joamaki joamaki requested a review from Artyop August 21, 2025 08:52
@joamaki joamaki added this pull request to the merge queue Aug 21, 2025
Merged via the queue into v1.18 with commit 93df5b6 Aug 21, 2025
403 of 405 checks passed
@joamaki joamaki deleted the pr/v1.18-backport-2025-08-19-01-06 branch August 21, 2025 12:04
Labels
backport/1.18 This PR represents a backport for Cilium 1.18.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.