Wait for CEC and CCEC resources before restoring endpoints. #32981

Merged
merged 10 commits into from
Jun 21, 2024

jrajahalme
Member

@jrajahalme jrajahalme commented Jun 9, 2024

Just like we currently wait for policy CRDs before generating endpoints for the first time after restoring them during an agent restart, it is beneficial to also wait for CiliumEnvoyConfig (CEC) and CiliumClusterwideEnvoyConfig (CCEC) resources. This is mainly due to possible policy redirections to CEC/CCEC listeners, but also due to L7 LB service annotations, as well as east/west Ingress and Gateway API, which are implemented via CEC/CCEC resources.

To achieve this, we first move the potentially blocking parts of the daemon's lifecycle start hook into a goroutine, so that the hive can run concurrently with the daemon bootstrap. This is necessary because start hooks are executed in the order in which they are appended to the hive lifecycle, and the next start hook runs only after the previous one has returned.
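The ordering constraint can be illustrated with a small, stdlib-only Go sketch. The `lifecycle` and `hook` types are hypothetical stand-ins, not the cilium/hive API: start hooks run strictly in append order, and the daemon hook hands its blocking wait to a goroutine so that a later hook (standing in for a resource watcher) can run and unblock it.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// hook is a hypothetical stand-in for a hive start hook (not the cilium/hive API).
type hook func(ctx context.Context) error

// lifecycle runs start hooks strictly in append order: hook N+1 is only
// invoked after hook N has returned.
type lifecycle struct{ hooks []hook }

func (l *lifecycle) Append(h hook) { l.hooks = append(l.hooks, h) }

func (l *lifecycle) Start(ctx context.Context) error {
	for _, h := range l.hooks {
		if err := h(ctx); err != nil {
			return err
		}
	}
	return nil
}

// runStartup appends a daemon hook that waits for resources to be synced and,
// after it, a watcher hook that performs that sync. Had the daemon hook
// blocked inline, the watcher hook would never run and startup would deadlock.
func runStartup() []string {
	var (
		mu    sync.Mutex
		order []string
	)
	record := func(s string) {
		mu.Lock()
		defer mu.Unlock()
		order = append(order, s)
	}

	synced := make(chan struct{})       // closed once the watched resources are synced
	bootstrapped := make(chan struct{}) // closed once the daemon bootstrap finishes

	var lc lifecycle
	lc.Append(func(ctx context.Context) error {
		record("daemon start hook returned")
		go func() {
			<-synced // the blocking wait now happens off the start-hook path
			record("daemon bootstrap done")
			close(bootstrapped)
		}()
		return nil
	})
	lc.Append(func(ctx context.Context) error {
		record("watcher started")
		close(synced) // pretend the initial sync completed immediately
		return nil
	})

	if err := lc.Start(context.Background()); err != nil {
		panic(err)
	}
	<-bootstrapped

	mu.Lock()
	defer mu.Unlock()
	return append([]string(nil), order...)
}

func main() {
	for _, step := range runStartup() {
		fmt.Println(step)
	}
}
```

Moving `<-synced` out of the goroutine and into the first hook's body would make `Start` block forever, since the hook that closes `synced` is never reached.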

startDaemon() blocks waiting for a subset of k8s resources to be synchronized. As we add CEC and CCEC resources to that subset, we want to use the hive lifecycle to start the resource watchers for these resources. This becomes possible by making sure that the daemon start hooks do not block for this purpose. Similarly, any start hook that needs to await the daemon promise must perform that Await call in a goroutine. This allows the hive to run all the needed start hooks while startDaemon() waits for the resources to be synchronized.
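A minimal sketch of the "await in a goroutine" rule, assuming a hypothetical channel-based `promise` type (a stand-in for the daemon promise, not Cilium's promise package). The consumer hook returns immediately and a later hook resolves the promise; awaiting inline in the consumer hook would stall every hook after it, including the one that resolves the promise.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// promise is a hypothetical single-assignment value (not Cilium's promise package).
type promise[T any] struct {
	once sync.Once
	done chan struct{}
	val  T
}

func newPromise[T any]() *promise[T] { return &promise[T]{done: make(chan struct{})} }

// Resolve sets the value; only the first call has an effect.
func (p *promise[T]) Resolve(v T) {
	p.once.Do(func() {
		p.val = v
		close(p.done)
	})
}

// Await blocks until the promise is resolved or ctx is done.
func (p *promise[T]) Await(ctx context.Context) (T, error) {
	select {
	case <-p.done:
		return p.val, nil
	case <-ctx.Done():
		var zero T
		return zero, ctx.Err()
	}
}

// startHooks runs two hooks in order, like a hive lifecycle. The first needs
// the daemon; the second (hypothetically) lets the daemon finish bootstrapping.
func startHooks() string {
	daemon := newPromise[string]()
	result := make(chan string, 1)

	hooks := []func(){
		func() {
			// Await off the hook path; an inline Await here would never return.
			// context.Background() stands in for a long-lived context.
			go func() {
				d, err := daemon.Await(context.Background())
				if err != nil {
					result <- "error: " + err.Error()
					return
				}
				result <- "consumer got " + d
			}()
		},
		func() { daemon.Resolve("daemon") },
	}
	for _, h := range hooks {
		h() // each hook must return before the next one runs
	}
	return <-result
}

func main() {
	fmt.Println(startHooks())
}
```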

This change also allows reverting the change to the k8s policy watchers introduced in #32028, since the hive is no longer stalled by the daemon waiting for the k8s resources to synchronize.

During testing there were spurious error and warning logs from the vendored k8s code when Cilium resources were requested before the corresponding CRDs had been registered. This is resolved by providing a new CRDSync promise to the hive and awaiting that promise before requesting any Cilium resources. This eliminates the error and warning logs seen for CiliumNode, CiliumEnvoyConfig, and CiliumClusterwideEnvoyConfig resources. The CRDSync promise is optional so that it can be ignored in hives other than the agent hive.
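The CRDSync idea can be sketched with a plain channel, assuming a hypothetical `crdSync` type that is closed once the CRDs are registered (the actual promise type and wiring in Cilium may differ). A nil value models the optional dependency in hives that do not provide it, so those callers skip the wait.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// crdSync is a hypothetical stand-in for the CRDSync promise: it is closed
// once the Cilium CRDs have been registered. A nil crdSync means the hive
// does not provide it (the dependency is optional), so there is nothing to wait for.
type crdSync <-chan struct{}

// startCiliumWatcher waits for CRD registration, if a crdSync is provided,
// before issuing the first request for a Cilium resource.
func startCiliumWatcher(ctx context.Context, crds crdSync, request func()) error {
	if crds != nil {
		select {
		case <-crds:
		case <-ctx.Done():
			return ctx.Err() // shut down before the CRDs were registered
		}
	}
	request()
	return nil
}

// tryStart reports whether the request was issued. cancelFirst simulates the
// agent shutting down before startCiliumWatcher is called.
func tryStart(crds crdSync, cancelFirst bool) (bool, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel()
	}
	requested := false
	err := startCiliumWatcher(ctx, crds, func() { requested = true })
	return requested, err
}

func main() {
	crds := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond) // CRD registration finishes a bit later
		close(crds)
	}()
	err := startCiliumWatcher(context.Background(), crds, func() {
		fmt.Println("requesting CiliumEnvoyConfig resources")
	})
	fmt.Println("err:", err)
}
```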

@jrajahalme jrajahalme added area/daemon Impacts operation of the Cilium daemon. area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. area/servicemesh GH issues or PRs regarding servicemesh labels Jun 9, 2024
@jrajahalme jrajahalme requested a review from gandro June 9, 2024 15:33
@jrajahalme jrajahalme requested review from a team as code owners June 9, 2024 15:33
@maintainer-s-little-helper

Commit c0add7e does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@jrajahalme jrajahalme requested a review from sayboras June 9, 2024 15:33
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jun 9, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 9, 2024
@jrajahalme jrajahalme force-pushed the daemon-on-start-do-not-block branch from 94a0d7e to 6bb1eed June 9, 2024 15:40
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 9, 2024
@jrajahalme
Member Author

Added missing sign-off to the revert commit.

@jrajahalme jrajahalme added the release-note/misc This PR makes changes that have no direct user impact. label Jun 9, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 9, 2024
@jrajahalme
Member Author

/test

@jrajahalme jrajahalme marked this pull request as draft June 9, 2024 16:28
@jrajahalme jrajahalme force-pushed the daemon-on-start-do-not-block branch from 6bb1eed to ce865ca June 9, 2024 16:43
@jrajahalme
Member Author

Changed to not use the context from the lifecycle OnStart hook when awaiting the daemon promise, as that context is canceled as soon as all the start hooks have executed, which may be before the daemon promise is resolved.
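The pitfall can be reproduced with a stdlib-only sketch. `startCtx` plays the role of the OnStart context (canceled once all start hooks have run) and `agentCtx` that of a context living until shutdown; both names, and the `awaitDaemon` helper standing in for the daemon promise's Await, are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// awaitDaemon is a hypothetical stand-in for awaiting the daemon promise.
func awaitDaemon(ctx context.Context, daemonReady <-chan struct{}) error {
	select {
	case <-daemonReady:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo returns the result of awaiting with a long-lived context and with the
// start context, for a daemon that becomes ready only after the start hooks ran.
func demo() (agentErr, startErr error) {
	startCtx, endStartHooks := context.WithCancel(context.Background())
	agentCtx, shutdown := context.WithCancel(context.Background())
	defer shutdown()

	daemonReady := make(chan struct{})
	done := make(chan error, 1)

	// Start hook body: hand the await to a goroutine and return.
	// It deliberately uses agentCtx, not startCtx.
	go func() { done <- awaitDaemon(agentCtx, daemonReady) }()

	endStartHooks()    // all start hooks have run, so startCtx is canceled now
	close(daemonReady) // ...and only afterwards does the daemon become ready
	agentErr = <-done

	// Awaiting with startCtx gives up instead; with a daemon that is not
	// ready yet, this always fails.
	startErr = awaitDaemon(startCtx, make(chan struct{}))
	return agentErr, startErr
}

func main() {
	agentErr, startErr := demo()
	fmt.Println("await with agent context:", agentErr)
	fmt.Println("await with start context:", startErr, errors.Is(startErr, context.Canceled))
}
```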

@jrajahalme jrajahalme marked this pull request as ready for review June 9, 2024 16:45
@jrajahalme
Member Author

/test

mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 23, 2025
PR cilium#32981 (commit cilium@9463a868475) changed
the deletion queue lifecycle logic to no longer process the queue in
the lifecycle start hook. Instead, the queue gets processed asynchronously
in a separate goroutine.

As a consequence, the unlock of the CNI lock file may happen prematurely.

Therefore, this commit fixes the deletion queue logic to explicitly wait
with a channel before unlocking the lock file.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
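A stdlib-only sketch of the fix described in this commit message, with hypothetical names throughout: `lockFile` models the CNI lock file with an atomic flag rather than a real file lock, and `startDeletionQueue` stands in for the deletion queue start logic. Processing runs in a goroutine, and the returned release function waits on a channel that the goroutine closes when it is done, so the lock cannot be released mid-processing.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// lockFile is a hypothetical stand-in for the CNI lock file; the flag only
// records whether the lock is currently held.
type lockFile struct{ locked atomic.Bool }

func (l *lockFile) Lock()   { l.locked.Store(true) }
func (l *lockFile) Unlock() { l.locked.Store(false) }

// startDeletionQueue takes the lock, processes the queued entries
// asynchronously, and returns a release function. Release waits on the
// processed channel before unlocking; unlocking right after starting the
// goroutine would let the lock go while entries are still being handled.
func startDeletionQueue(lock *lockFile, entries []string, handle func(string)) (release func()) {
	lock.Lock()
	processed := make(chan struct{})
	go func() {
		defer close(processed)
		for _, e := range entries {
			handle(e)
		}
	}()
	return func() {
		<-processed   // wait for the goroutine to finish...
		lock.Unlock() // ...before releasing the lock
	}
}

func main() {
	var lf lockFile
	release := startDeletionQueue(&lf, []string{"endpoint-a", "endpoint-b"}, func(id string) {
		fmt.Printf("deleting %s (lock held: %v)\n", id, lf.locked.Load())
	})
	release()
	fmt.Println("lock held after release:", lf.locked.Load())
}
```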
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 23, 2025
This commit refactors the Endpoint deletion queue to use
Hive Jobs instead of raw Hive lifecycle hooks. In addition, it
fixes the temporary context handling by using the `context.Context`
from the job function that is cancelled when the agent is terminated.

Note: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
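To illustrate why a job-provided context is preferable to `context.TODO()`, here is a stdlib-only sketch of a hypothetical `jobGroup` (it is not the cilium/hive job API): each one-shot job receives a context that is canceled when the group stops, so a job still waiting for a readiness signal returns on agent termination instead of lingering forever.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// jobGroup is a hypothetical, stdlib-only stand-in for a group of jobs; it is
// not the cilium/hive job API. Every job shares a context that is canceled
// when the group is stopped, e.g. on agent termination.
type jobGroup struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
	mu     sync.Mutex
	errs   []error
}

func newJobGroup() *jobGroup {
	ctx, cancel := context.WithCancel(context.Background())
	return &jobGroup{ctx: ctx, cancel: cancel}
}

// OneShot runs fn once in its own goroutine with the group's context and
// records any error it returns.
func (g *jobGroup) OneShot(fn func(ctx context.Context) error) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		if err := fn(g.ctx); err != nil {
			g.mu.Lock()
			g.errs = append(g.errs, err)
			g.mu.Unlock()
		}
	}()
}

// Stop cancels the jobs' context, waits for all jobs to return, and joins
// their errors (nil if there were none).
func (g *jobGroup) Stop() error {
	g.cancel()
	g.wg.Wait()
	g.mu.Lock()
	defer g.mu.Unlock()
	return errors.Join(g.errs...)
}

// shutdownBeforeReady starts a job that waits for a readiness signal that
// never arrives, then stops the group, as on agent termination.
func shutdownBeforeReady() error {
	g := newJobGroup()
	neverReady := make(chan struct{})
	g.OneShot(func(ctx context.Context) error {
		select {
		case <-neverReady:
			return nil
		case <-ctx.Done():
			return ctx.Err() // with context.TODO() this case could never fire
		}
	})
	return g.Stop()
}

func main() {
	fmt.Println("stop returned:", shutdownBeforeReady())
}
```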
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 23, 2025
This commit refactors the Endpoint deletion queue to use
Hive Jobs instead of raw Hive lifecycle hooks. In addition, it
fixes the temporary context handling by using the `context.Context`
from the job function that is cancelled when the agent is terminated.

Note: Further improving the error handling / reporting of the deletion queue
processing logic results in test errors. It might be better to improve this
in follow up PRs.

Note2: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 23, 2025
The Endpoint `DeletionQueue` is part of the Endpoint API, as it
is responsible for handling the Endpoint deletion requests from the CNI
that happened while the Endpoint API wasn't available.

Therefore, this commit moves the `DeletionQueue` from the daemon cells
to the Hive module `endpoint-api`, where it is provided as a private cell.

Note: Instead of depending on the `DaemonPromise`, the deletion queue now
depends on the `EndpointRestorationPromise`. Under the hood this is the same,
because the latter is resolved once the former is resolved. This is simply to work around
cyclic dependencies, and it's more expressive.

Note2: The deletion queue no longer has access to the lifecycle context of
the daemon (`d.ctx`). Therefore, this commit temporarily uses a `context.TODO()`.
This will be fixed in a follow up commit that refactors the deletion queue
to use Hive Jobs instead of raw Hive lifecycle hooks.

Note3: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
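The "same under the hood" argument can be sketched as a derived promise, assuming a hypothetical channel-based `promise` type (not Cilium's promise package) and hypothetical `daemon` and `endpointRestoration` types: the restoration promise is resolved by a goroutine once the daemon promise is resolved, so a consumer that depends only on the restoration promise observes the same timing without referencing the daemon promise directly.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// promise is a hypothetical single-assignment value (not Cilium's promise package).
type promise[T any] struct {
	once sync.Once
	done chan struct{}
	val  T
}

func newPromise[T any]() *promise[T] { return &promise[T]{done: make(chan struct{})} }

func (p *promise[T]) Resolve(v T) {
	p.once.Do(func() {
		p.val = v
		close(p.done)
	})
}

func (p *promise[T]) Await(ctx context.Context) (T, error) {
	select {
	case <-p.done:
		return p.val, nil
	case <-ctx.Done():
		var zero T
		return zero, ctx.Err()
	}
}

// derive returns a promise that is resolved with f(v) once src resolves with v.
// The goroutine exits when ctx is done, so an unresolved src does not keep it
// alive past shutdown.
func derive[A, B any](ctx context.Context, src *promise[A], f func(A) B) *promise[B] {
	dst := newPromise[B]()
	go func() {
		if v, err := src.Await(ctx); err == nil {
			dst.Resolve(f(v))
		}
	}()
	return dst
}

// Hypothetical payloads standing in for the daemon and the restoration result.
type daemon struct{ name string }
type endpointRestoration struct{ restoredBy string }

func demo() string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	daemonP := newPromise[*daemon]()
	restorationP := derive(ctx, daemonP, func(d *daemon) endpointRestoration {
		return endpointRestoration{restoredBy: d.name}
	})

	daemonP.Resolve(&daemon{name: "cilium-agent"})
	r, err := restorationP.Await(ctx)
	if err != nil {
		return "error: " + err.Error()
	}
	return "restored by " + r.restoredBy
}

func main() {
	fmt.Println(demo())
}
```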
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 24, 2025
The Endpoint `DeletionQueue` is part of the Endpoint API, as it
is responsible for handling the Endpoint deletion requests from the CNI
that happened while the Endpoint API wasn't available.

Therefore, this commit moves the `DeletionQueue` from the daemon cells
to the Hive module `endpoint-api`, where it is provided as a private cell.

Note: Instead of depending on the `DaemonPromise`, the deletion queue now
depends on the `EndpointRestorationPromise`. Under the hood this is the same,
because the latter is resolved once the former is resolved. This is simply to work around
cyclic dependencies, and it's more expressive.

Note2: The deletion queue no longer has access to the lifecycle context of
the daemon (`d.ctx`). Therefore, this commit temporarily uses a `context.TODO()`.
This will be fixed in a follow up commit that refactors the deletion queue
to use Hive Jobs instead of raw Hive lifecycle hooks.

Note3: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 24, 2025
This commit refactors the Endpoint deletion queue to use
Hive Jobs instead of raw Hive lifecycle hooks. In addition, it
fixes the temporary context handling by using the `context.Context`
from the job function that is cancelled when the agent is terminated.

Note: Further improving the error handling / reporting of the deletion queue
processing logic results in test errors (warnings detected - probably due to
the test setups). It might be better to improve this in follow up PRs.

Note2: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 24, 2025
PR cilium#32981 (commit cilium@9463a868475) changed
the deletion queue lifecycle logic to no longer process the queue in
the lifecycle start hook. Instead, the queue gets processed asynchronously
in a separate goroutine.

As a consequence, the unlock of the CNI lock file may happen prematurely.

Therefore, this commit fixes the deletion queue logic to explicitly wait
with a channel before unlocking the lock file.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 28, 2025
The Endpoint `DeletionQueue` is part of the Endpoint API, as it
is responsible for handling the Endpoint deletion requests from the CNI
that happened while the Endpoint API wasn't available.

Therefore, this commit moves the `DeletionQueue` from the daemon cells
to the Hive module `endpoint-api`, where it is provided as a private cell.

Note: Instead of depending on the `DaemonPromise`, the deletion queue now
depends on the `EndpointRestorationPromise`. Under the hood this is the same,
because the latter is resolved once the former is resolved. This is simply to work around
cyclic dependencies, and it's more expressive.

Note2: The deletion queue no longer has access to the lifecycle context of
the daemon (`d.ctx`). Therefore, this commit temporarily uses a `context.TODO()`.
This will be fixed in a follow up commit that refactors the deletion queue
to use Hive Jobs instead of raw Hive lifecycle hooks.

Note3: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 28, 2025
This commit refactors the Endpoint deletion queue to use
Hive Jobs instead of raw Hive lifecycle hooks. In addition, it
fixes the temporary context handling by using the `context.Context`
from the job function that is cancelled when the agent is terminated.

Note: Further improving the error handling / reporting of the deletion queue
processing logic results in test errors (warnings detected - probably due to
the test setups). It might be better to improve this in follow up PRs.

Note2: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 28, 2025
PR cilium#32981 (commit cilium@9463a868475) changed
the deletion queue lifecycle logic to no longer process the queue in
the lifecycle start hook. Instead, the queue gets processed asynchronously
in a separate goroutine.

As a consequence, the unlock of the CNI lock file may happen prematurely.

Therefore, this commit fixes the deletion queue logic to explicitly wait
with a channel before unlocking the lock file.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 30, 2025
The Endpoint `DeletionQueue` is part of the Endpoint API, as it
is responsible for handling the Endpoint deletion requests from the CNI
that happened while the Endpoint API wasn't available.

Therefore, this commit moves the `DeletionQueue` from the daemon cells
to the Hive module `endpoint-api`, where it is provided as a private cell.

Note: Instead of depending on the `DaemonPromise`, the deletion queue now
depends on the `EndpointRestorationPromise`. Under the hood this is the same,
because the latter is resolved once the former is resolved. This is simply to work around
cyclic dependencies, and it's more expressive.

Note2: The deletion queue no longer has access to the lifecycle context of
the daemon (`d.ctx`). Therefore, this commit temporarily uses a `context.TODO()`.
This will be fixed in a follow up commit that refactors the deletion queue
to use Hive Jobs instead of raw Hive lifecycle hooks.

Note3: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 30, 2025
This commit refactors the Endpoint deletion queue to use
Hive Jobs instead of raw Hive lifecycle hooks. In addition, it
fixes the temporary context handling by using the `context.Context`
from the job function that is cancelled when the agent is terminated.

Note: Further improving the error handling / reporting of the deletion queue
processing logic results in test errors (warnings detected - probably due to
the test setups). It might be better to improve this in follow up PRs.

Note2: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Apr 30, 2025
PR cilium#32981 (commit cilium@9463a868475) changed
the deletion queue lifecycle logic to no longer process the queue in
the lifecycle start hook. Instead, the queue gets processed asynchronously
in a separate goroutine.

As a consequence, the unlock of the CNI lock file may happen prematurely.

Therefore, this commit fixes the deletion queue logic to explicitly wait
with a channel before unlocking the lock file.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request May 2, 2025
The Endpoint `DeletionQueue` is part of the Endpoint API, as it
is responsible for handling the Endpoint deletion requests from the CNI
that happened while the Endpoint API wasn't available.

Therefore, this commit moves the `DeletionQueue` from the daemon cells
to the Hive module `endpoint-api`, where it is provided as a private cell.

Note: Instead of depending on the `DaemonPromise`, the deletion queue now
depends on the `EndpointRestorationPromise`. Under the hood this is the same,
because the latter is resolved once the former is resolved. This is simply to work around
cyclic dependencies, and it's more expressive.

Note2: The deletion queue no longer has access to the lifecycle context of
the daemon (`d.ctx`). Therefore, this commit temporarily uses a `context.TODO()`.
This will be fixed in a follow up commit that refactors the deletion queue
to use Hive Jobs instead of raw Hive lifecycle hooks.

Note3: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request May 2, 2025
This commit refactors the Endpoint deletion queue to use
Hive Jobs instead of raw Hive lifecycle hooks. In addition, it
fixes the temporary context handling by using the `context.Context`
from the job function that is cancelled when the agent is terminated.

Note: Further improving the error handling / reporting of the deletion queue
processing logic results in test errors (warnings detected - probably due to
the test setups). It might be better to improve this in follow up PRs.

Note2: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request May 2, 2025
PR cilium#32981 (commit cilium@9463a868475) changed
the deletion queue lifecycle logic to no longer process the queue in
the lifecycle start hook. Instead, the queue gets processed asynchronously
in a separate goroutine.

As a consequence, the unlock of the CNI lock file may happen prematurely.

Therefore, this commit fixes the deletion queue logic to explicitly wait
with a channel before unlocking the lock file.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request May 6, 2025
The Endpoint `DeletionQueue` is part of the Endpoint API, as it
is responsible for handling the Endpoint deletion requests from the CNI
that happened while the Endpoint API wasn't available.

Therefore, this commit moves the `DeletionQueue` from the daemon cells
to the Hive module `endpoint-api`, where it is provided as a private cell.

Note: Instead of depending on the `DaemonPromise`, the deletion queue now
depends on the `EndpointRestorationPromise`. Under the hood this is the same,
because the latter is resolved once the former is resolved. This is simply to work around
cyclic dependencies, and it's more expressive.

Note2: The deletion queue no longer has access to the lifecycle context of
the daemon (`d.ctx`). Therefore, this commit temporarily uses a `context.TODO()`.
This will be fixed in a follow up commit that refactors the deletion queue
to use Hive Jobs instead of raw Hive lifecycle hooks.

Note3: Looks like PR #32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request May 6, 2025
This commit refactors the Endpoint deletion queue to use
Hive Jobs instead of raw Hive lifecycle hooks. In addition, it
fixes the temporary context handling by using the `context.Context`
from the job function that is cancelled when the agent is terminated.

Note: Further improving the error handling / reporting of the deletion queue
processing logic results in test errors (warnings detected - probably due to
the test setups). It might be better to improve this in follow up PRs.

Note2: Looks like PR #32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request May 6, 2025
PR cilium#32981 (commit cilium@9463a868475) changed
the deletion queue lifecycle logic to no longer process the queue in
the lifecycle start hook. Instead, the queue gets processed asynchronously
in a separate goroutine.

As a consequence, the unlock of the CNI lock file may happen prematurely.

Therefore, this commit fixes the deletion queue logic to explicitly wait
with a channel before unlocking the lock file.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
hsalluri259 pushed a commit to hsalluri259/cilium that referenced this pull request May 14, 2025
The Endpoint `DeletionQueue` is part of the Endpoint API, as it
is responsible for handling the Endpoint deletion requests from the CNI
that happened while the Endpoint API wasn't available.

Therefore, this commit moves the `DeletionQueue` from the daemon cells
to the Hive module `endpoint-api`, where it is provided as a private cell.

Note: Instead of depending on the `DaemonPromise`, the deletion queue now
depends on the `EndpointRestorationPromise`. Under the hood this is the same,
because the latter is resolved once the former is resolved. This is simply to work around
cyclic dependencies, and it's more expressive.

Note2: The deletion queue no longer has access to the lifecycle context of
the daemon (`d.ctx`). Therefore, this commit temporarily uses a `context.TODO()`.
This will be fixed in a follow up commit that refactors the deletion queue
to use Hive Jobs instead of raw Hive lifecycle hooks.

Note3: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
hsalluri259 pushed a commit to hsalluri259/cilium that referenced this pull request May 14, 2025
This commit refactors the Endpoint deletion queue to use
Hive Jobs instead of raw Hive lifecycle hooks. In addition, it
fixes the temporary context handling by using the `context.Context`
from the job function that is cancelled when the agent is terminated.

Note: Further improving the error handling / reporting of the deletion queue
processing logic results in test errors (warnings detected - probably due to
the test setups). It might be better to improve this in follow up PRs.

Note2: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
hsalluri259 pushed a commit to hsalluri259/cilium that referenced this pull request May 14, 2025
PR cilium#32981 (commit cilium@9463a868475) changed
the deletion queue lifecycle logic to no longer process the queue in
the lifecycle start hook. Instead, the queue gets processed asynchronously
in a separate goroutine.

As a consequence, the unlock of the CNI lock file may happen prematurely.

Therefore, this commit fixes the deletion queue logic to explicitly wait
with a channel before unlocking the lock file.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Labels
area/daemon Impacts operation of the Cilium daemon. area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. area/servicemesh GH issues or PRs regarding servicemesh backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-blocker/1.16 This issue will prevent the release of the next version of Cilium. release-note/misc This PR makes changes that have no direct user impact. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
Archived in project