endpoint: extract Endpoint API from daemon #39014

mhofstetter · 2025-04-17T17:07:43Z

This PR extracts the Endpoint API and other CNI related components from the daemon.

Components:

Endpoint API handlers
Endpoint API manager
Endpoint Deletion Queue
Endpoint Creation Manager

Please review the individual commits

Note: Commit endpoint: fix deletion queue lock/unlock handling fixes a potential regression that has been introduced with #32981. This needs to be manually backported to v1.17 & v1.16.

/cc @squeed - regarding CNI integration
/cc @joamaki - who initially modularized the deletionqueue / api aspects (lock, unlock, ...)

mhofstetter · 2025-04-28T07:26:51Z

rebased to main and resolved conflicts

mhofstetter · 2025-04-28T07:35:08Z

/test

In preparation to extract the endpoint metadata fetcher from the daemon, this commit moves the methods `fetchK8sMetadataForEndpoint` & `fetchK8sMetadataForEndpointFromPod` to the `endpointMetadataFetcher` struct. In addition, the interface `k8sPodMetadataFetcher` has been introduced that provides support for wrapping the `K8sWatcher` for testing. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

In preparation for extracting the Endpoint API, this commit extracts the `EndpointMetadataFetcher` into its own Hive Cell `endpoint-metadata` Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

This commit removes the method `Daemon.deleteEndpointQuiet` which is a 1:1 delegation to `EndpointManager.RemoveEndpoint`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

The method `Daemon.getEndpointList` is currently used by the Endpoint API and the Debug API. In preparation for extracting the Endpoint API, this commit moves the method from the `Daemon` to the `EndpointManager`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

Currently, the EndpointCreationManager that keeps track of the Endpoint creation requests is created directly in the daemon. In preparation for extracting the Endpoint API, this commit extracts the `EndpointCreationManager` into its own Hive Cell `endpoint-api` Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

Currently, some methods on the `Daemon` are used by the Endpoint API and DeletionQueue. In preparation for extracting the Endpoint API, this commit extracts these methods from the `Daemon` and introduces a new Hive component `EndpointAPIManager` that provides this functionality. It is provided by the Hive module `endpoint-api`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

The Endpoint `DeletionQueue` is part of the Endpoint API as it is responsible to handle the Endpoint deletion requests from the CNI that happened during the time the Endpoint API wasn't available. Therefore, this commit moves the `DeletionQueue` from the daemon cells to the Hive module `endpoint-api`, where it is provided as private cell. Note: Instead of depending on the `DaemonPromise`, the deletion queue now depends on the `EndpointRestorationPromise`. Under the hood this is the same, because the later is resolved once the first is resolved. It's simply to workaround cyclic dependencies - and it's more expressive. Note2: The deletion queue no longer has access to the lifecycle context of the daemon (`d.ctx`). Therefore, this commit temporarily uses a `context.TODO()`. This will be fixed in a follow up commit that refactors the deletion queue to use Hive Jobs instead of raw Hive lifecycle hooks. Note3: Looks like PR cilium#32981 introduced a regression in the deletion queue lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

This commit extracts the Endpoint API handlers from the Daemon to the new Hive module `endpoint-api`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

This commit refactors the Endpoint deletion queue to use Hive Jobs instead of raw Hive lifecycle hooks. In addition, it fixes the temporary context handling by using the `context.Context` from the job function that is cancelled when the agent is terminated. Note: Further improving the error handling / reporting of the deletion queue processing logic results in test errors (warnings detected - probably due to the test setups). It might be better to improve this in follow up PRs. Note2: Looks like PR cilium#32981 introduced a regression in the deletion queue lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

This commit migrates the package `pkg/endpoint/api` from `logrus` to `slog`. The only exceptions is the usage of `Endpoint.GetLogger`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

Currently, various endpoint related Hive modules are directly registered in the daemon cells. Therefore, this commit introduces a new meta hive group that provides all these modules. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

PR cilium#32981 (commit cilium@9463a868475) changed the deletion queue lifecycle logic to no longer process the queue in the lifecycle start hook. Instead the queue gets processed asynchronously in a separate Go routine. As a consequence, the unlock of the CNI lock file may happen prematurely. Therefore, this commit fixes the deletion queue logic to explicitly wait with a channel before unlocking the lock file. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

mhofstetter · 2025-04-30T06:52:33Z

rebased to main to resolve conflicts

mhofstetter · 2025-04-30T11:41:10Z

/test

joestringer · 2025-08-05T21:06:58Z

pkg/endpoint/api/endpoint_deletion_queue.go

+// unlockAfterAPIServer registers a start hook that runs after API server
+// has started and the deletion queue has been drained to unlock the
+// delete queue and thus allow CNI plugin to proceed.
+func unlockAfterAPIServer(jobGroup job.Group, _ *server.Server, dq *DeletionQueue) {
+	jobGroup.Add(job.OneShot("unlock-lockfile", func(ctx context.Context, health cell.Health) error {
+		// Explicitly wait until deletion queue finished processing or job context is cancelled
+		select {
+		case <-ctx.Done():
+			// continue and unlock
+		case <-dq.processed:
+			// continue and unlock
+		}
+
+		if dq.lf != nil {
+			unlockErr := dq.lf.Unlock()
+			closeErr := dq.lf.Close()
+
+			if unlockErr != nil || closeErr != nil {
+				return fmt.Errorf("failed to unlock deletion queue lock file: %w", errors.Join(unlockErr, closeErr))
+			}
+		}
+
+		return nil
+	}))
+}
+


Why not put this logic into dq.Process()? It's an unusual pattern to have one goroutine grab a lock and then delegate the release of the lock to another goroutine. I think that strictly speaking given the synchronization between the goroutines with the channel it's probably technically correct, but this seems ripe for abuse if the logic is ever extended. As a general rule locks should be held in a single goroutine.

mhofstetter added kind/cleanup This includes no functional changes. release-note/misc This PR makes changes that have no direct user impact. area/modularization Relates to code modularization and maintenance. labels Apr 17, 2025

mhofstetter force-pushed the pr/mhofstetter/daemon-extract-endpoint-api branch 18 times, most recently from 16a773c to f4b8db3 Compare April 23, 2025 11:34

mhofstetter marked this pull request as ready for review April 23, 2025 11:35

mhofstetter requested review from a team as code owners April 23, 2025 11:35

mhofstetter requested review from viktor-kurchenko, tklauser and dylandreimerink April 23, 2025 11:35

mhofstetter removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Apr 28, 2025

mhofstetter requested a review from tklauser April 28, 2025 07:54

viktor-kurchenko approved these changes Apr 29, 2025

View reviewed changes

mhofstetter added 12 commits April 30, 2025 08:31

endpoint: provide k8s metadata fetcher via cell

42bfc12

In preparation for extracting the Endpoint API, this commit extracts the `EndpointMetadataFetcher` into its own Hive Cell `endpoint-metadata` Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

endpoint: remove Daemon.deleteEndpointQuiet

cb918b7

This commit removes the method `Daemon.deleteEndpointQuiet` which is a 1:1 delegation to `EndpointManager.RemoveEndpoint`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

endpoint: move endpoint api handlers from daemon to endpoint api cell

4d48795

This commit extracts the Endpoint API handlers from the Daemon to the new Hive module `endpoint-api`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

endpoint: migrate to slog in endpoint api package

ae2d36f

This commit migrates the package `pkg/endpoint/api` from `logrus` to `slog`. The only exceptions is the usage of `Endpoint.GetLogger`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

endpoint: meta cell

52f37c9

Currently, various endpoint related Hive modules are directly registered in the daemon cells. Therefore, this commit introduces a new meta hive group that provides all these modules. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

mhofstetter force-pushed the pr/mhofstetter/daemon-extract-endpoint-api branch from ccce06f to 6de4435 Compare April 30, 2025 06:52

tklauser approved these changes May 6, 2025

View reviewed changes

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 6, 2025

tklauser enabled auto-merge May 6, 2025 07:17

tklauser added this pull request to the merge queue May 6, 2025

Merged via the queue into cilium:main with commit 5562edb May 6, 2025
66 checks passed

mhofstetter deleted the pr/mhofstetter/daemon-extract-endpoint-api branch May 6, 2025 07:39

giorio94 mentioned this pull request Jun 17, 2025

ci: update clustermesh upgrade workflow to reflect ipsec changes once v1.17 is out #33760

Closed

joestringer reviewed Aug 5, 2025

View reviewed changes

joestringer mentioned this pull request Aug 5, 2025

Fix Endpoint delete request processing during agent startup #40568

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

endpoint: extract Endpoint API from daemon #39014

endpoint: extract Endpoint API from daemon #39014

Uh oh!

mhofstetter commented Apr 17, 2025 •

edited

Loading

Uh oh!

mhofstetter commented Apr 28, 2025

Uh oh!

mhofstetter commented Apr 28, 2025

Uh oh!

mhofstetter commented Apr 30, 2025

Uh oh!

mhofstetter commented Apr 30, 2025

Uh oh!

Uh oh!

joestringer Aug 5, 2025

Uh oh!

Uh oh!

endpoint: extract Endpoint API from daemon #39014

endpoint: extract Endpoint API from daemon #39014

Uh oh!

Conversation

mhofstetter commented Apr 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mhofstetter commented Apr 28, 2025

Uh oh!

mhofstetter commented Apr 28, 2025

Uh oh!

mhofstetter commented Apr 30, 2025

Uh oh!

mhofstetter commented Apr 30, 2025

Uh oh!

Uh oh!

joestringer Aug 5, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mhofstetter commented Apr 17, 2025 •

edited

Loading