Skip to content

endpoint: extract Endpoint API from daemon #39014

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

mhofstetter
Copy link
Member

@mhofstetter mhofstetter commented Apr 17, 2025

This PR extracts the Endpoint API and other CNI related components from the daemon.

Components:

  • Endpoint API handlers
  • Endpoint API manager
  • Endpoint Deletion Queue
  • Endpoint Creation Manager

Please review the individual commits

Note: Commit endpoint: fix deletion queue lock/unlock handling fixes a potential regression that has been introduced with #32981. This needs to be manually backported to v1.17 & v1.16.

/cc @squeed - regarding CNI integration
/cc @joamaki - who initially modularized the deletionqueue / api aspects (lock, unlock, ...)

@mhofstetter mhofstetter added kind/cleanup This includes no functional changes. release-note/misc This PR makes changes that have no direct user impact. area/modularization Relates to code modularization and maintenance. labels Apr 17, 2025
@mhofstetter mhofstetter force-pushed the pr/mhofstetter/daemon-extract-endpoint-api branch 18 times, most recently from 16a773c to f4b8db3 Compare April 23, 2025 11:34
@mhofstetter mhofstetter marked this pull request as ready for review April 23, 2025 11:35
@mhofstetter mhofstetter requested review from a team as code owners April 23, 2025 11:35
@mhofstetter mhofstetter removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Apr 28, 2025
@mhofstetter
Copy link
Member Author

rebased to main and resolved conflicts

@mhofstetter
Copy link
Member Author

/test

@mhofstetter mhofstetter requested a review from tklauser April 28, 2025 07:54
In preparation to extract the endpoint metadata fetcher from the daemon,
this commit moves the methods `fetchK8sMetadataForEndpoint` &
`fetchK8sMetadataForEndpointFromPod` to the `endpointMetadataFetcher` struct.

In addition, the interface `k8sPodMetadataFetcher` has been introduced that
provides support for wrapping the `K8sWatcher` for testing.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
In preparation for extracting the Endpoint API, this commit extracts
the `EndpointMetadataFetcher` into its own Hive Cell `endpoint-metadata`

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
This commit removes the method `Daemon.deleteEndpointQuiet` which is
a 1:1 delegation to `EndpointManager.RemoveEndpoint`.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
The method `Daemon.getEndpointList` is currently used by the
Endpoint API and the Debug API.

In preparation for extracting the Endpoint API, this commit moves
the method from the `Daemon` to the `EndpointManager`.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Currently, the EndpointCreationManager that keeps track of the Endpoint
creation requests is created directly in the daemon.

In preparation for extracting the Endpoint API, this commit extracts
the `EndpointCreationManager` into its own Hive Cell `endpoint-api`

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Currently, some methods on the `Daemon` are used by the Endpoint API
and DeletionQueue.

In preparation for extracting the Endpoint API, this commit extracts
these methods from the `Daemon` and introduces a new Hive component
`EndpointAPIManager` that provides this functionality. It is provided
by the Hive module `endpoint-api`.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
The Endpoint `DeletionQueue` is part of the Endpoint API as it
is responsible to handle the Endpoint deletion requests from the CNI
that happened during the time the Endpoint API wasn't available.

Therefore, this commit moves the `DeletionQueue` from the daemon cells
to the Hive module `endpoint-api`, where it is provided as private cell.

Note: Instead of depending on the `DaemonPromise`, the deletion queue now
depends on the `EndpointRestorationPromise`. Under the hood this is the same,
because the later is resolved once the first is resolved. It's simply to workaround
cyclic dependencies - and it's more expressive.

Note2: The deletion queue no longer has access to the lifecycle context of
the daemon (`d.ctx`). Therefore, this commit temporarily uses a `context.TODO()`.
This will be fixed in a follow up commit that refactors the deletion queue
to use Hive Jobs instead of raw Hive lifecycle hooks.

Note3: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
This commit extracts the Endpoint API handlers from the Daemon to
the new Hive module `endpoint-api`.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
This commit refactors the Endpoint deletion queue to use
Hive Jobs instead of raw Hive lifecycle hooks. In addition, it
fixes the temporary context handling by using the `context.Context`
from the job function that is cancelled when the agent is terminated.

Note: Further improving the error handling / reporting of the deletion queue
processing logic results in test errors (warnings detected - probably due to
the test setups). It might be better to improve this in follow up PRs.

Note2: Looks like PR cilium#32981 introduced a regression in the deletion queue
lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
This commit migrates the package `pkg/endpoint/api` from
`logrus` to `slog`.

The only exceptions is the usage of `Endpoint.GetLogger`.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Currently, various endpoint related Hive modules are directly registered
in the daemon cells.

Therefore, this commit introduces a new meta hive group that provides
all these modules.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
PR cilium#32981 (commit cilium@9463a868475) changed
the deletion queue lifecycle logic to no longer process the queue in
the lifecycle start hook. Instead the queue gets processed asynchronously
in a separate Go routine.

As a consequence, the unlock of the CNI lock file may happen prematurely.

Therefore, this commit fixes the deletion queue logic to explicitly wait
with a channel before unlocking the lock file.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
@mhofstetter mhofstetter force-pushed the pr/mhofstetter/daemon-extract-endpoint-api branch from ccce06f to 6de4435 Compare April 30, 2025 06:52
@mhofstetter
Copy link
Member Author

rebased to main to resolve conflicts

@mhofstetter
Copy link
Member Author

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 6, 2025
@tklauser tklauser enabled auto-merge May 6, 2025 07:17
@tklauser tklauser added this pull request to the merge queue May 6, 2025
Merged via the queue into cilium:main with commit 5562edb May 6, 2025
66 checks passed
@mhofstetter mhofstetter deleted the pr/mhofstetter/daemon-extract-endpoint-api branch May 6, 2025 07:39
Comment on lines +161 to +186
// unlockAfterAPIServer registers a start hook that runs after API server
// has started and the deletion queue has been drained to unlock the
// delete queue and thus allow CNI plugin to proceed.
func unlockAfterAPIServer(jobGroup job.Group, _ *server.Server, dq *DeletionQueue) {
jobGroup.Add(job.OneShot("unlock-lockfile", func(ctx context.Context, health cell.Health) error {
// Explicitly wait until deletion queue finished processing or job context is cancelled
select {
case <-ctx.Done():
// continue and unlock
case <-dq.processed:
// continue and unlock
}

if dq.lf != nil {
unlockErr := dq.lf.Unlock()
closeErr := dq.lf.Close()

if unlockErr != nil || closeErr != nil {
return fmt.Errorf("failed to unlock deletion queue lock file: %w", errors.Join(unlockErr, closeErr))
}
}

return nil
}))
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not put this logic into dq.Process()? It's an unusual pattern to have one goroutine grab a lock and then delegate the release of the lock to another goroutine. I think that strictly speaking given the synchronization between the goroutines with the channel it's probably technically correct, but this seems ripe for abuse if the logic is ever extended. As a general rule locks should be held in a single goroutine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/modularization Relates to code modularization and maintenance. kind/cleanup This includes no functional changes. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants