endpoint: extract Endpoint API from daemon #39014
Conversation
Force-pushed from 16a773c to f4b8db3.

rebased to f4b8db3

/test
In preparation for extracting the endpoint metadata fetcher from the daemon, this commit moves the methods `fetchK8sMetadataForEndpoint` & `fetchK8sMetadataForEndpointFromPod` to the `endpointMetadataFetcher` struct. In addition, the interface `k8sPodMetadataFetcher` is introduced, which allows wrapping the `K8sWatcher` for testing. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
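A minimal, hypothetical sketch of the testing seam described in this commit: the metadata fetcher depends on a narrow interface instead of the concrete `K8sWatcher`, so tests can substitute a fake. Only the identifiers quoted in the commit message are taken from the PR; the method names, fields, and fake are illustrative.

```go
package endpointmetadata

import "fmt"

// k8sPodMetadataFetcher is the small surface the fetcher needs; in the real
// code it would be satisfied by the K8sWatcher.
type k8sPodMetadataFetcher interface {
	GetCachedPodLabels(namespace, podName string) (map[string]string, error)
}

// endpointMetadataFetcher now owns the metadata-fetching logic that previously
// lived on the Daemon.
type endpointMetadataFetcher struct {
	k8s k8sPodMetadataFetcher
}

func (f *endpointMetadataFetcher) fetchK8sMetadataForEndpoint(namespace, podName string) (map[string]string, error) {
	labels, err := f.k8s.GetCachedPodLabels(namespace, podName)
	if err != nil {
		return nil, fmt.Errorf("fetching metadata for pod %s/%s: %w", namespace, podName, err)
	}
	return labels, nil
}

// fakePodMetadata is a test double implementing k8sPodMetadataFetcher.
type fakePodMetadata struct{ labels map[string]string }

func (f fakePodMetadata) GetCachedPodLabels(string, string) (map[string]string, error) {
	return f.labels, nil
}
```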
In preparation for extracting the Endpoint API, this commit extracts the `EndpointMetadataFetcher` into its own Hive Cell `endpoint-metadata`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
This commit removes the method `Daemon.deleteEndpointQuiet` which is a 1:1 delegation to `EndpointManager.RemoveEndpoint`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
The method `Daemon.getEndpointList` is currently used by the Endpoint API and the Debug API. In preparation for extracting the Endpoint API, this commit moves the method from the `Daemon` to the `EndpointManager`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Currently, the EndpointCreationManager that keeps track of the Endpoint creation requests is created directly in the daemon. In preparation for extracting the Endpoint API, this commit extracts the `EndpointCreationManager` into its own Hive Cell `endpoint-api`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Currently, some methods on the `Daemon` are used by the Endpoint API and DeletionQueue. In preparation for extracting the Endpoint API, this commit extracts these methods from the `Daemon` and introduces a new Hive component `EndpointAPIManager` that provides this functionality. It is provided by the Hive module `endpoint-api`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
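For readers less familiar with the Hive pattern these commits keep referring to, a rough sketch of what a module like `endpoint-api` could look like is shown below. Only the module name and the `EndpointCreationManager` / `EndpointAPIManager` names come from the commits above; the import path, constructors, and struct contents are assumptions for illustration.

```go
package endpointapi

import (
	"github.com/cilium/hive/cell"
)

// Stub types standing in for the real managers described in the commits.
type EndpointCreationManager struct{ /* tracks in-flight creation requests */ }
type EndpointAPIManager struct{ /* daemon-independent Endpoint API helpers */ }

func newEndpointCreationManager() *EndpointCreationManager { return &EndpointCreationManager{} }
func newEndpointAPIManager() *EndpointAPIManager           { return &EndpointAPIManager{} }

// Cell provides the components backing the Endpoint API, replacing logic that
// previously lived directly on the Daemon.
var Cell = cell.Module(
	"endpoint-api",
	"Serves the Endpoint API and related CNI integration",

	cell.Provide(newEndpointCreationManager),
	cell.Provide(newEndpointAPIManager),
)
```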
The Endpoint `DeletionQueue` is part of the Endpoint API, as it is responsible for handling the Endpoint deletion requests from the CNI that occurred while the Endpoint API wasn't available. Therefore, this commit moves the `DeletionQueue` from the daemon cells to the Hive module `endpoint-api`, where it is provided as a private cell. Note: Instead of depending on the `DaemonPromise`, the deletion queue now depends on the `EndpointRestorationPromise`. Under the hood this is the same, because the latter is resolved once the former is resolved. It is simply a way to work around cyclic dependencies - and it is more expressive. Note 2: The deletion queue no longer has access to the lifecycle context of the daemon (`d.ctx`). Therefore, this commit temporarily uses a `context.TODO()`. This will be fixed in a follow-up commit that refactors the deletion queue to use Hive Jobs instead of raw Hive lifecycle hooks. Note 3: It looks like PR cilium#32981 introduced a regression in the deletion queue lock/unlock behaviour (premature unlocking). This will be fixed in a following commit. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
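As a rough illustration of the promise-based dependency mentioned in the first note, a deletion queue depending on a restoration promise might look like the sketch below. It assumes Cilium's `promise` package with its `Await` method; the `endpointRestorationComplete` type, field names, and `run` method are purely illustrative.

```go
package endpointapi

import (
	"context"

	"github.com/cilium/cilium/pkg/promise"
)

// endpointRestorationComplete is a hypothetical marker resolved once endpoint
// restoration has finished (mirroring what the EndpointRestorationPromise
// conveys in the real code).
type endpointRestorationComplete struct{}

// deletionQueue depends only on the restoration promise instead of the whole
// Daemon, which breaks the dependency cycle between the daemon and the
// endpoint-api module.
type deletionQueue struct {
	restoration promise.Promise[endpointRestorationComplete]
}

func (dq *deletionQueue) run(ctx context.Context) error {
	// Block until endpoints have been restored (or the context is cancelled)
	// before handling the queued CNI deletion requests.
	if _, err := dq.restoration.Await(ctx); err != nil {
		return err
	}
	// ... process queued deletion requests here ...
	return nil
}
```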
This commit extracts the Endpoint API handlers from the Daemon to the new Hive module `endpoint-api`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
This commit refactors the Endpoint deletion queue to use Hive Jobs instead of raw Hive lifecycle hooks. In addition, it fixes the temporary context handling by using the `context.Context` from the job function, which is cancelled when the agent is terminated. Note: Further improving the error handling / reporting of the deletion queue processing logic results in test errors (warnings detected - probably due to the test setups). It might be better to improve this in follow-up PRs. Note 2: It looks like PR cilium#32981 introduced a regression in the deletion queue lock/unlock behaviour (premature unlocking). This will be fixed in a following commit. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
This commit migrates the package `pkg/endpoint/api` from `logrus` to `slog`. The only exception is the usage of `Endpoint.GetLogger`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
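As a reminder of what such a migration looks like in general (this snippet is not taken from the PR diff; the message and field names are made up), a logrus call and its `slog` counterpart might be:

```go
package endpointapi

import "log/slog"

// Before (logrus, schematic):
//   log.WithField("endpointID", id).WithError(err).Warn("failed to delete endpoint")
//
// After, using a *slog.Logger injected into the package:
func warnDeleteFailed(logger *slog.Logger, id uint16, err error) {
	logger.Warn("failed to delete endpoint",
		slog.Uint64("endpointID", uint64(id)),
		slog.Any("error", err),
	)
}
```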
Currently, various endpoint-related Hive modules are directly registered in the daemon cells. Therefore, this commit introduces a new meta Hive group that provides all these modules. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
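A sketch of what such a meta group might look like, assuming the Hive framework's `cell.Group` helper; the package layout and the exported `Cell` variables are assumptions:

```go
package endpointcells

import (
	"github.com/cilium/hive/cell"

	endpointapi "github.com/cilium/cilium/pkg/endpoint/api"
)

// Cell bundles the endpoint-related modules so the daemon registers a single
// group instead of each module individually.
var Cell = cell.Group(
	endpointapi.Cell,
	// ... endpoint-metadata and the other endpoint-related modules ...
)
```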
PR cilium#32981 (commit cilium@9463a868475) changed the deletion queue lifecycle logic to no longer process the queue in the lifecycle start hook. Instead, the queue is processed asynchronously in a separate goroutine. As a consequence, the unlock of the CNI lock file may happen prematurely. Therefore, this commit fixes the deletion queue logic to explicitly wait on a channel before unlocking the lock file. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Force-pushed from ccce06f to 6de4435.

rebased to 6de4435

/test
```go
// unlockAfterAPIServer registers a start hook that runs after API server
// has started and the deletion queue has been drained to unlock the
// delete queue and thus allow CNI plugin to proceed.
func unlockAfterAPIServer(jobGroup job.Group, _ *server.Server, dq *DeletionQueue) {
	jobGroup.Add(job.OneShot("unlock-lockfile", func(ctx context.Context, health cell.Health) error {
		// Explicitly wait until deletion queue finished processing or job context is cancelled
		select {
		case <-ctx.Done():
			// continue and unlock
		case <-dq.processed:
			// continue and unlock
		}

		if dq.lf != nil {
			unlockErr := dq.lf.Unlock()
			closeErr := dq.lf.Close()

			if unlockErr != nil || closeErr != nil {
				return fmt.Errorf("failed to unlock deletion queue lock file: %w", errors.Join(unlockErr, closeErr))
			}
		}

		return nil
	}))
}
```
Why not put this logic into `dq.Process()`? It's an unusual pattern to have one goroutine grab a lock and then delegate the release of the lock to another goroutine. I think that, strictly speaking, given the synchronization between the goroutines with the channel, it's probably technically correct, but this seems ripe for abuse if the logic is ever extended. As a general rule, locks should be held in a single goroutine.
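A hypothetical sketch of the suggested alternative, with the unlock folded into `dq.Process()` so the lock file is acquired and released in the same goroutine. The `lockFile` interface and the drain helper below are stand-ins; only `dq.lf`, `Unlock`, and `Close` mirror the diff above.

```go
package endpointapi

import (
	"context"
	"errors"
)

// lockFile captures the two operations used in the diff above; the concrete
// lock-file type in Cilium is assumed, not shown.
type lockFile interface {
	Unlock() error
	Close() error
}

// DeletionQueue is reduced to the single field this sketch needs.
type DeletionQueue struct {
	lf lockFile
}

// Process drains the queued CNI deletion requests and, following the
// suggestion, releases the lock file in the same goroutine that acquired it,
// rather than signalling a separate "unlock-lockfile" job via a channel.
func (dq *DeletionQueue) Process(ctx context.Context) error {
	defer func() {
		if dq.lf != nil {
			// Unlock and close once processing has finished (or was cancelled);
			// errors are joined and ignored here for brevity.
			_ = errors.Join(dq.lf.Unlock(), dq.lf.Close())
		}
	}()

	// drainQueuedDeletionRequests is a hypothetical stand-in for the real
	// per-request processing loop.
	return dq.drainQueuedDeletionRequests(ctx)
}

func (dq *DeletionQueue) drainQueuedDeletionRequests(ctx context.Context) error {
	// ... iterate over queued endpoint deletion requests until done or ctx is cancelled ...
	return ctx.Err()
}
```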
This PR extracts the Endpoint API and other CNI-related components from the daemon.
Components:
Please review the individual commits
Note: Commit `endpoint: fix deletion queue lock/unlock handling` fixes a potential regression that was introduced with #32981. This needs to be manually backported to v1.17 & v1.16.
/cc @squeed - regarding CNI integration
/cc @joamaki - who initially modularized the deletion queue / API aspects (lock, unlock, ...)