Wait for CEC and CCEC resources before restoring endpoints. #32981
Commit c0add7e does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
94a0d7e to 6bb1eed
Added missing sign-off to the revert commit.
/test
6bb1eed to ce865ca
Changed the daemon promise await to not use the context from the lifecycle OnStart hook, as that context is canceled as soon as all the start hooks have executed, which may be before the daemon promise is resolved.
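The context problem described in this comment can be reproduced in isolation. The following is a minimal sketch, not the code from this PR: the `promise` type, `awaitBoth`, and the context names are hypothetical stand-ins, and the start hook context is modeled as a child context that is canceled once all start hooks have returned.

```go
package main

import (
	"context"
	"fmt"
)

// promise is a hypothetical minimal promise that is resolved by closing ch.
type promise struct{ ch chan struct{} }

// Await blocks until the promise is resolved or ctx is done.
func (p *promise) Await(ctx context.Context) error {
	select {
	case <-p.ch:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// awaitBoth awaits the same promise once with the start hook context and
// once with an agent-lifetime context, and reports how each await ended.
func awaitBoth() (hookCanceled, agentResolved bool) {
	p := &promise{ch: make(chan struct{})}

	agentCtx, stopAgent := context.WithCancel(context.Background())
	defer stopAgent()
	// Assumption: the start hook context ends when all start hooks are done.
	hookCtx, finishStart := context.WithCancel(agentCtx)

	hookRes := make(chan error, 1)
	agentRes := make(chan error, 1)
	go func() { hookRes <- p.Await(hookCtx) }()
	go func() { agentRes <- p.Await(agentCtx) }()

	finishStart() // all start hooks have returned
	hookErr := <-hookRes
	hookCanceled = hookErr == context.Canceled

	close(p.ch) // the promise resolves only after the start phase
	agentErr := <-agentRes
	agentResolved = agentErr == nil

	return hookCanceled, agentResolved
}

func main() {
	hookCanceled, agentResolved := awaitBoth()
	fmt.Println("await with hook context canceled:", hookCanceled)
	fmt.Println("await with agent context resolved:", agentResolved)
}
```

The await that uses the hook context gives up before the promise is resolved, while the await that uses the longer-lived context completes normally.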
/test |
PR cilium#32981 (commit cilium@9463a868475) changed the deletion queue lifecycle logic so that the queue is no longer processed in the lifecycle start hook. Instead, the queue is processed asynchronously in a separate goroutine. As a consequence, the CNI lock file may be unlocked prematurely. This commit therefore fixes the deletion queue logic to explicitly wait on a channel before unlocking the lock file. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
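The wait-before-unlock pattern named in this commit message can be sketched as follows. This is an illustration under assumptions, not the actual deletion queue code: `processQueue` is a hypothetical function, a `sync.Mutex` stands in for the CNI lock file, and the recorded event strings exist only to make the ordering visible.

```go
package main

import (
	"fmt"
	"sync"
)

// processQueue handles the queued deletions in a separate goroutine and
// releases the lock only after that goroutine has signaled completion.
// It returns the recorded events in the order in which they happened.
func processQueue(pending []string) []string {
	var lock sync.Mutex // hypothetical stand-in for the CNI lock file
	var events []string
	processed := make(chan struct{})

	lock.Lock() // held while queued deletions are still pending

	go func() {
		defer close(processed) // signal that processing has finished
		for _, id := range pending {
			events = append(events, "deleted "+id)
		}
	}()

	// Unlocking right here would be premature, because the goroutine may
	// not have processed anything yet. Wait on the channel first.
	<-processed
	events = append(events, "unlocked")
	lock.Unlock()

	return events
}

func main() {
	for _, e := range processQueue([]string{"ep-1", "ep-2"}) {
		fmt.Println(e)
	}
}
```

Closing the channel in a `defer` keeps the lock from being held forever if the processing goroutine returns early.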
The Endpoint `DeletionQueue` is part of the Endpoint API, as it is responsible for handling the Endpoint deletion requests from the CNI that arrived while the Endpoint API wasn't available. Therefore, this commit moves the `DeletionQueue` from the daemon cells to the Hive module `endpoint-api`, where it is provided as a private cell. Note: Instead of depending on the `DaemonPromise`, the deletion queue now depends on the `EndpointRestorationPromise`. Under the hood this is the same, because the latter is resolved once the former is resolved. It's simply to work around cyclic dependencies - and it's more expressive. Note2: The deletion queue no longer has access to the lifecycle context of the daemon (`d.ctx`). Therefore, this commit temporarily uses a `context.TODO()`. This will be fixed in a follow-up commit that refactors the deletion queue to use Hive Jobs instead of raw Hive lifecycle hooks. Note3: Looks like PR cilium#32981 introduced a regression in the deletion queue lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
This commit refactors the Endpoint deletion queue to use Hive Jobs instead of raw Hive lifecycle hooks. In addition, it fixes the temporary context handling by using the `context.Context` from the job function, which is cancelled when the agent is terminated. Note: Further improving the error handling / reporting of the deletion queue processing logic results in test errors (warnings detected - probably due to the test setups). It might be better to improve this in follow-up PRs. Note2: Looks like PR cilium#32981 introduced a regression in the deletion queue lock/unlocking behaviour (premature unlocking). This will be fixed in a following commit. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Just as we currently wait for policy CRDs before regenerating restored endpoints for the first time after an agent restart, it is also beneficial to wait for CEC and CCEC resources. This is mainly due to possible policy redirections to CEC/CCEC listeners, but also due to L7 LB service annotations, as well as east/west Ingress and Gateway API, which are implemented via CEC/CCEC resources.
To achieve this, we first move the potentially blocking parts of the daemon lifecycle start hook into a goroutine, so that the hive may run concurrently with the daemon bootstrap. This is necessary because start hooks are executed in the order in which they are appended to the hive lifecycle, and the next start hook can only run once the previous one has returned.
startDaemon() blocks until a subset of k8s resources has been synchronized. As we add CEC and CCEC resources to that subset, we want to be able to use the hive lifecycle to start the resource watchers for these resources. This becomes possible by making sure that daemon start hooks do not block for this purpose. Similarly, any start hook that needs to await the daemon promise must perform that Await call in a goroutine. This allows the hive to run all the needed start hooks while startDaemon() is waiting for the resources to be synchronized.
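The ordering constraint described above can be shown with a small self-contained sketch. It does not use the real hive API: `startHook`, `runStartHooks`, `bootstrap`, and the two hooks are hypothetical stand-ins that only model hooks running one after another.

```go
package main

import (
	"context"
	"fmt"
)

// startHook is a hypothetical stand-in for a lifecycle start hook.
type startHook func(ctx context.Context) error

// runStartHooks runs the hooks strictly in order: the next hook starts
// only after the previous one has returned.
func runStartHooks(ctx context.Context, hooks []startHook) error {
	for _, hook := range hooks {
		if err := hook(ctx); err != nil {
			return err
		}
	}
	return nil
}

// bootstrap returns the order in which the two components made progress.
func bootstrap() []string {
	var events []string
	synced := make(chan struct{}) // closed once the watcher has synced
	done := make(chan struct{})   // closed once the daemon bootstrap is done

	daemonHook := func(ctx context.Context) error {
		// Waiting for <-synced directly in this hook would deadlock,
		// because watcherHook could never start. The wait therefore
		// runs in a goroutine and the hook returns immediately.
		go func() {
			defer close(done)
			<-synced
			events = append(events, "daemon: resources synced")
		}()
		return nil
	}
	watcherHook := func(ctx context.Context) error {
		events = append(events, "watcher: started")
		close(synced)
		return nil
	}

	hooks := []startHook{daemonHook, watcherHook}
	if err := runStartHooks(context.Background(), hooks); err != nil {
		return []string{"error: " + err.Error()}
	}
	<-done
	return events
}

func main() {
	for _, e := range bootstrap() {
		fmt.Println(e)
	}
}
```

Because the daemon hook returns immediately, the watcher hook that is appended after it still gets to run, and the daemon's wait completes once the watcher has synced.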
This change also allows reverting the change to the k8s policy watchers introduced in #32028, as the hive is no longer stalled by the daemon waiting for the k8s resources to synchronize.
During testing there were spurious error and warning logs from the vendored k8s code when Cilium resources were requested before the corresponding CRDs had been registered. This is resolved by providing a new CRDSync promise to the hive and awaiting that promise before requesting any Cilium resources. This eliminates the error and warning logs seen for CiliumNode, CiliumEnvoyConfig, and CiliumClusterwideEnvoyConfig resources. The CRDSync promise is optional, so that hives other than the agent hive can ignore it.
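One way to picture an optional promise that gates resource requests is sketched below. The names are assumptions made for illustration only (`crdSync`, `requestResource`, and `demo` are hypothetical, and a nil pointer models "no promise provided"); the real CRDSync promise and resource constructors in Cilium may look different.

```go
package main

import (
	"context"
	"fmt"
)

// crdSync is a hypothetical promise that resolves once the CRDs are registered.
type crdSync struct{ registered chan struct{} }

// Await blocks until the CRDs are registered or ctx is done.
func (c *crdSync) Await(ctx context.Context) error {
	select {
	case <-c.registered:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// requestResource waits for the CRDs, if a promise was provided, before
// "requesting" the named resource. A nil promise means that the caller
// does not track CRD registration, so the wait is skipped.
func requestResource(ctx context.Context, crds *crdSync, name string) string {
	if crds != nil {
		if err := crds.Await(ctx); err != nil {
			return name + ": not requested (" + err.Error() + ")"
		}
	}
	return name + ": requested"
}

// demo exercises the three cases: no promise, a resolved promise, and an
// unresolved promise whose wait is cut short by a canceled context.
func demo() []string {
	canceled, cancel := context.WithCancel(context.Background())
	cancel()

	pending := &crdSync{registered: make(chan struct{})}
	ready := &crdSync{registered: make(chan struct{})}
	close(ready.registered)

	return []string{
		requestResource(context.Background(), nil, "CiliumNode"),
		requestResource(context.Background(), ready, "CiliumEnvoyConfig"),
		requestResource(canceled, pending, "CiliumClusterwideEnvoyConfig"),
	}
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```

The request is issued only once the CRDs are known to be registered, and callers that pass no promise behave exactly as before.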