-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Structured Health Reporter + EndpointManager Modular Health Checks #27522
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Commit a25c6155eb0e458037fdda5b56feb13b0074d5a1 does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
a1eb5e0
to
4f8aa82
Compare
Commits 375cb8deb6addd67141bc098637c3f4a8b7bc188, f87153e626257aa6376f7e4964268a6a480d748d do not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
4f8aa82
to
840c267
Compare
Commits 375cb8deb6addd67141bc098637c3f4a8b7bc188, 0ec151f04edde3f81ae7b1d82b7d2c4f093185f1 do not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
840c267
to
9c9ad84
Compare
f187a54
to
c1f1d5f
Compare
35d2a04
to
88437b6
Compare
/test |
1 similar comment
/test |
88437b6
to
65e12fc
Compare
65e12fc
to
c3da307
Compare
/test |
Currently using lifecycle in cell is not possible as that would introduce a cyclic dependency. This moves the type definition of HookContext to a separate package, as well as adds separate Lifecycle/HookInterface definitions that can be imported by pkg/hive/cell. Finally, pkg/hive now injects a wrapped hive.Lifecycle that converts pkg/hive/cell/lifecycle hooks. Thus allowing pkg/cell to depend on lifecycle hooks. This is done to avoid having to refactor the lifecycle types in pkg/hive which would result in having to fix imports across the codebase. Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
c3da307
to
d219948
Compare
/test |
Removing @cilium/tophat as that was seemingly added due to accidentally committing binary file. |
Structured health reporter initializes a context.Context object with a module scoped cell.HealthReporter. This can be then used to propegate a structured health reporter tree, where leaves represent active health reporters and intermediary nodes provide a common scope for all sub reporters. In subsequent commits, this will be used to provide modular health status reporting for more complex modules such as endpoint manager where many "sub-components" must also be healthy in order for the module to report as healthy. Currently, this is an opt-in layer built on top of the existing cell.HealthReporter interface, by layering functionality on the existing health reporter provided by the cell.Module. Currently the context is initialized with the structured tree reporter at the module level. To avoid the risk of deadlocks, the entire reporter tree uses one lock which is initialized when calling WithStructuredReporter. This is shared among all subreporters. Example: 1. Init root scope (note: this is provided to modules automatically). root := rootScope(ctx, hr) root 2. Create endpoint-manager subscope, and reporter under that scope (with ok!) endpointManagerScope := GetSubScope(ctx, root, "endpoint-manager") GetHealthReporter(endpointManagerScope, "endpoint-000").OK("it works!") root(OK) └── scope(endpoint-manager, OK) └── reporter(endpoint-000, OK) 3. Create another reporter under that scope with degraded GetHealthReporter(endpointManagerScope, "endpoint-000").Degraded("oh no!") root(Degraded) └── scope(endpoint-manager, Degraded) └── reporter(endpoint-000, OK) └── reporter(endpoint-000, Degraded) 4. Close the endpoint-manager scope endpointManagerScope.Stopped("no more status to report!") root(OK) // status has been reported, but we no longer have any degraded status // default to ok status. Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
When passed, controller runs will automatically update the health reporter. Upon failure, a passed reporter will be degraded, otherwise it will be marked as "ok" (i.e. healthy). Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Adds modular health checking to endpointmanager by using structured reporter to health check the following reconciliation tasks: * K8s endpoint sync * Endpoint regeneration * Endpoint policy map sync This means that the endpointmanager will be reported as healthy if all endpoints have their ep datapath regenerated, policy map in sync and cep sync controller has reconciled successfully. This will make it easier to understand the state of the endpointmanager/endpoint components of Cilium - providing a way to determine if there are any health problems with endpoints. Example output: ``` endpoint-manager OK 4s cilium-endpoint-1314 datapath-regenerate OK Endpoint regeneration successful sync-policymap-1314 OK sync-policymap-1314 cilium-endpoint-56 sync-to-k8s-ciliumendpoint OK sync-to-k8s-ciliumendpoint (56) datapath-regenerate OK Endpoint regeneration successful sync-policymap-56 OK sync-policymap-56 ``` Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Health reports will often be multiple lines, this improves formatting in such case. Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
d219948
to
1bd6d29
Compare
/test |
2 similar comments
/test |
/test |
Change in cilium#27522 accidentally removed the first line in a code comment [1] explaining the significance of failure to get a mutex lock on an endpoint. This undoes that change and fixes the comment. [1] cilium#27522 (comment) Reported-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Change in #27522 accidentally removed the first line in a code comment [1] explaining the significance of failure to get a mutex lock on an endpoint. This undoes that change and fixes the comment. [1] #27522 (comment) Reported-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Change in cilium#27522 accidentally removed the first line in a code comment [1] explaining the significance of failure to get a mutex lock on an endpoint. This undoes that change and fixes the comment. [1] cilium#27522 (comment) Reported-by: Tobias Klauser <tobias@cilium.io> Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Adds "structured health reporter" which builds on top of existing cell.HealthReporter type.
The intention is to provide a mechanism to do health status reporting on more complex modules, namely ones with many sub-components that all need to be health checked as well.
Subsequent commits in this pull request use the structured reporter to create health status reporting for the endpointmanager module to provide better visibility into the runtime status of endpointmanager/endpoints in a Cilium Agent.
The structured health reporter uses a tree structured to aggregate sub-reporters up to a top level report, which is ultimately passed to the "module" scoped health reporter to declare the status of the module component.
This PR adds the structured health reporter, and in subsequent commits implements structured health reporting on the endpointmanager module to provide better insight in the system health of Endpoints/EndpointManager.
Follow up work: