Skip to content

Conversation

tommyp1ckles
Copy link
Contributor

@tommyp1ckles tommyp1ckles commented Aug 15, 2023

Adds "structured health reporter" which builds on top of existing cell.HealthReporter type.

The intention is to provide a mechanism to do health status reporting on more complex modules, namely ones with many sub-components that all need to be health checked as well.

Subsequent commits in this pull request use the structured reporter to create health status reporting for the endpointmanager module to provide better visibility into the runtime status of endpointmanager/endpoints in a Cilium Agent.

The structured health reporter uses a tree structured to aggregate sub-reporters up to a top level report, which is ultimately passed to the "module" scoped health reporter to declare the status of the module component.

This PR adds the structured health reporter, and in subsequent commits implements structured health reporting on the endpointmanager module to provide better insight in the system health of Endpoints/EndpointManager.

> cilium status --verbose

  ...
Modules Health:
  Module             Status    Message                  Last Updated
  daemon             Unknown   No status reported yet            n/a
  endpoint-manager   OK                  9s
  
   cilium-endpoint-1058
    cep-k8s-sync        OK sync-to-k8s-ciliumendpoint (1058) 3m10.022180274s ago (x20)
    datapath-regenerate OK Endpoint regeneration successful  3m10.016021308s ago (x2)
    policy map sync     OK sync-policymap-1058               3m6.639716331s ago  (x5)
   cilium-endpoint-1488
    cep-k8s-sync        OK sync-to-k8s-ciliumendpoint (1488) 3m10.022419627s ago (x20)
    datapath-regenerate OK Endpoint regeneration successful  3m10.014718641s ago (x2)
    policy map sync     OK sync-policymap-1488               3m6.633931551s ago  (x5)
   cilium-endpoint-1704
    cep-k8s-sync        OK sync-to-k8s-ciliumendpoint (1704) 3m10.022513188s ago (x20)
    datapath-regenerate OK Endpoint regeneration successful  3m10.014787702s ago (x2)
    policy map sync     OK sync-policymap-1704               3m6.632632525s ago  (x5)
   cilium-endpoint-2366
    cep-k8s-sync        OK sync-to-k8s-ciliumendpoint (2366) 3m10.023392989s ago (x20)
    datapath-regenerate OK Endpoint regeneration successful  3m10.021365103s ago (x1)
...

Follow up work:

  • Add metrics

@maintainer-s-little-helper
Copy link

Commit a25c6155eb0e458037fdda5b56feb13b0074d5a1 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 15, 2023
@tommyp1ckles tommyp1ckles changed the title Structured v3 Structured Health Reporter + EndpointManager Health Aug 15, 2023
@tommyp1ckles tommyp1ckles changed the title Structured Health Reporter + EndpointManager Health [Draf]t Structured Health Reporter + EndpointManager Health Aug 16, 2023
@tommyp1ckles tommyp1ckles changed the title [Draf]t Structured Health Reporter + EndpointManager Health [Draft] Structured Health Reporter + EndpointManager Health Aug 17, 2023
@maintainer-s-little-helper
Copy link

Commits 375cb8deb6addd67141bc098637c3f4a8b7bc188, f87153e626257aa6376f7e4964268a6a480d748d do not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper
Copy link

Commits 375cb8deb6addd67141bc098637c3f4a8b7bc188, 0ec151f04edde3f81ae7b1d82b7d2c4f093185f1 do not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 22, 2023
@tommyp1ckles tommyp1ckles changed the title [Draft] Structured Health Reporter + EndpointManager Health [Draft] Structured Health Reporter + EndpointManager Modular Health Checks Aug 22, 2023
@tommyp1ckles tommyp1ckles force-pushed the structured-v3 branch 8 times, most recently from f187a54 to c1f1d5f Compare August 23, 2023 16:42
@tommyp1ckles tommyp1ckles marked this pull request as ready for review August 23, 2023 16:45
@tommyp1ckles tommyp1ckles requested review from a team as code owners August 23, 2023 16:45
@tommyp1ckles
Copy link
Contributor Author

/test

1 similar comment
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles
Copy link
Contributor Author

/test

Currently using lifecycle in cell is not possible as that would introduce a cyclic dependency.
This moves the type definition of HookContext to a separate package, as
well as adds separate Lifecycle/HookInterface definitions that can be
imported by pkg/hive/cell.

Finally, pkg/hive now injects a wrapped hive.Lifecycle that converts pkg/hive/cell/lifecycle hooks.
Thus allowing pkg/cell to depend on lifecycle hooks.

This is done to avoid having to refactor the lifecycle types in pkg/hive
which would result in having to fix imports across the codebase.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles removed the request for review from a team October 17, 2023 22:50
@tommyp1ckles
Copy link
Contributor Author

Removing @cilium/tophat as that was seemingly added due to accidentally committing binary file.

Structured health reporter initializes a context.Context object with a
module scoped cell.HealthReporter.

This can be then used to propegate a structured health reporter tree,
where leaves represent active health reporters and intermediary nodes
provide a common scope for all sub reporters.

In subsequent commits, this will be used to provide modular health
status reporting for more complex modules such as endpoint manager where
many "sub-components" must also be healthy in order for the module to
report as healthy.

Currently, this is an opt-in layer built on top of the existing
cell.HealthReporter interface, by layering functionality on the existing
health reporter provided by the cell.Module.

Currently the context is initialized with the structured tree reporter at the module level.
To avoid the risk of deadlocks, the entire reporter tree uses one lock
which is initialized when calling WithStructuredReporter. This is shared
among all subreporters.

 Example:

 1. Init root scope (note: this is provided to modules automatically).
 root := rootScope(ctx, hr)

	root

 2. Create endpoint-manager subscope, and reporter under that scope (with ok!)
 endpointManagerScope := GetSubScope(ctx, root, "endpoint-manager")
 GetHealthReporter(endpointManagerScope, "endpoint-000").OK("it works!")

	 root(OK)
		└── scope(endpoint-manager, OK)
			└── reporter(endpoint-000, OK)

 3. Create another reporter under that scope with degraded
 GetHealthReporter(endpointManagerScope, "endpoint-000").Degraded("oh no!")

	 root(Degraded)
		└── scope(endpoint-manager, Degraded)
			└── reporter(endpoint-000, OK)
			└── reporter(endpoint-000, Degraded)

 4. Close the endpoint-manager scope
 endpointManagerScope.Stopped("no more status to report!")

	root(OK) // status has been reported, but we no longer have any degraded status
		 // default to ok status.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
When passed, controller runs will automatically update the health
reporter.

Upon failure, a passed reporter will be degraded, otherwise it will be
marked as "ok" (i.e. healthy).

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Adds modular health checking to endpointmanager by using structured
reporter to health check the following reconciliation tasks:

* K8s endpoint sync
* Endpoint regeneration
* Endpoint policy map sync

This means that the endpointmanager will be reported as healthy if all
endpoints have their ep datapath regenerated, policy map in sync and cep
sync controller has reconciled successfully.

This will make it easier to understand the state of the
endpointmanager/endpoint components of Cilium - providing a way to
determine if there are any health problems with endpoints.

Example output:
```
  endpoint-manager   OK                  4s

  cilium-endpoint-1314
   datapath-regenerate OK Endpoint regeneration successful
   sync-policymap-1314 OK sync-policymap-1314
  cilium-endpoint-56
   sync-to-k8s-ciliumendpoint OK sync-to-k8s-ciliumendpoint (56)
   datapath-regenerate        OK Endpoint regeneration successful
   sync-policymap-56          OK sync-policymap-56
```

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Health reports will often be multiple lines, this improves formatting in
such case.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@tommyp1ckles
Copy link
Contributor Author

/test

2 similar comments
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 18, 2023
@joamaki joamaki merged commit b7d4aaf into cilium:main Oct 18, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 18, 2023
tommyp1ckles added a commit to tommyp1ckles/cilium that referenced this pull request Nov 14, 2023
Change in cilium#27522 accidentally
removed the first line in a code comment [1] explaining the significance of
failure to get a mutex lock on an endpoint.

This undoes that change and fixes the comment.

[1] cilium#27522 (comment)

Reported-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
tklauser pushed a commit that referenced this pull request Nov 16, 2023
Change in #27522 accidentally
removed the first line in a code comment [1] explaining the significance of
failure to get a mutex lock on an endpoint.

This undoes that change and fixes the comment.

[1] #27522 (comment)

Reported-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
pjablonski123 pushed a commit to pjablonski123/cilium that referenced this pull request Dec 15, 2023
Change in cilium#27522 accidentally
removed the first line in a code comment [1] explaining the significance of
failure to get a mutex lock on an endpoint.

This undoes that change and fixes the comment.

[1] cilium#27522 (comment)

Reported-by: Tobias Klauser <tobias@cilium.io>
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@christarazi christarazi added the area/daemon Impacts operation of the Cilium daemon. label Mar 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants