Add host health metrics gauge #7728

laniehei · 2025-05-07T23:51:00Z

What changed?

Add a new metrics gauge to track host health

Why?

Visibility into which hosts are failing, and when.

How did you test it?

Potential risks

Adding Sean to the PR to make sure I didn't miss anything obvious with regard to how we want to display this.

laniehei · 2025-05-08T22:03:23Z

proto/internal/temporal/server/api/enums/v1/cluster.proto

-    HEALTH_STATE_NOT_SERVING = 2;
-    HEALTH_STATE_DECLINED_SERVING = 3;
+    // The host is unhealthy through external observation. 
+    HEALTH_STATE_NOT_SERVING = 2; 


This change somehow didn't make it into the previous PR.

yux0 · 2025-05-08T22:46:40Z

service/history/handler.go

@@ -202,15 +202,18 @@ func (h *Handler) DeepHealthCheck(
 		return nil, err
 	}
 	if status.Status != healthpb.HealthCheckResponse_SERVING {
+		metrics.HistoryHostHealthGauge.With(h.metricsHandler).Record(float64(enumsspb.HEALTH_STATE_DECLINED_SERVING))


Is this metric per host?

Yes! Because this is recorded on the handler, Prometheus scrapes the metric with per-cell and per-host information. We'll be able to graph these checks per cluster and per host.
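To illustrate the per-host behavior described above, here is a minimal sketch of a gauge keyed by host. It does not use Temporal's metrics package: `HostGauge`, `NewHostGauge`, and the host names are hypothetical stand-ins. Only the enum values 2 and 3 appear in this PR's diff; the values 0 and 1 are assumed for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// HealthState mirrors the numeric health enum from the diff. Only the values
// 2 and 3 appear in this PR; 0 and 1 are assumed for illustration.
type HealthState int

const (
	StateUnspecified     HealthState = 0 // assumed
	StateServing         HealthState = 1 // assumed
	StateNotServing      HealthState = 2
	StateDeclinedServing HealthState = 3
)

// HostGauge is a hypothetical stand-in for a metrics gauge labeled by host.
// A gauge keeps only the most recent value per label set.
type HostGauge struct {
	mu     sync.Mutex
	values map[string]float64
}

// NewHostGauge returns an empty gauge with no readings.
func NewHostGauge() *HostGauge {
	return &HostGauge{values: make(map[string]float64)}
}

// Record overwrites the current reading for host with the numeric state.
func (g *HostGauge) Record(host string, s HealthState) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[host] = float64(s)
}

// Value returns the last recorded reading for host, if any.
func (g *HostGauge) Value(host string) (float64, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	v, ok := g.values[host]
	return v, ok
}

func main() {
	g := NewHostGauge()
	g.Record("history-0", StateServing)
	g.Record("history-1", StateDeclinedServing)
	g.Record("history-0", StateNotServing) // a later reading replaces the earlier one

	for _, host := range []string{"history-0", "history-1"} {
		v, _ := g.Value(host)
		fmt.Printf("%s health_state=%v\n", host, v)
	}
}
```

Because each host keeps its own latest reading, a dashboard built on such a gauge can plot one series per host and show which host changed state and at what time.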

* main: (22 commits)
  Add host health metrics gauge (temporalio#7728)
  add rule expiration check (temporalio#7749)
  Add activity options to the pending activity info (temporalio#7727)
  Enable DLQ V2 for replication (temporalio#7746)
  chore: be smarter about when to use Stringer vs String (temporalio#7743)
  versioning entity workflows: enabling auto-restart pt1 (temporalio#7715)
  Refactor code generators (temporalio#7734)
  allow passive to generate replication tasks (temporalio#7713)
  Validate links in completion callbacks (temporalio#7726)
  CHASM: Engine Update/ReadComponent implementation (temporalio#7696)
  Enable transition history in dev env and tests (temporalio#7737)
  chore: Add Stringer tags (temporalio#7738)
  Add internal pod health check to DeepHealthCheck (temporalio#7709)
  Rename internal CHASM task processing interface (temporalio#7730)
  [Frontend] Log slow gRPC requests (temporalio#7718)
  Remove cap for dynamic config callback pool (temporalio#7723)
  Refactor updateworkflowoptions package (temporalio#7725)
  Remove a bunch of redundant utf-8 validation (temporalio#7720)
  [CHASM] Pure task processing - GetPureTasks, ExecutePureTasks (temporalio#7701)
  Send ActivityReset flag to the worker in heartbeat response (temporalio#7677)
  ...

laniehei added 12 commits May 2, 2025 21:45

Add health state starting to indicate service has not marked ready

7999f24

Add health server to history handler, check for health status activating

de8010b

Add check to frontend for host count

84f61eb

Add health state serving, unspecified to satisfy linter

cb26997

Fix test

1a89b51

Rename proto to clarify intent

b6edb2a

Revert

309bb34

Ensure that at least 2 hosts must not be ready to trigger failure of …

74dd836

…health check

Simplify health proportion

7e88c14

Fix bug, clarify const

3664bde

Fix test

3c4278f

Add host health metrics

b40be47

laniehei requested review from swgillespie and yux0 May 7, 2025 23:51

laniehei requested a review from a team as a code owner May 7, 2025 23:51

laniehei added the teams/cgs label May 7, 2025

laniehei added 2 commits May 8, 2025 11:48

Add documentation

c0bc24a

Merge fix

28b4ded

laniehei commented May 8, 2025

yux0 reviewed May 8, 2025

laniehei requested a review from yux0 May 8, 2025 23:44

Make proto

992f008

yux0 approved these changes May 12, 2025

laniehei merged commit bf35d1e into main May 12, 2025
53 checks passed

laniehei deleted the lanie/metric-fault-detection branch May 12, 2025 17:40
