Conversation

@laniehei laniehei commented May 3, 2025

What changed?

-> Part I (History service):
@dnr and others recently implemented a health check to speed up the history service rollout: the host is explicitly marked healthy once the history shards have stabilized after startup, instead of waiting an arbitrary time period.

In this PR, I've reused this check in the history service's DeepHealthCheck to ensure that the service has explicitly marked the host as healthy. This avoids a false positive: a host that is still starting up isn't receiving error metrics yet, so it isn't accurate to report it as serving while the history shards are still shuffling. I've also added a new "starting" health status, distinct from the unhealthy host status, so it can be used upstream and help with error propagation and debugging.

(Footnote) Further background on the host serving status, for documentation and general interest:
The status is set after the host has both joined ringpop and its shards have stabilized. It reuses a gRPC history health check that was previously implemented but unused.
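
For illustration, a minimal sketch of how the handler can consult that gRPC health check inside DeepHealthCheck. Field and helper names are assumed, and the sketch uses HEALTH_STATE_DECLINED_SERVING, the name the new state ended up with during review; the merged code may differ.

package history

import (
	"context"

	enumsspb "go.temporal.io/server/api/enums/v1"
	"go.temporal.io/server/api/historyservice/v1"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// Sketch: the Handler keeps a reference to the shared gRPC health server that
// startup code flips to SERVING once the host has joined ringpop and its shards
// have stabilized.
func (h *Handler) DeepHealthCheck(
	ctx context.Context,
	_ *historyservice.DeepHealthCheckRequest,
) (*historyservice.DeepHealthCheckResponse, error) {
	// The empty service name queries the overall server status.
	status, err := h.healthServer.Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil || status.Status != healthpb.HealthCheckResponse_SERVING {
		// Still starting up: report a distinct state rather than NOT_SERVING so the
		// frontend can tell "not ready yet" apart from "actively failing".
		return &historyservice.DeepHealthCheckResponse{State: enumsspb.HEALTH_STATE_DECLINED_SERVING}, nil
	}
	// The existing error-metric checks only run once the host is actually serving.
	return &historyservice.DeepHealthCheckResponse{State: enumsspb.HEALTH_STATE_SERVING}, nil
}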

-> Part II (Frontend service, aggregator of history.DeepHealthCheck):
In the frontend service, I've added a default of 5%, based on guidance from @00Sim that this is already quite a high error percentage once rollouts are accounted for. This percentage is added on top of the NOT_SERVING count so we also catch a combined outage where some pods are completely down while others are crashlooping.
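
To make the aggregation concrete, here is a simplified sketch with assumed names; the real logic lives in service/frontend/health_check.go and additionally applies a minimum-host floor discussed later in the review.

package frontend

import enumsspb "go.temporal.io/server/api/enums/v1"

// aggregateHistoryHealth is an illustration only: each history host reports a
// single state, and the frontend compares the resulting proportions against the
// configured thresholds (e.g. 0.5 for failed hosts, 0.05 for hosts still starting).
func aggregateHistoryHealth(states []enumsspb.HealthState, failureThreshold, declinedThreshold float64) enumsspb.HealthState {
	var failed, declined float64
	for _, s := range states {
		switch s {
		case enumsspb.HEALTH_STATE_NOT_SERVING:
			failed++
		case enumsspb.HEALTH_STATE_DECLINED_SERVING:
			declined++
		}
	}
	total := float64(len(states))
	if total == 0 {
		return enumsspb.HEALTH_STATE_SERVING
	}
	// Hosts that declined to serve (still starting) count toward the failure budget,
	// so "some pods completely down, the rest crashlooping" still reads as an outage.
	if (failed+declined)/total > failureThreshold {
		return enumsspb.HEALTH_STATE_NOT_SERVING
	}
	// Separately flag when too many hosts are stuck starting, e.g. a bad rollout.
	if declined/total > declinedThreshold {
		return enumsspb.HEALTH_STATE_NOT_SERVING
	}
	return enumsspb.HEALTH_STATE_SERVING
}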

Why?

This PR is a first step toward improving failure detection in the case of crashloops.

It's a small improvement in the broader effort to detect CrashLoopBackOff: pods are not reported as healthy until the service itself has marked them as such during startup. If pods never finish "starting up", that's a good indication that something is wrong.

How did you test it?

We can test during rollout by ensuring the failure detection is working as expected.

Potential risks

If the existing check never marks pods as healthy, we could break the failure-autodetection code. This is not directly workflow (customer) impacting.

Documentation

N/A

Is hotfix candidate?

No

@laniehei laniehei requested review from dnr, yux0 and Copilot May 3, 2025 05:09
@laniehei laniehei requested a review from a team as a code owner May 3, 2025 05:09
@Copilot Copilot AI left a comment

Pull Request Overview

This PR refines the history pod health check by ensuring a pod is reported healthy only after the service has explicitly marked it as serving. Key changes include using the gRPC health check in the DeepHealthCheck function, injecting a healthServer into the handler via fx, and extending the HealthState enum with a new HEALTH_STATE_STARTING state.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

  • service/history/handler.go: Integrates the gRPC health check and updates DeepHealthCheck logic.
  • service/history/fx.go: Injects the healthServer into the handler.
  • proto/internal/temporal/server/api/enums/v1/cluster.proto: Adds the new HEALTH_STATE_STARTING enum value.
  • api/enums/v1/cluster.pb.go: Regenerates the enum definitions to include the new state.
  • api/enums/v1/cluster.go-helpers.pb.go: Updates the helper mappings for the new enum state.
Comments suppressed due to low confidence (1)

service/history/handler.go:228

  • Verify that downstream consumers of the DeepHealthCheck response correctly handle the new HEALTH_STATE_STARTING state, ensuring that this change aligns with the overall health check design.
if status.Status != healthpb.HealthCheckResponse_SERVING {
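
As a rough illustration of the fx wiring described in the review overview above (abbreviated, with assumed names; the actual fx.go carries many more dependencies):

package history

import (
	"go.uber.org/fx"
	"google.golang.org/grpc/health"
)

// HandlerParams shows how the shared *health.Server can be injected so the same
// object that startup code marks SERVING is the one DeepHealthCheck reads.
type HandlerParams struct {
	fx.In

	HealthServer *health.Server
	// ... other existing handler dependencies elided ...
}

func NewHandler(p HandlerParams) *Handler {
	return &Handler{healthServer: p.HealthServer}
}

var Module = fx.Options(
	fx.Provide(health.NewServer), // one health server per history service instance
	fx.Provide(NewHandler),
)

// Once shards have stabilized, startup code can flip the status with something like:
//   healthServer.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)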

@laniehei laniehei requested a review from meiliang86 May 5, 2025 23:17
HistoryHostStartingPercentage = NewGlobalFloatSetting(
"frontend.historyHostStartingPercentage",
0.05,
`HistoryHostStartingPercentage is the percentage of hosts that are not yet ready to serve traffic`,
Contributor

Maybe say a little more here about what happens when the percentage is above or below this threshold?

Member Author

I think this is up to the consumer of the float setting, not the float setting itself.

Contributor

But the setting has a single meaning and a specific effect.. the docstring should describe those. And the name should too, ideally. There shouldn't be multiple consumers with different semantics for the same setting.

Anyway, 0.05 isn't "the percentage of hosts that are not yet ready to serve traffic"; the actual percentage of hosts not ready to serve traffic is usually 0%, sometimes 10%, sometimes 50%, etc. The setting is a threshold at which some behavior changes based on that value.

Member Author

I see your point. I've changed the dynamic config documentation and the naming to better reflect its intent.

@@ -761,6 +761,11 @@ This config is EXPERIMENTAL and may be changed or removed in a later release.`,
0.5,
`HistoryHostErrorPercentage is the percentage of hosts that are unhealthy`,
)
HistoryHostStartingPercentage = NewGlobalFloatSetting(
"frontend.historyHostStartingPercentage",
0.05,
Contributor

nitpicky, but: "percentage" implies a number from 0 to 100. A fraction like 0.05 should be a "fraction".

Member Author

Technically the correct word would be "proportion", according to Math Stack Exchange 😅 It looks like we use this notation frequently across the dynamic config. Maybe I'm wrong, but when I see a decimal where the unit is "percentage", I assume this notation.

Contributor

I said it was nitpicky :)

"proportion" is also great

Member Author

changed!

}

failedHostCountPercentage := failedHostCount / float64(len(hosts))
if failedHostCountPercentage+startingHostCountPercentage > h.hostFailurePercentage() {
Contributor

I'm trying to think if it makes sense to include "starting" in with "failed" here... I think so but not 100% clear. Did you have a reason or example in mind when deciding that?

Member Author

My reasoning was that, in my mental model, I'm defining the host failure percentage as "this many hosts must be online and ready to receive traffic, or we're in an outage". It's up for debate whether that's too strict, of course, but it seems like we're missing a corner case if we're in the middle of a rollout (for example) and also have a number of hosts that are offline. For instance, if 10% of hosts are down and another 15% are still starting mid-rollout, neither group alone trips a 20% threshold, yet a quarter of the fleet isn't taking traffic. Regardless of the reason, they're still not ready.

Contributor

Makes sense to me

@laniehei laniehei requested a review from dnr May 6, 2025 16:35
@@ -38,4 +38,5 @@ enum HealthState {
HEALTH_STATE_UNSPECIFIED = 0;
HEALTH_STATE_SERVING = 1;
HEALTH_STATE_NOT_SERVING = 2;
HEALTH_STATE_STARTING = 3;
Contributor

not sure if the STARTING state makes sense, or should we just use UNKNOWN?

Member Author

I could change this to NOT_READY? I suppose this could also be flipped "off" by the service if the host/pod is no longer healthy, re: our conversation earlier that we don't just set this on startup.

Member Author

Changed!

@laniehei laniehei force-pushed the lanie/fault-detection branch from a7c6f58 to b6edb2a on May 7, 2025 20:35
@laniehei laniehei changed the title from "Report a history pod as healthy only if service has marked it" to "Add internal pod health check to DeepHealthCheck" May 7, 2025
@laniehei laniehei requested review from yux0 and Copilot May 7, 2025 21:46
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces an internal pod health check for the history and frontend services to improve failure detection when pods are crashlooping by leveraging an upstream gRPC health check and adjusting the error thresholds.

  • Integrates a gRPC-based health check in the history service's DeepHealthCheck.
  • Adds new dynamic configuration and logic in the frontend service to account for pods that mark themselves as not ready (DECLINED_SERVING).
  • Updates proto definitions and dynamic config constants to support the new HEALTH_STATE_DECLINED_SERVING.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

  • service/history/handler.go: Adds the gRPC health check and corresponding healthServer field, integrating with DeepHealthCheck.
  • service/history/fx.go: Binds the new HealthServer to the Handler.
  • service/frontend/service.go: Introduces dynamic config for the host self error proportion and passes it to the frontend service.
  • service/frontend/health_check.go: Updates health check logic to account for the DECLINED_SERVING state and calculates thresholds.
  • service/frontend/health_check_test.go: Adds tests for the new health states and proportion calculations.
  • service/frontend/admin_handler.go: Passes the new configuration to the admin handler for health checks.
  • proto/internal/temporal/server/api/enums/v1/cluster.proto, api/enums/v1/cluster.pb.go, api/enums/v1/cluster.go-helpers.pb.go: Updates proto definitions to include the new HEALTH_STATE_DECLINED_SERVING.
  • common/dynamicconfig/constants.go: Adds the new dynamic config setting with an updated description for the history host self error proportion.

Comment on lines 112 to 117
// Make sure that at lease 2 hosts must be not ready to trigger this check.
proportionOfDeclinedServiceHosts := getProportionOfDeclinedServiceHosts(hostDeclinedServingCount/float64(len(hosts)), len(hosts))

hostDeclinedServingProportion := hostDeclinedServingCount / float64(len(hosts))
if hostDeclinedServingProportion > proportionOfDeclinedServiceHosts {
h.logger.Warn("health check exceeded host declined serving proportion threshold", tag.NewFloat64("host declined serving proportion threshold", proportionOfDeclinedServiceHosts))
Copilot AI May 7, 2025

The local variable 'hostDeclinedServingProportion' declared here shadows the configured threshold value from the dynamic config, leading to a comparison against a computed measurement rather than a fixed configured threshold. Consider renaming this local variable to 'measuredDeclinedProportion' and using h.hostDeclinedServingProportion() to compute the intended threshold via getProportionOfDeclinedServiceHosts.

Suggested change
// Make sure that at lease 2 hosts must be not ready to trigger this check.
proportionOfDeclinedServiceHosts := getProportionOfDeclinedServiceHosts(hostDeclinedServingCount/float64(len(hosts)), len(hosts))
hostDeclinedServingProportion := hostDeclinedServingCount / float64(len(hosts))
if hostDeclinedServingProportion > proportionOfDeclinedServiceHosts {
h.logger.Warn("health check exceeded host declined serving proportion threshold", tag.NewFloat64("host declined serving proportion threshold", proportionOfDeclinedServiceHosts))
// Make sure that at least 2 hosts must be not ready to trigger this check.
measuredDeclinedProportion := hostDeclinedServingCount / float64(len(hosts))
if measuredDeclinedProportion > h.hostDeclinedServingProportion() {
h.logger.Warn("health check exceeded host declined serving proportion threshold", tag.NewFloat64("host declined serving proportion threshold", h.hostDeclinedServingProportion()))

@laniehei laniehei mentioned this pull request May 7, 2025
@laniehei laniehei requested a review from yux0 May 7, 2025 23:54
@@ -38,4 +38,5 @@ enum HealthState {
HEALTH_STATE_UNSPECIFIED = 0;
HEALTH_STATE_SERVING = 1;
HEALTH_STATE_NOT_SERVING = 2;
HEALTH_STATE_DECLINED_SERVING = 3;
Contributor

nit: add some comments on the states?

Member Author

Added!

}

failedHostCountProportion := failedHostCount / float64(len(hosts))
if failedHostCountProportion+hostDeclinedServingProportion > h.hostFailurePercentage() {
Contributor

I am a little worried about false positives:

  1. double counting in the failure proportion
  2. the minimum proportion

Member Author
@laniehei laniehei May 8, 2025

Double counting in the failure proportion can't currently happen, because the health check returns only one value per host, and that single value is what gets counted. Am I misunderstanding what you mean? Each host can only have one status.

To the second point, I think a host that isn't serving traffic at all should factor into our hostFailurePercentage. That would catch the earlier issue where we saw some errors but didn't detect the problem because most pods were crashlooping. If some hosts are failing, the remaining hosts have to pick up the slack, so it's effectively "failing". WDYT?

return enumsspb.HEALTH_STATE_NOT_SERVING, nil
}

return enumsspb.HEALTH_STATE_SERVING, nil
}

func ensureMinimumProportionOfHosts(proportionOfDeclinedServingHosts float64, totalHosts int) float64 {
minimumHostsFailed := 2.0 // We want to ensure that at least 2 fail before we notify the upstream.
Contributor

nit: const
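
For context, one way the floor described here could be computed, assuming the helper simply raises the configured proportion so that fewer than two declined-serving hosts can never trip the check (a sketch only; the merged code may differ):

// ensureMinimumProportionOfHosts returns the effective threshold to compare the
// measured declined-serving proportion against: the configured value, but never
// lower than "two hosts' worth" of the current membership, so a single slow-starting
// pod in a small cluster cannot flip the aggregate health check on its own.
func ensureMinimumProportionOfHosts(proportionOfDeclinedServingHosts float64, totalHosts int) float64 {
	const minimumHostsFailed = 2.0 // at least 2 hosts must decline serving before we notify upstream
	if totalHosts <= 0 {
		return proportionOfDeclinedServingHosts
	}
	minimumProportion := minimumHostsFailed / float64(totalHosts)
	if proportionOfDeclinedServingHosts < minimumProportion {
		return minimumProportion
	}
	return proportionOfDeclinedServingHosts
}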

@laniehei laniehei enabled auto-merge (squash) May 8, 2025 19:08
@laniehei laniehei merged commit 5211b17 into main May 8, 2025
53 checks passed
@laniehei laniehei deleted the lanie/fault-detection branch May 8, 2025 19:28
josesa added a commit to josesa/temporal that referenced this pull request May 12, 2025
* main: (22 commits)
  Add host health metrics gauge (temporalio#7728)
  add rule expiration check (temporalio#7749)
  Add activity options to the pending activity info (temporalio#7727)
  Enable DLQ V2 for replication (temporalio#7746)
  chore: be smarter about when to use Stringer vs String (temporalio#7743)
  versioning entity workflows: enabling auto-restart pt1 (temporalio#7715)
  Refactor code generators (temporalio#7734)
  allow passive to generate replication tasks (temporalio#7713)
  Validate links in completion callbacks (temporalio#7726)
  CHASM: Engine Update/ReadComponent implementation (temporalio#7696)
  Enable transition history in dev env and tests (temporalio#7737)
  chore: Add Stringer tags (temporalio#7738)
  Add internal pod health check to DeepHealthCheck (temporalio#7709)
  Rename internal CHASM task processing interface (temporalio#7730)
  [Frontend] Log slow gRPC requests (temporalio#7718)
  Remove cap for dynamic config callback pool (temporalio#7723)
  Refactor updateworkflowoptions package (temporalio#7725)
  Remove a bunch of redundant utf-8 validation (temporalio#7720)
  [CHASM] Pure task processing - GetPureTasks, ExecutePureTasks (temporalio#7701)
  Send ActivityReset flag to the worker in heartbeat response (temporalio#7677)
  ...