Conversation

@laniehei laniehei commented May 3, 2025

What changed?

-> Part I (History service):
@dnr and others recently implemented a health check to speed up the history service rollout: the host is explicitly marked healthy once the history shards have stabilized after startup, instead of waiting an arbitrary time period.

In this PR, I've reused this check in the history service's DeepHealthCheck to ensure that the service has explicitly marked the host as healthy. This avoids a false positive: a host that is still starting up isn't receiving error metrics yet, so it isn't accurate to report it as serving while the history shards are still shuffling. I've also added a new "starting" health status, distinct from the unhealthy host status, so it can be used upstream and help with error propagation and debugging.

(Footnote) Further background on the host serving status, for documentation and general interest:
The status is set after the host has both joined ringpop and its shards have stabilized. It reuses a gRPC history health check that was previously implemented but unused.
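
For illustration, a minimal sketch of how the handler can consult that gRPC health check inside DeepHealthCheck. Field and helper names are assumed, and the sketch uses HEALTH_STATE_DECLINED_SERVING, the name the new state ended up with during review; the merged code may differ.

package history

import (
	"context"

	enumsspb "go.temporal.io/server/api/enums/v1"
	"go.temporal.io/server/api/historyservice/v1"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// Sketch: the Handler keeps a reference to the shared gRPC health server that
// startup code flips to SERVING once the host has joined ringpop and its shards
// have stabilized.
func (h *Handler) DeepHealthCheck(
	ctx context.Context,
	_ *historyservice.DeepHealthCheckRequest,
) (*historyservice.DeepHealthCheckResponse, error) {
	// The empty service name queries the overall server status.
	status, err := h.healthServer.Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil || status.Status != healthpb.HealthCheckResponse_SERVING {
		// Still starting up: report a distinct state rather than NOT_SERVING so the
		// frontend can tell "not ready yet" apart from "actively failing".
		return &historyservice.DeepHealthCheckResponse{State: enumsspb.HEALTH_STATE_DECLINED_SERVING}, nil
	}
	// The existing error-metric checks only run once the host is actually serving.
	return &historyservice.DeepHealthCheckResponse{State: enumsspb.HEALTH_STATE_SERVING}, nil
}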

-> Part II (Frontend service, aggregator of history.DeepHealthCheck):
In the frontend service, I've added a default of 5%, based on guidance from @00Sim that this is already quite a high error percentage once rollouts are accounted for. This percentage is added on top of the NOT_SERVING count so we also catch a combined outage where some pods are completely down while others are crashlooping.
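
To make the aggregation concrete, here is a simplified sketch with assumed names; the real logic lives in service/frontend/health_check.go and additionally applies a minimum-host floor discussed later in the review.

package frontend

import enumsspb "go.temporal.io/server/api/enums/v1"

// aggregateHistoryHealth is an illustration only: each history host reports a
// single state, and the frontend compares the resulting proportions against the
// configured thresholds (e.g. 0.5 for failed hosts, 0.05 for hosts still starting).
func aggregateHistoryHealth(states []enumsspb.HealthState, failureThreshold, declinedThreshold float64) enumsspb.HealthState {
	var failed, declined float64
	for _, s := range states {
		switch s {
		case enumsspb.HEALTH_STATE_NOT_SERVING:
			failed++
		case enumsspb.HEALTH_STATE_DECLINED_SERVING:
			declined++
		}
	}
	total := float64(len(states))
	if total == 0 {
		return enumsspb.HEALTH_STATE_SERVING
	}
	// Hosts that declined to serve (still starting) count toward the failure budget,
	// so "some pods completely down, the rest crashlooping" still reads as an outage.
	if (failed+declined)/total > failureThreshold {
		return enumsspb.HEALTH_STATE_NOT_SERVING
	}
	// Separately flag when too many hosts are stuck starting, e.g. a bad rollout.
	if declined/total > declinedThreshold {
		return enumsspb.HEALTH_STATE_NOT_SERVING
	}
	return enumsspb.HEALTH_STATE_SERVING
}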

Why?

This PR is a first step toward improving failure detection in the case of crashloops.

It's a small improvement in the broader effort to detect CrashLoopBackOff: pods are not reported as healthy until the service itself has marked them as such during startup. If pods never finish "starting up", that's a good indication that something is wrong.

How did you test it?

We can test during rollout by ensuring the failure detection is working as expected.

Potential risks

If the existing check never marks pods as healthy, we could break the failure-autodetection code. This is not directly workflow (customer) impacting.

Documentation

N/A

Is hotfix candidate?

No

@laniehei laniehei requested review from dnr, yux0 and Copilot May 3, 2025 05:09
@laniehei laniehei requested a review from a team as a code owner May 3, 2025 05:09
@Copilot Copilot AI left a comment

Pull Request Overview

This PR refines the history pod health check by ensuring a pod is reported healthy only after the service has explicitly marked it as serving. Key changes include using the gRPC health check in the DeepHealthCheck function, injecting a healthServer into the handler via fx, and extending the HealthState enum with a new HEALTH_STATE_STARTING state.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

  • service/history/handler.go: Integrates the gRPC health check and updates DeepHealthCheck logic.
  • service/history/fx.go: Injects the healthServer into the handler.
  • proto/internal/temporal/server/api/enums/v1/cluster.proto: Adds the new HEALTH_STATE_STARTING enum value.
  • api/enums/v1/cluster.pb.go: Regenerates the enum definitions to include the new state.
  • api/enums/v1/cluster.go-helpers.pb.go: Updates the helper mappings for the new enum state.
Comments suppressed due to low confidence (1)

service/history/handler.go:228

  • Verify that downstream consumers of the DeepHealthCheck response correctly handle the new HEALTH_STATE_STARTING state, ensuring that this change aligns with the overall health check design.
if status.Status != healthpb.HealthCheckResponse_SERVING {
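
As a rough illustration of the fx wiring described in the review overview above (abbreviated, with assumed names; the actual fx.go carries many more dependencies):

package history

import (
	"go.uber.org/fx"
	"google.golang.org/grpc/health"
)

// HandlerParams shows how the shared *health.Server can be injected so the same
// object that startup code marks SERVING is the one DeepHealthCheck reads.
type HandlerParams struct {
	fx.In

	HealthServer *health.Server
	// ... other existing handler dependencies elided ...
}

func NewHandler(p HandlerParams) *Handler {
	return &Handler{healthServer: p.HealthServer}
}

var Module = fx.Options(
	fx.Provide(health.NewServer), // one health server per history service instance
	fx.Provide(NewHandler),
)

// Once shards have stabilized, startup code can flip the status with something like:
//   healthServer.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)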

@laniehei laniehei requested a review from meiliang86 May 5, 2025 23:17
HistoryHostStartingPercentage = NewGlobalFloatSetting(
"frontend.historyHostStartingPercentage",
0.05,
`HistoryHostStartingPercentage is the percentage of hosts that are not yet ready to serve traffic`,
Contributor

Maybe say a little more here about what happens when the percentage is above or below this threshold?

Member Author

I think this is up to the consumer of the float setting, not the float setting itself.

Contributor

But the setting has a single meaning and a specific effect.. the docstring should describe those. And the name should too, ideally. There shouldn't be multiple consumers with different semantics for the same setting.

Anyway, 0.05 isn't "the percentage of hosts that are not yet ready to serve traffic"; the actual percentage of hosts not ready to serve traffic is usually 0%, sometimes 10%, sometimes 50%, etc. The setting is a threshold at which some behavior changes based on that value.

Member Author

I see your point. I've changed the dynamic config documentation and the naming to better reflect its intent.

@@ -761,6 +761,11 @@ This config is EXPERIMENTAL and may be changed or removed in a later release.`,
0.5,
`HistoryHostErrorPercentage is the percentage of hosts that are unhealthy`,
)
HistoryHostStartingPercentage = NewGlobalFloatSetting(
"frontend.historyHostStartingPercentage",
0.05,
Contributor

nitpicky, but: "percentage" implies a number from 0 to 100. A fraction like 0.05 should be a "fraction".

Member Author

Technically the correct word would be "proportion", according to Math Stack Exchange 😅 It looks like we use this notation frequently across the dynamic config. Maybe I'm wrong, but when I see a decimal where the unit is "percentage", I assume this notation.

Contributor

I said it was nitpicky :)

"proportion" is also great

Member Author

changed!

}

failedHostCountPercentage := failedHostCount / float64(len(hosts))
if failedHostCountPercentage+startingHostCountPercentage > h.hostFailurePercentage() {
Contributor

I'm trying to think if it makes sense to include "starting" in with "failed" here... I think so but not 100% clear. Did you have a reason or example in mind when deciding that?

Member Author

My reasoning was that, in my mental model, I'm defining the host failure percentage as "this many hosts must be online and ready to receive traffic, or we're in an outage". It's up for debate whether that's too strict, of course, but it seems like we're missing a corner case if we're in the middle of a rollout (for example) and also have a number of hosts that are offline. For instance, if 10% of hosts are down and another 15% are still starting mid-rollout, neither group alone trips a 20% threshold, yet a quarter of the fleet isn't taking traffic. Regardless of the reason, they're still not ready.

Contributor

Makes sense to me

@laniehei laniehei requested a review from dnr May 6, 2025 16:35
@@ -38,4 +38,5 @@ enum HealthState {
HEALTH_STATE_UNSPECIFIED = 0;
HEALTH_STATE_SERVING = 1;
HEALTH_STATE_NOT_SERVING = 2;
HEALTH_STATE_STARTING = 3;
Contributor

not sure if the STARTING state makes sense, or should we just use UNKNOWN?

Member Author

I could change this to NOT_READY? I suppose this could also be flipped "off" by the service if the host/pod is no longer healthy, re: our conversation earlier that we don't just set this on startup.

Member Author

Changed!

@laniehei laniehei force-pushed the lanie/fault-detection branch from a7c6f58 to b6edb2a on May 7, 2025 20:35
@laniehei laniehei changed the title from "Report a history pod as healthy only if service has marked it" to "Add internal pod health check to DeepHealthCheck" May 7, 2025
@laniehei laniehei requested review from yux0 and Copilot May 7, 2025 21:46
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces an internal pod health check for the history and frontend services to improve failure detection when pods are crashlooping by leveraging an upstream gRPC health check and adjusting the error thresholds.

  • Integrates a gRPC-based health check in the history service's DeepHealthCheck.
  • Adds new dynamic configuration and logic in the frontend service to account for pods that mark themselves as not ready (DECLINED_SERVING).
  • Updates proto definitions and dynamic config constants to support the new HEALTH_STATE_DECLINED_SERVING.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

  • service/history/handler.go: Adds the gRPC health check and corresponding healthServer field, integrating with DeepHealthCheck.
  • service/history/fx.go: Binds the new HealthServer to the Handler.
  • service/frontend/service.go: Introduces dynamic config for the host self error proportion and passes it to the frontend service.
  • service/frontend/health_check.go: Updates health check logic to account for the DECLINED_SERVING state and calculates thresholds.
  • service/frontend/health_check_test.go: Adds tests for the new health states and proportion calculations.
  • service/frontend/admin_handler.go: Passes the new configuration to the admin handler for health checks.
  • proto/internal/temporal/server/api/enums/v1/cluster.proto, api/enums/v1/cluster.pb.go, api/enums/v1/cluster.go-helpers.pb.go: Updates proto definitions to include the new HEALTH_STATE_DECLINED_SERVING.
  • common/dynamicconfig/constants.go: Adds the new dynamic config setting with an updated description for the history host self error proportion.

Comment on lines 112 to 117
// Make sure that at lease 2 hosts must be not ready to trigger this check.
proportionOfDeclinedServiceHosts := getProportionOfDeclinedServiceHosts(hostDeclinedServingCount/float64(len(hosts)), len(hosts))

hostDeclinedServingProportion := hostDeclinedServingCount / float64(len(hosts))
if hostDeclinedServingProportion > proportionOfDeclinedServiceHosts {
h.logger.Warn("health check exceeded host declined serving proportion threshold", tag.NewFloat64("host declined serving proportion threshold", proportionOfDeclinedServiceHosts))
Copilot AI May 7, 2025

The local variable 'hostDeclinedServingProportion' declared here shadows the configured threshold value from the dynamic config, leading to a comparison against a computed measurement rather than a fixed configured threshold. Consider renaming this local variable to 'measuredDeclinedProportion' and using h.hostDeclinedServingProportion() to compute the intended threshold via getProportionOfDeclinedServiceHosts.

Suggested change
// Make sure that at lease 2 hosts must be not ready to trigger this check.
proportionOfDeclinedServiceHosts := getProportionOfDeclinedServiceHosts(hostDeclinedServingCount/float64(len(hosts)), len(hosts))
hostDeclinedServingProportion := hostDeclinedServingCount / float64(len(hosts))
if hostDeclinedServingProportion > proportionOfDeclinedServiceHosts {
h.logger.Warn("health check exceeded host declined serving proportion threshold", tag.NewFloat64("host declined serving proportion threshold", proportionOfDeclinedServiceHosts))
// Make sure that at least 2 hosts must be not ready to trigger this check.
measuredDeclinedProportion := hostDeclinedServingCount / float64(len(hosts))
if measuredDeclinedProportion > h.hostDeclinedServingProportion() {
h.logger.Warn("health check exceeded host declined serving proportion threshold", tag.NewFloat64("host declined serving proportion threshold", h.hostDeclinedServingProportion()))

@laniehei laniehei mentioned this pull request May 7, 2025
@laniehei laniehei requested a review from yux0 May 7, 2025 23:54
@@ -38,4 +38,5 @@ enum HealthState {
HEALTH_STATE_UNSPECIFIED = 0;
HEALTH_STATE_SERVING = 1;
HEALTH_STATE_NOT_SERVING = 2;
HEALTH_STATE_DECLINED_SERVING = 3;
Contributor

nit: add some comments on the states?

Member Author

Added!

}

failedHostCountProportion := failedHostCount / float64(len(hosts))
if failedHostCountProportion+hostDeclinedServingProportion > h.hostFailurePercentage() {
Contributor

I am a little worried about false positives:

  1. double counting in the failure proportion
  2. the minimum proportion

Member Author
@laniehei laniehei May 8, 2025

Double counting in the failure proportion can't currently happen, because the health check returns only one value per host, and that single value is what gets counted. Am I misunderstanding what you mean? Each host can only have one status.

To the second point, I think a host that isn't serving traffic at all should factor into our hostFailurePercentage. That would catch the earlier issue where we saw some errors but didn't detect the problem because most pods were crashlooping. If some hosts are failing, the remaining hosts have to pick up the slack, so it's effectively "failing". WDYT?

return enumsspb.HEALTH_STATE_NOT_SERVING, nil
}

return enumsspb.HEALTH_STATE_SERVING, nil
}

func ensureMinimumProportionOfHosts(proportionOfDeclinedServingHosts float64, totalHosts int) float64 {
minimumHostsFailed := 2.0 // We want to ensure that at least 2 fail before we notify the upstream.
Contributor

nit: const
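
For context, one way the floor described here could be computed, assuming the helper simply raises the configured proportion so that fewer than two declined-serving hosts can never trip the check (a sketch only; the merged code may differ):

// ensureMinimumProportionOfHosts returns the effective threshold to compare the
// measured declined-serving proportion against: the configured value, but never
// lower than "two hosts' worth" of the current membership, so a single slow-starting
// pod in a small cluster cannot flip the aggregate health check on its own.
func ensureMinimumProportionOfHosts(proportionOfDeclinedServingHosts float64, totalHosts int) float64 {
	const minimumHostsFailed = 2.0 // at least 2 hosts must decline serving before we notify upstream
	if totalHosts <= 0 {
		return proportionOfDeclinedServingHosts
	}
	minimumProportion := minimumHostsFailed / float64(totalHosts)
	if proportionOfDeclinedServingHosts < minimumProportion {
		return minimumProportion
	}
	return proportionOfDeclinedServingHosts
}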

@laniehei laniehei enabled auto-merge (squash) May 8, 2025 19:08
@laniehei laniehei merged commit 5211b17 into main May 8, 2025
53 checks passed
@laniehei laniehei deleted the lanie/fault-detection branch May 8, 2025 19:28
josesa added a commit to josesa/temporal that referenced this pull request May 12, 2025
* main: (22 commits)
  Add host health metrics gauge (temporalio#7728)
  add rule expiration check (temporalio#7749)
  Add activity options to the pending activity info (temporalio#7727)
  Enable DLQ V2 for replication (temporalio#7746)
  chore: be smarter about when to use Stringer vs String (temporalio#7743)
  versioning entity workflows: enabling auto-restart pt1 (temporalio#7715)
  Refactor code generators (temporalio#7734)
  allow passive to generate replication tasks (temporalio#7713)
  Validate links in completion callbacks (temporalio#7726)
  CHASM: Engine Update/ReadComponent implementation (temporalio#7696)
  Enable transition history in dev env and tests (temporalio#7737)
  chore: Add Stringer tags (temporalio#7738)
  Add internal pod health check to DeepHealthCheck (temporalio#7709)
  Rename internal CHASM task processing interface (temporalio#7730)
  [Frontend] Log slow gRPC requests (temporalio#7718)
  Remove cap for dynamic config callback pool (temporalio#7723)
  Refactor updateworkflowoptions package (temporalio#7725)
  Remove a bunch of redundant utf-8 validation (temporalio#7720)
  [CHASM] Pure task processing - GetPureTasks, ExecutePureTasks (temporalio#7701)
  Send ActivityReset flag to the worker in heartbeat response (temporalio#7677)
  ...