Add internal pod health check to DeepHealthCheck #7709
Conversation
Pull Request Overview
This PR refines the history pod health check by ensuring a pod is reported healthy only after the service has explicitly marked it as serving. Key changes include using the gRPC health check in the DeepHealthCheck function, injecting a healthServer into the handler via fx, and extending the HealthState enum with a new HEALTH_STATE_STARTING state.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
service/history/handler.go | Integrates the gRPC health check and updates DeepHealthCheck logic. |
service/history/fx.go | Injects the healthServer into the handler. |
proto/internal/temporal/server/api/enums/v1/cluster.proto | Adds the new HEALTH_STATE_STARTING enum value. |
api/enums/v1/cluster.pb.go | Regenerates the enum definitions to include the new state. |
api/enums/v1/cluster.go-helpers.pb.go | Updates the helper mappings for the new enum state. |
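As a rough illustration of the fx wiring described in the overview, here is a minimal sketch of how a shared gRPC health server could be provided to the history handler. The constructor names and the `Handler` type are illustrative stand-ins, not the code in this PR.

```go
package main

import (
	"go.uber.org/fx"
	"google.golang.org/grpc/health"
)

// Handler stands in for the history service handler; in the real code the
// health server sits alongside the handler's other dependencies.
type Handler struct {
	healthServer *health.Server
}

// NewHealthServer constructs the shared gRPC health server. The same instance
// would be registered with the gRPC server and handed to the handler, so a
// deep health check can read the status the service sets on itself.
func NewHealthServer() *health.Server {
	return health.NewServer()
}

// NewHandler receives the health server through fx dependency injection.
func NewHandler(hs *health.Server) *Handler {
	return &Handler{healthServer: hs}
}

func main() {
	fx.New(
		fx.Provide(NewHealthServer, NewHandler),
		fx.Invoke(func(h *Handler) {}), // force construction of the handler
	)
}
```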
Comments suppressed due to low confidence (1)
service/history/handler.go:228
- Verify that downstream consumers of the DeepHealthCheck response correctly handle the new HEALTH_STATE_STARTING state, ensuring that this change aligns with the overall health check design.
if status.Status != healthpb.HealthCheckResponse_SERVING {
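Building on the `Handler` stand-in sketched above, the gate around this check might look roughly like the following. The `enumsspb` constants mirror the proto change in this PR (using the final DECLINED_SERVING name); the function body and the empty service name are assumptions, not the actual handler code.

```go
import (
	"context"

	enumsspb "go.temporal.io/server/api/enums/v1"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// deepHealthCheck gates the health report on the in-process gRPC health
// server: until the service marks itself SERVING, the host declines to serve
// rather than reporting a hard failure. The real DeepHealthCheck keeps its
// existing host and shard checks after this gate.
func (h *Handler) deepHealthCheck(ctx context.Context) (enumsspb.HealthState, error) {
	// An empty Service name queries the server-wide status; the actual code
	// may check a named service instead.
	status, err := h.healthServer.Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		return enumsspb.HEALTH_STATE_NOT_SERVING, err
	}
	if status.Status != healthpb.HealthCheckResponse_SERVING {
		return enumsspb.HEALTH_STATE_DECLINED_SERVING, nil
	}
	return enumsspb.HEALTH_STATE_SERVING, nil
}
```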
common/dynamicconfig/constants.go
Outdated
HistoryHostStartingPercentage = NewGlobalFloatSetting(
	"frontend.historyHostStartingPercentage",
	0.05,
	`HistoryHostStartingPercentage is the percentage of hosts that are not yet ready to serve traffic`,
Maybe say a little more here about what happens when the percentage is above or below this threshold?
I think this is up to the consumer of the float setting, not the float setting itself.
But the setting has a single meaning and a specific effect; the docstring should describe those. And the name should too, ideally. There shouldn't be multiple consumers with different semantics for the same setting.
Anyway, 0.05 isn't "the percentage of hosts that are not yet ready to serve traffic": the percentage of hosts not ready to serve traffic is usually 0%, sometimes 10%, sometimes 50%, etc. The setting is a threshold at which some behavior changes based on that value.
I see your point. I've changed the dynamic config documentation and the naming to better reflect its intent.
@@ -761,6 +761,11 @@ This config is EXPERIMENTAL and may be changed or removed in a later release.`,
	0.5,
	`HistoryHostErrorPercentage is the percentage of hosts that are unhealthy`,
)
HistoryHostStartingPercentage = NewGlobalFloatSetting(
	"frontend.historyHostStartingPercentage",
	0.05,
nitpicky, but: "percentage" implies a number from 0 to 100. A fraction like 0.05 should be a "fraction".
Technically the correct word would be "proportion", according to Math Stack Exchange 😅 It looks like we use this notation frequently across the dynamic config. Maybe I'm wrong, but when I see a decimal and the unit is percentage, I assume this notation.
I said it was nitpicky :)
"proportion" is also great
changed!
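For illustration, a renamed setting along the lines discussed in this thread might read as follows in common/dynamicconfig/constants.go. The identifier, key, and docstring here are hypothetical; the actual final name is the history host self error proportion setting listed in the file summary further down.

```go
// Hypothetical shape of the renamed setting: the value is a proportion
// (0.0-1.0), and the docstring states the threshold's effect rather than
// describing the measured value.
HistoryHostSelfErrorProportion = NewGlobalFloatSetting(
	"frontend.historyHostSelfErrorProportion",
	0.05,
	`HistoryHostSelfErrorProportion is the threshold for the proportion of history
hosts that report themselves as not ready to serve; when the measured proportion
exceeds this threshold, the frontend deep health check reports the cluster as
not serving`,
)
```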
service/frontend/health_check.go
Outdated
}

failedHostCountPercentage := failedHostCount / float64(len(hosts))
if failedHostCountPercentage+startingHostCountPercentage > h.hostFailurePercentage() {
I'm trying to think if it makes sense to include "starting" in with "failed" here... I think so but not 100% clear. Did you have a reason or example in mind when deciding that?
My reasoning was that, in my mental model, I'm defining host failure percentage as "this percentage must be online and ready to receive traffic to not be in outage". Up for debate whether this is too strict of course, but it seems like we're missing a corner case if we're in the middle of a rollout (for example) and also have a number of hosts that are offline. Regardless of reason, they're still not ready.
Makes sense to me
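To make the combined check concrete, here is a self-contained sketch of the logic being discussed: hosts that have declined to serve (for example, still starting) count toward the same failure threshold as hosts that failed the check outright. The function shape and the threshold value are illustrative only.

```go
package main

import "fmt"

// clusterDegraded returns true when the combined proportion of failed and
// declined-serving hosts exceeds the configured failure threshold. A host
// that is not ready for any reason still shifts load onto the remaining
// hosts, so both proportions are summed before comparing.
func clusterDegraded(failedHosts, declinedHosts, totalHosts int, failureThreshold float64) bool {
	if totalHosts == 0 {
		return false
	}
	failedProportion := float64(failedHosts) / float64(totalHosts)
	declinedProportion := float64(declinedHosts) / float64(totalHosts)
	return failedProportion+declinedProportion > failureThreshold
}

func main() {
	// 1 failed host plus 2 still-starting hosts out of 10, against a 0.25
	// threshold: 0.1 + 0.2 > 0.25, so the cluster is reported degraded.
	fmt.Println(clusterDegraded(1, 2, 10, 0.25)) // true
}
```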
@@ -38,4 +38,5 @@ enum HealthState {
  HEALTH_STATE_UNSPECIFIED = 0;
  HEALTH_STATE_SERVING = 1;
  HEALTH_STATE_NOT_SERVING = 2;
  HEALTH_STATE_STARTING = 3;
Not sure if STARTING makes sense here, or whether we should just use UNKNOWN?
I could change this to NOT_READY? I suppose this could also be flipped "off" by the service if the host/pod is no longer healthy, re: our conversation earlier that we don't just set this on startup.
Changed!
Force-pushed from a7c6f58 to b6edb2a
Pull Request Overview
This PR introduces an internal pod health check for the history and frontend services to improve failure detection when pods are crashlooping by leveraging an upstream gRPC health check and adjusting the error thresholds.
- Integrates a gRPC-based health check in the history service's DeepHealthCheck.
- Adds new dynamic configuration and logic in the frontend service to account for pods that mark themselves as not ready (DECLINED_SERVING).
- Updates proto definitions and dynamic config constants to support the new HEALTH_STATE_DECLINED_SERVING.
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
service/history/handler.go | Adds the gRPC health check and corresponding healthServer field, integrating with DeepHealthCheck. |
service/history/fx.go | Binds the new HealthServer to the Handler. |
service/frontend/service.go | Introduces dynamic config for host self error proportion and passes it to the frontend service. |
service/frontend/health_check.go | Updates health check logic to account for DECLINED_SERVING state and calculates thresholds. |
service/frontend/health_check_test.go | Adds tests for the new health states and proportion calculations. |
service/frontend/admin_handler.go | Passes the new configuration to the admin handler for health checks. |
proto/internal/temporal/server/api/enums/v1/cluster.proto, api/enums/v1/cluster.pb.go, api/enums/v1/cluster.go-helpers.pb.go | Updates proto definitions to include the new HEALTH_STATE_DECLINED_SERVING. |
common/dynamicconfig/constants.go | Adds new dynamic config setting with updated description for history host self error proportion. |
service/frontend/health_check.go
Outdated
// Make sure that at lease 2 hosts must be not ready to trigger this check.
proportionOfDeclinedServiceHosts := getProportionOfDeclinedServiceHosts(hostDeclinedServingCount/float64(len(hosts)), len(hosts))

hostDeclinedServingProportion := hostDeclinedServingCount / float64(len(hosts))
if hostDeclinedServingProportion > proportionOfDeclinedServiceHosts {
	h.logger.Warn("health check exceeded host declined serving proportion threshold", tag.NewFloat64("host declined serving proportion threshold", proportionOfDeclinedServiceHosts))
The local variable 'hostDeclinedServingProportion' declared here shadows the configured threshold value from the dynamic config, leading to a comparison against a computed measurement rather than a fixed configured threshold. Consider renaming this local variable to 'measuredDeclinedProportion' and using h.hostDeclinedServingProportion() to compute the intended threshold via getProportionOfDeclinedServiceHosts.
Existing:
// Make sure that at lease 2 hosts must be not ready to trigger this check.
proportionOfDeclinedServiceHosts := getProportionOfDeclinedServiceHosts(hostDeclinedServingCount/float64(len(hosts)), len(hosts))
hostDeclinedServingProportion := hostDeclinedServingCount / float64(len(hosts))
if hostDeclinedServingProportion > proportionOfDeclinedServiceHosts {
	h.logger.Warn("health check exceeded host declined serving proportion threshold", tag.NewFloat64("host declined serving proportion threshold", proportionOfDeclinedServiceHosts))

Suggested change:
// Make sure that at least 2 hosts must be not ready to trigger this check.
measuredDeclinedProportion := hostDeclinedServingCount / float64(len(hosts))
if measuredDeclinedProportion > h.hostDeclinedServingProportion() {
	h.logger.Warn("health check exceeded host declined serving proportion threshold", tag.NewFloat64("host declined serving proportion threshold", h.hostDeclinedServingProportion()))
@@ -38,4 +38,5 @@ enum HealthState {
  HEALTH_STATE_UNSPECIFIED = 0;
  HEALTH_STATE_SERVING = 1;
  HEALTH_STATE_NOT_SERVING = 2;
  HEALTH_STATE_DECLINED_SERVING = 3;
nit: add some comments on the states?
Added!
}

failedHostCountProportion := failedHostCount / float64(len(hosts))
if failedHostCountProportion+hostDeclinedServingProportion > h.hostFailurePercentage() {
I am a little worried about false positives:
- double counting in the failure proportion
- the minimum proportion
The double count in the failure proportion can't currently happen, because the health check only returns one value per host, and that single value is what gets counted. Am I misunderstanding what you mean? Each host can only have one status.
To the second point, I think a host that isn't serving traffic at all should factor into our hostFailurePercentage, to catch the earlier issue where we had some errors but most pods were crashlooping, so we didn't catch it. If some hosts are failing, the other hosts have to pick up the slack, so it's effectively "failing". WDYT?
service/frontend/health_check.go
Outdated
	return enumsspb.HEALTH_STATE_NOT_SERVING, nil
}

return enumsspb.HEALTH_STATE_SERVING, nil
}

func ensureMinimumProportionOfHosts(proportionOfDeclinedServingHosts float64, totalHosts int) float64 {
	minimumHostsFailed := 2.0 // We want to ensure that at least 2 fail before we notify the upstream.
nit: const
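A self-contained sketch of the kind of floor being discussed here, with the minimum host count as a const per the nit. The exact function in the PR may differ; this just shows one way to keep a single declined host from tripping the check on small clusters.

```go
package main

import "fmt"

// minimumHostsDeclined is the floor discussed above: fewer than two declined
// hosts should never be enough to notify upstream.
const minimumHostsDeclined = 2.0

// ensureMinimumProportionOfHosts raises the configured proportion threshold
// so that it cannot be exceeded unless at least minimumHostsDeclined hosts
// have declined to serve.
func ensureMinimumProportionOfHosts(configuredProportion float64, totalHosts int) float64 {
	if totalHosts == 0 {
		return configuredProportion
	}
	floor := minimumHostsDeclined / float64(totalHosts)
	if floor > configuredProportion {
		return floor
	}
	return configuredProportion
}

func main() {
	// With 10 hosts and a configured proportion of 0.05, the effective
	// threshold becomes 0.2, so one declined host (0.1) no longer exceeds it.
	fmt.Println(ensureMinimumProportionOfHosts(0.05, 10)) // 0.2
}
```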
* main: (22 commits)
  * Add host health metrics gauge (temporalio#7728)
  * add rule expiration check (temporalio#7749)
  * Add activity options to the pending activity info (temporalio#7727)
  * Enable DLQ V2 for replication (temporalio#7746)
  * chore: be smarter about when to use Stringer vs String (temporalio#7743)
  * versioning entity workflows: enabling auto-restart pt1 (temporalio#7715)
  * Refactor code generators (temporalio#7734)
  * allow passive to generate replication tasks (temporalio#7713)
  * Validate links in completion callbacks (temporalio#7726)
  * CHASM: Engine Update/ReadComponent implementation (temporalio#7696)
  * Enable transition history in dev env and tests (temporalio#7737)
  * chore: Add Stringer tags (temporalio#7738)
  * Add internal pod health check to DeepHealthCheck (temporalio#7709)
  * Rename internal CHASM task processing interface (temporalio#7730)
  * [Frontend] Log slow gRPC requests (temporalio#7718)
  * Remove cap for dynamic config callback pool (temporalio#7723)
  * Refactor updateworkflowoptions package (temporalio#7725)
  * Remove a bunch of redundant utf-8 validation (temporalio#7720)
  * [CHASM] Pure task processing - GetPureTasks, ExecutePureTasks (temporalio#7701)
  * Send ActivityReset flag to the worker in heartbeat response (temporalio#7677)
  * ...
What changed?
-> Part I (History service):
@dnr and others have recently implemented a health check to speed up the history service rollout by explicitly marking it as healthy when the history shards have stabilized after startup instead of waiting an arbitrary time period.
In this PR, I've reused this check in the history service's DeepHealthCheck to ensure the service has marked the host as healthy. This avoids a false positive: a host that hasn't finished starting isn't receiving error metrics yet, so it isn't accurate to report it as serving while history shards are still shuffling. I've also added a new health state (now HEALTH_STATE_DECLINED_SERVING) to differentiate a host that hasn't started serving from an unhealthy one, which is used upstream and helps with error propagation and debugging.
(footnote) Further background on the host serving status for documentation and general interest:
The status is set here after the host has both joined ringpop and its shards have stabilized. It reuses a gRPC history health check that was previously implemented but unused.
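As a sketch of that lifecycle, the snippet below shows how the shared grpc health.Server could be flipped to SERVING only after membership is joined and shards have stabilized. The readiness callbacks and the service name are placeholders, not temporal's actual functions.

```go
package main

import (
	"context"

	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// markServingWhenReady flips the in-process health server to SERVING only
// after membership has been joined and shards have stopped shuffling. Until
// then, deep health checks see NOT_SERVING and report the host as having
// declined to serve rather than as failed.
func markServingWhenReady(
	ctx context.Context,
	hs *health.Server,
	service string,
	joinedMembership func(context.Context) error,
	shardsStabilized func(context.Context) error,
) error {
	hs.SetServingStatus(service, healthpb.HealthCheckResponse_NOT_SERVING)
	if err := joinedMembership(ctx); err != nil {
		return err
	}
	if err := shardsStabilized(ctx); err != nil {
		return err
	}
	hs.SetServingStatus(service, healthpb.HealthCheckResponse_SERVING)
	return nil
}

func main() {
	hs := health.NewServer()
	// Placeholder readiness signals standing in for the ringpop join and
	// shard ownership stabilization described above.
	ready := func(context.Context) error { return nil }
	_ = markServingWhenReady(context.Background(), hs, "history", ready, ready)
}
```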
-> Part II (Frontend service, aggregator of history.DeepHealthCheck):
In the frontend service, I've set the default proportion to 5% based on guidance from @00Sim that this would already be quite a high error rate, even accounting for rollouts. The declined-serving proportion is also added to the NOT_SERVING (failed) proportion, so that a combined outage, where some pods are completely down while others are crashlooping, is still detected.
Why?
This PR is a first step toward improving failure detection for crashlooping pods.
It's a small improvement as part of the effort to detect CrashLoopBackOff: pods are not marked healthy until the service has marked them as such during startup. If the pods never "start up", that's a good indication that something is wrong.
How did you test it?
We can test during rollout by ensuring the failure detection is working as expected.
Potential risks
If the existing check is not marking pods as healthy, we could break the autodetection code. This is not directly workflow (customer)-impacting.
Documentation
N/A
Is hotfix candidate?
No