laniehei
Member

@laniehei laniehei commented May 7, 2025

What changed?

Add a new metrics gauge to track host health

Why?

Visibility into which hosts are failing, and when

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

Adding Sean to the PR to make sure I didn't miss anything obvious with regard to how we want to display this.

@laniehei laniehei requested review from swgillespie and yux0 May 7, 2025 23:51
@laniehei laniehei requested a review from a team as a code owner May 7, 2025 23:51
HEALTH_STATE_NOT_SERVING = 2;
HEALTH_STATE_DECLINED_SERVING = 3;
// The host is unhealthy through external observation.
HEALTH_STATE_NOT_SERVING = 2;
Member Author

This change somehow didn't make it into the previous PR.

@@ -202,15 +202,18 @@ func (h *Handler) DeepHealthCheck(
return nil, err
}
if status.Status != healthpb.HealthCheckResponse_SERVING {
metrics.HistoryHostHealthGauge.With(h.metricsHandler).Record(float64(enumsspb.HEALTH_STATE_DECLINED_SERVING))
Contributor

Is this metric per host?

Member Author

Yes! Because this is recorded on the handler, Prometheus scrapes the metric with per-cell and per-host information, so we'll be able to build a graph of these checks per cluster and per host.
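
For readers following along, here is a rough sketch of the idea: each host records its latest health state as a gauge value, so a dashboard can show one series per host. It does not use Temporal's `metrics` package. `hostGauge`, `Record`, `Value`, and the host names are hypothetical stand-ins, and the value `1` for the serving state is an assumption (only `2` and `3` appear in this PR).

```go
package main

import (
	"fmt"
	"sync"
)

// HealthState mirrors the enum values visible in this PR's diff.
type HealthState int

const (
	HealthStateServing         HealthState = 1 // assumed value, not shown in this PR
	HealthStateNotServing      HealthState = 2 // unhealthy through external observation
	HealthStateDeclinedServing HealthState = 3
)

// hostGauge is a hypothetical in-memory stand-in for a metrics gauge.
// A gauge keeps only the most recent value recorded for each host.
type hostGauge struct {
	mu     sync.Mutex
	values map[string]float64
}

func newHostGauge() *hostGauge {
	return &hostGauge{values: make(map[string]float64)}
}

// Record overwrites the current value for the given host.
func (g *hostGauge) Record(host string, state HealthState) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[host] = float64(state)
}

// Value returns the last recorded value for the host, if any.
func (g *hostGauge) Value(host string) (float64, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	v, ok := g.values[host]
	return v, ok
}

func main() {
	g := newHostGauge()
	g.Record("history-0", HealthStateServing)
	g.Record("history-1", HealthStateDeclinedServing)
	// A later check overwrites the earlier value for the same host.
	g.Record("history-1", HealthStateNotServing)

	for _, host := range []string{"history-0", "history-1"} {
		v, _ := g.Value(host)
		fmt.Printf("%s health_state=%v\n", host, v)
	}
}
```

In the real handler the host dimension comes from how the metrics are scraped rather than from an explicit key, so the map here only makes the per-host series visible in one process.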

@laniehei laniehei requested a review from yux0 May 8, 2025 23:44
@laniehei laniehei merged commit bf35d1e into main May 12, 2025
53 checks passed
@laniehei laniehei deleted the lanie/metric-fault-detection branch May 12, 2025 17:40
josesa added a commit to josesa/temporal that referenced this pull request May 12, 2025
* main: (22 commits)
  Add host health metrics gauge (temporalio#7728)
  add rule expiration check (temporalio#7749)
  Add activity options to the pending activity info (temporalio#7727)
  Enable DLQ V2 for replication (temporalio#7746)
  chore: be smarter about when to use Stringer vs String (temporalio#7743)
  versioning entity workflows: enabling auto-restart pt1 (temporalio#7715)
  Refactor code generators (temporalio#7734)
  allow passive to generate replication tasks (temporalio#7713)
  Validate links in completion callbacks (temporalio#7726)
  CHASM: Engine Update/ReadComponent implementation (temporalio#7696)
  Enable transition history in dev env and tests (temporalio#7737)
  chore: Add Stringer tags (temporalio#7738)
  Add internal pod health check to DeepHealthCheck (temporalio#7709)
  Rename internal CHASM task processing interface (temporalio#7730)
  [Frontend] Log slow gRPC requests (temporalio#7718)
  Remove cap for dynamic config callback pool (temporalio#7723)
  Refactor updateworkflowoptions package (temporalio#7725)
  Remove a bunch of redundant utf-8 validation (temporalio#7720)
  [CHASM] Pure task processing - GetPureTasks, ExecutePureTasks (temporalio#7701)
  Send ActivityReset flag to the worker in heartbeat response (temporalio#7677)
  ...