Add host health metrics gauge #7728
HEALTH_STATE_NOT_SERVING = 2;
HEALTH_STATE_DECLINED_SERVING = 3;
// The host is unhealthy through external observation.
HEALTH_STATE_NOT_SERVING = 2;
This change somehow didn't make it into the previous PR.
@@ -202,15 +202,18 @@ func (h *Handler) DeepHealthCheck(
return nil, err
}
if status.Status != healthpb.HealthCheckResponse_SERVING {
metrics.HistoryHostHealthGauge.With(h.metricsHandler).Record(float64(enumsspb.HEALTH_STATE_DECLINED_SERVING))
Is this metric per host?
Yes! Because this is recorded on the handler, Prometheus scrapes it with per-cell and per-host information. We'll be able to build a graph from it that shows health checks per cluster and per host.
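To illustrate the per-host behavior described above, here is a minimal, self-contained sketch of a gauge that stores the latest health state for each host. It does not use the server's real metrics package: `HostGauge`, `RecordHealth`, and the host names are hypothetical, and the value `1` for a serving host is an assumption (only `2` and `3` appear in the diff in this PR).

```go
package main

import (
	"fmt"
	"sync"
)

// HealthState mirrors the enum values shown in the diff. The Go type is
// hypothetical and exists only for this sketch.
type HealthState int

const (
	HealthStateServing         HealthState = 1 // assumed value, not shown in the diff
	HealthStateNotServing      HealthState = 2 // unhealthy through external observation
	HealthStateDeclinedServing HealthState = 3 // host reported a non-SERVING status
)

// HostGauge is a hypothetical in-memory stand-in for a gauge whose series
// are distinguished by host, the way a per-host scrape would see them.
type HostGauge struct {
	mu     sync.Mutex
	values map[string]float64
}

// NewHostGauge returns an empty gauge.
func NewHostGauge() *HostGauge {
	return &HostGauge{values: make(map[string]float64)}
}

// Record overwrites the current value for a host. A gauge keeps only the
// most recent value, unlike a counter.
func (g *HostGauge) Record(host string, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[host] = v
}

// Value returns the last recorded value for a host and whether one exists.
func (g *HostGauge) Value(host string) (float64, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	v, ok := g.values[host]
	return v, ok
}

// RecordHealth maps a serving flag to a health state, records it as a
// float64 on the gauge, and returns the state that was recorded.
func RecordHealth(g *HostGauge, host string, serving bool) HealthState {
	state := HealthStateServing
	if !serving {
		state = HealthStateDeclinedServing
	}
	g.Record(host, float64(state))
	return state
}

func main() {
	g := NewHostGauge()
	RecordHealth(g, "history-0", true)
	RecordHealth(g, "history-1", false)
	for _, host := range []string{"history-0", "history-1"} {
		v, _ := g.Value(host)
		fmt.Printf("%s=%v\n", host, v)
	}
}
```

Encoding the enum as a float keeps one series per host, so a dashboard can plot the state over time or alert when any host reports a value other than the serving one.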
* main: (22 commits)
  Add host health metrics gauge (temporalio#7728)
  add rule expiration check (temporalio#7749)
  Add activity options to the pending activity info (temporalio#7727)
  Enable DLQ V2 for replication (temporalio#7746)
  chore: be smarter about when to use Stringer vs String (temporalio#7743)
  versioning entity workflows: enabling auto-restart pt1 (temporalio#7715)
  Refactor code generators (temporalio#7734)
  allow passive to generate replication tasks (temporalio#7713)
  Validate links in completion callbacks (temporalio#7726)
  CHASM: Engine Update/ReadComponent implementation (temporalio#7696)
  Enable transition history in dev env and tests (temporalio#7737)
  chore: Add Stringer tags (temporalio#7738)
  Add internal pod health check to DeepHealthCheck (temporalio#7709)
  Rename internal CHASM task processing interface (temporalio#7730)
  [Frontend] Log slow gRPC requests (temporalio#7718)
  Remove cap for dynamic config callback pool (temporalio#7723)
  Refactor updateworkflowoptions package (temporalio#7725)
  Remove a bunch of redundant utf-8 validation (temporalio#7720)
  [CHASM] Pure task processing - GetPureTasks, ExecutePureTasks (temporalio#7701)
  Send ActivityReset flag to the worker in heartbeat response (temporalio#7677)
  ...
What changed?
Adds a new metrics gauge that tracks host health.
Why?
Gives visibility into which hosts are failing and when.
How did you test it?
Potential risks
Adding Sean to the PR to make sure I didn't miss anything obvious about how we want to display this.