Skip to content

Conversation

lina-temporal
Copy link
Contributor

@lina-temporal lina-temporal commented May 6, 2025

What changed?

A new interceptor has been written and added to frontend's FX chain that logs slow gRPC requests. Requests are logged with their elapsed time, and more importantly, their extracted request IDs/workflow tags, so on-call can quickly determine the problematic IDs.

This is what the log line might look like:

{
  "level": "warn",
  "ts": "2025-05-06T15:34:35.600-0700",
  "msg": "Slow gRPC call for 'temporal.api.workflowservice.v1.WorkflowService/DescribeWorkflowExecution', took 6.000857875s",
  "wf-id": "fc48187c-3df0-4c6e-b898-027ffe3e9066",
  "wf-run-id": "0196a7bc-73ec-7f1b-890d-a0bbbf63ac21",
  "logging-call-at": "/Users/lina/code/temporal/common/rpc/interceptor/slow_request_logger.go:59"
}

Why?

Because these request IDs are often deep within a protobuf request body, they aren't visible through Envoy/Chronicle logs, so although we can often see that a customer is experiencing request-reply latency, we have little else to follow-up on without asking them for more information.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

We could end up with spammy log lines if we have lots of methods which regularly complete over the threshold, that aren't on the ignore list. I based the ignore and threshold on the frontend API dashboards across a few cells.

@lina-temporal lina-temporal requested a review from a team as a code owner May 6, 2025 22:53
@lina-temporal lina-temporal added API Issues/features involving the API operations labels May 6, 2025
lina-temporal and others added 2 commits May 7, 2025 09:28
Co-authored-by: David Reiss <david@temporal.io>
@lina-temporal lina-temporal requested a review from dnr May 7, 2025 18:21
@lina-temporal lina-temporal requested a review from dnr May 7, 2025 21:26
Copy link
Contributor

@dnr dnr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks!

lina-temporal and others added 2 commits May 8, 2025 10:49
Co-authored-by: David Reiss <david@temporal.io>
Co-authored-by: David Reiss <david@temporal.io>
@lina-temporal lina-temporal enabled auto-merge (squash) May 8, 2025 17:50
@lina-temporal lina-temporal merged commit c43e04e into main May 8, 2025
53 checks passed
@lina-temporal lina-temporal deleted the log_slow_ops branch May 8, 2025 18:47
josesa added a commit to josesa/temporal that referenced this pull request May 12, 2025
* main: (22 commits)
  Add host health metrics gauge (temporalio#7728)
  add rule expiration check (temporalio#7749)
  Add activity options to the pending activity info (temporalio#7727)
  Enable DLQ V2 for replication (temporalio#7746)
  chore: be smarter about when to use Stringer vs String (temporalio#7743)
  versioning entity workflows: enabling auto-restart pt1 (temporalio#7715)
  Refactor code generators (temporalio#7734)
  allow passive to generate replication tasks (temporalio#7713)
  Validate links in completion callbacks (temporalio#7726)
  CHASM: Engine Update/ReadComponent implementation (temporalio#7696)
  Enable transition history in dev env and tests (temporalio#7737)
  chore: Add Stringer tags (temporalio#7738)
  Add internal pod health check to DeepHealthCheck (temporalio#7709)
  Rename internal CHASM task processing interface (temporalio#7730)
  [Frontend] Log slow gRPC requests (temporalio#7718)
  Remove cap for dynamic config callback pool (temporalio#7723)
  Refactor updateworkflowoptions package (temporalio#7725)
  Remove a bunch of redundant utf-8 validation (temporalio#7720)
  [CHASM] Pure task processing - GetPureTasks, ExecutePureTasks (temporalio#7701)
  Send ActivityReset flag to the worker in heartbeat response (temporalio#7677)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Issues/features involving the API operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants