Conversation

@stephanos (Contributor) commented Jun 24, 2025

What changed?

Added task queue stats reporting to the `DescribeWorkerDeploymentVersion` API.

Why?

To allow inspecting the backlog and other stats of the task queues belonging to a particular worker deployment version.

How did you test it?

  • built
  • ran locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

@stephanos force-pushed the worker-deployment-stats branch 4 times, most recently from d5ecba5 to 71bb52e on June 25, 2025 16:40
resp := &matchingservice.DescribeVersionedTaskQueuesResponse{}
for _, tq := range request.VersionTaskQueues {
	tqResp, err := e.DescribeTaskQueue(ctx,
		&matchingservice.DescribeTaskQueueRequest{

@stephanos (Author):

I might make these concurrently. Didn't get there yet.

		})
}
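
For reference, a minimal sketch of what the concurrent version could look like, using golang.org/x/sync/errgroup. The types and the `describeOne` callback are stand-ins for the matchingservice types and `e.DescribeTaskQueue`, not the PR's actual code:

```go
package example

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Stand-ins for the real matchingservice types.
type VersionTaskQueue struct {
	Name string
	Type int32
}

type DescribeTaskQueueResponse struct{} // stats, pollers, etc. elided

// describeAll fans out one describe call per task queue, with bounded
// concurrency and cancel-on-first-error, preserving input order in results.
func describeAll(
	ctx context.Context,
	tqs []VersionTaskQueue,
	describeOne func(context.Context, VersionTaskQueue) (*DescribeTaskQueueResponse, error),
) ([]*DescribeTaskQueueResponse, error) {
	results := make([]*DescribeTaskQueueResponse, len(tqs))
	g, gctx := errgroup.WithContext(ctx)
	g.SetLimit(10) // cap the fan-out so a large version doesn't overwhelm matching
	for i, tq := range tqs {
		i, tq := i, tq // capture loop variables (needed before Go 1.22)
		g.Go(func() error {
			r, err := describeOne(gctx, tq)
			if err != nil {
				return err // first error cancels gctx and fails the whole fan-out
			}
			results[i] = r // each goroutine writes a distinct index: no mutex needed
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return results, nil
}
```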

pm.PutCache(cacheKey, resp)

@stephanos (Author):

I had considered moving the cache into the matchingEngineImpl as part of this PR (it's outstanding tech debt), but that was difficult since the cache needs to be initialised with the TTL, and the TTL is a namespace-scoped config, so its lifetime doesn't match up with the engine's.
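
A toy illustration of that lifetime mismatch, with invented names (in the real code the TTL is a dynamic config lookup):

```go
package example

import "time"

// Invented stand-in: in the real code this is a dynamic config value.
type Config struct {
	// Namespace-scoped: the value depends on which namespace is asking,
	// so there is no single TTL available when the engine is constructed.
	DescribeCacheTTL func(namespace string) time.Duration
}

// The engine is built once per host, before any namespace is in play, so a
// TTL cache can't be initialised here; it has to live with a namespace-scoped
// component (today, the task queue partition manager) instead.
type matchingEngine struct {
	config *Config
}
```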

@stephanos force-pushed the worker-deployment-stats branch 3 times, most recently from 9f51c31 to 2aa69b3 on June 25, 2025 17:03
DescRequest: &workflowservice.DescribeTaskQueueRequest{
	TaskQueue: &taskqueuepb.TaskQueue{
		Name: tq.Name,
		Kind: enumspb.TaskQueueKind(tq.Type),

Contributor:

This seems wrong; in this case it's always the NORMAL kind.

@stephanos (Author):

You're right; good catch. I probably auto-completed that. Now I wonder why the tests are passing ...
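
For the record, a hedged sketch of the fix: the queue's Kind is always NORMAL here, and tq.Type (workflow vs. activity) presumably belongs in the request's TaskQueueType field rather than in Kind:

```go
DescRequest: &workflowservice.DescribeTaskQueueRequest{
	TaskQueue: &taskqueuepb.TaskQueue{
		Name: tq.Name,
		Kind: enumspb.TASK_QUEUE_KIND_NORMAL, // always NORMAL in this code path
	},
	TaskQueueType: enumspb.TaskQueueType(tq.Type), // the task *type*, not the kind
```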

@ShahabT (Contributor) left a comment:

looks good to me otherwise.

	}
	taskQueues = append(taskQueues, element)
}
infos := make([]*deploymentpb.WorkerDeploymentVersionInfo_VersionTaskQueueInfo, 0, len(taskQueueInfos))

Contributor:

This part will be removed once the deprecated field is cleaned up, right?

@stephanos (Author) commented Jun 25, 2025:

Exactly!

if tqRespTQ, ok := tqRespMap[tqKey(tq.Name, tq.Type)]; ok {
	tqOutputs[i].Stats = tqRespTQ.Stats
} else {
	// Setting empty stats instead of leaving nil (which is only used when not querying for stats).

@ShahabT (Contributor) commented Jun 25, 2025:

Not sure about this. Empty stats will mislead users into thinking the stats are all zero, which is a valid stats value in itself. It's dangerous if users think the backlog is empty while it actually is not.

Options are reporting a partial result with some nils in the list, or failing the whole call.

@stephanos (Author):

I was thinking about this; it's not clear to me yet why/when we wouldn't have a result there.

Contributor:

Yeah, other than random edge cases I cannot think of one either... but in that case, why not just return an error for the whole thing? Better to be safe than to return incorrect results.

@stephanos (Author):

👍 okidoki
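
The agreed-upon behavior, sketched against the diff above (the error type and message are assumptions; this assumes `fmt` and go.temporal.io/api/serviceerror are imported and slots into the loop shown earlier):

```go
tqRespTQ, ok := tqRespMap[tqKey(tq.Name, tq.Type)]
if !ok {
	// Fail the whole call rather than report empty stats, which would read
	// as a valid all-zero result (e.g. an empty backlog) to the caller.
	return nil, serviceerror.NewInternal(
		fmt.Sprintf("missing stats for task queue %q", tq.Name))
}
tqOutputs[i].Stats = tqRespTQ.Stats
```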

@@ -1354,6 +1363,56 @@ func (e *matchingEngineImpl) DescribeTaskQueue(
	return descrResp, nil
}

func (e *matchingEngineImpl) DescribeVersionedTaskQueues(

Contributor:

So this API is for saving the tq fan-out by caching the version results, while DescribeTaskQueue is already caching the per-tq results, right?

It makes sense; I just wonder whether we had to do it now or could've waited for more signals that the tq fan-out will be a practical problem. But it's good that we have it.

@stephanos (Author):

Exactly! Since we have the cache, it was easy to add 🤷

	buildIds = []string{worker_versioning.WorkerDeploymentVersionToStringV31(request.Version)}
}

cacheKey := "dtq_default:" + strings.Join(buildIds, ",")

@stephanos (Author) commented Jun 26, 2025:

This needs to be keyed by the requested version, or the cached results for "all versions" vs. a single version collide.

Contributor:

Not for this PR, but could we always put the per-version stats in the cache, keyed by version string, so they don't collide? It'd mean the total result here would need to check the cache for all individual versions, but that should be fine, right?

@stephanos (Author):

👍 Good idea.
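
A hypothetical sketch of that per-version keying; GetCache mirrors the PutCache call seen earlier, and describeVersion/mergeVersionStats are invented helpers, not existing code:

```go
resp := &matchingservice.DescribeVersionedTaskQueuesResponse{}
for _, buildId := range buildIds {
	// One cache entry per version: "all versions" and "one version" requests
	// then share entries instead of colliding under a single key.
	key := "dtq:" + buildId
	if cached, ok := pm.GetCache(key); ok { // hypothetical read counterpart to PutCache
		mergeVersionStats(resp, cached) // hypothetical merge helper
		continue
	}
	versionResp, err := describeVersion(ctx, buildId) // hypothetical per-version describe
	if err != nil {
		return nil, err
	}
	pm.PutCache(key, versionResp)
	mergeVersionStats(resp, versionResp)
}
```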

Comment on lines +1404 to +1416
localPM, _, _ := e.getTaskQueuePartitionManager(ctx, partition, false, 0)
if localPM != nil {
	// If available, query the local partition manager to save a network call.
	tqResp, err = e.DescribeTaskQueue(ctx, tqReq)
} else {
	// Otherwise, query the other matching service instance.
	tqResp, err = e.matchingRawClient.DescribeTaskQueue(ctx, tqReq)
}
if err != nil {
	return nil, err // some other error, return it
}

@stephanos (Author):

I realized that our functional test setup doesn't cover this; but in production it would need to actually use the matching client for task queues that aren't on the instance.

Note that getTaskQueuePartitionManager, AFAIK, can also return nil when the partition has been unloaded but should be on this instance. I don't know how to check for that, so right now it would make a network call despite that.

Contributor:

Makes sense; it seems good to me.

For the future, I wonder whether we can improve the routing logic inside the client to use the local matching engine if the target instance ends up being the same.

Contributor:

Hmm, this best-effort thing is a little odd. My thought with the DescribeTaskQueue fanout was: the DescribeTaskQueue call goes to the root. So the root partition should be in the same process. So when asking the root, look up the pm, otherwise do an rpc. The "is it loaded locally" check can theoretically be wrong in both directions, as you point out.

In this case if we're doing a fanout to multiple task queues I think we should just do rpcs.

Contributor:

Hah, that's what #6733 does. But that's way overkill for this.

@stephanos (Author) commented Jun 27, 2025:

Oh, I didn't realize it can go wrong in both directions? I only assumed it could falsely state that the tq root isn't on this host (i.e., unloaded) and then make an avoidable RPC.

I'm trying to assess the impact here, and whether I need to patch it for the release ASAP. Are you saying it's incorrect, or unnecessary/overly complex?

Contributor:

Well, I suppose it could only be wrong in the positive direction for a few seconds, since we have the unload-on-membership-change monitor. The cache TTL is a few seconds, so... I would say it's not a correctness concern.

I just get worried about stuff like this because of the potential inconsistencies: the source of truth for partition ownership is membership, not the map in the matching engine. The map in the matching engine eventually follows membership, hopefully within a few seconds. So I'd rather just go to the source (by doing an rpc that gets routed by membership) than ask the map directly.

In the DescribeTaskQueue case, the rpc to the root just arrived, so that seems safe.

@stephanos (Author) commented Jun 27, 2025:

👍 I'll simplify it to always making the rpc.
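
i.e., roughly this, replacing the local-pm branch shown above (a sketch, not the final diff):

```go
// Always route through the matching client: ownership is resolved by
// membership (the source of truth) rather than the engine's local pm map.
tqResp, err := e.matchingRawClient.DescribeTaskQueue(ctx, tqReq)
if err != nil {
	return nil, err
}
```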


@stephanos force-pushed the worker-deployment-stats branch from abea51b to 7c46b92 on June 26, 2025 00:32
stephanos added a commit to temporalio/api that referenced this pull request Jun 26, 2025

**What changed?**

(1) Deprecated `task_queue_infos` in
`deployment.WorkerDeploymentVersionInfo`.
(2) Added `version_task_queues` to
`DescribeWorkerDeploymentVersionResponse`.

**Why?**

We want to report task queue stats for each task queue that is part of a
worker deployment version.

The challenge is that the `taskqueue` package depends on the
`deployment` package. So adding `TaskQueueStats` to
`deployment.WorkerDeploymentVersionInfo` causes a cycle import error.

Weighing our options, we decided to effectively _move_ the task
queue-related data from within the deployment package into the response
message.

**Breaking changes**

Not yet; but in subsequent releases the deprecated field
`task_queue_infos` will be removed.

**Server PR**

temporalio/temporal#7959 (draft)

---------

Co-authored-by: Spencer Judge <sjudge@hey.com>
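
From a Go caller's perspective, the API change above would be consumed roughly like this (a hedged sketch; the version-selector field is elided, and the accessors on the new version_task_queues entries are assumptions beyond the field name itself):

```go
package example

import (
	"context"
	"fmt"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func printVersionTaskQueueStats(ctx context.Context, c client.Client) error {
	resp, err := c.WorkflowService().DescribeWorkerDeploymentVersion(ctx,
		&workflowservice.DescribeWorkerDeploymentVersionRequest{
			Namespace: "default",
			// ... version selector elided ...
		})
	if err != nil {
		return err
	}
	// version_task_queues replaces the deprecated task_queue_infos and now
	// carries per-task-queue stats.
	for _, vtq := range resp.GetVersionTaskQueues() {
		fmt.Println(vtq.GetName(), vtq.GetStats()) // assumed accessors
	}
	return nil
}
```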
temporal-cicd bot pushed a commit to temporalio/api-go that referenced this pull request Jun 26, 2025
@stephanos marked this pull request as ready for review June 26, 2025 18:34
@stephanos requested a review from a team as a code owner June 26, 2025 18:34
@stephanos force-pushed the worker-deployment-stats branch 7 times, most recently from c86b8b9 to 361f03f on June 26, 2025 22:33
@stephanos enabled auto-merge (squash) June 26, 2025 23:11
@stephanos force-pushed the worker-deployment-stats branch from f008ee9 to 29b5466 on June 26, 2025 23:29
@stephanos force-pushed the worker-deployment-stats branch from 29b5466 to 54a15c6 on June 26, 2025 23:54
@stephanos force-pushed the worker-deployment-stats branch from 54a15c6 to 64daee2 on June 27, 2025 00:05
@stephanos merged commit 2730e2a into temporalio:main Jun 27, 2025
52 checks passed
@stephanos deleted the worker-deployment-stats branch June 27, 2025 00:35
@@ -134,6 +136,7 @@ type (
	gaugeMetrics   gaugeMetrics // per-namespace task queue counters
	config         *Config
	versionChecker headers.VersionChecker
	cache          cache.Cache

Contributor:

unused?

@stephanos (Author) commented Jun 27, 2025:

Ah, good catch! That's from when I realized this won't work due to the caching TTL being ns-scoped.

Fix in #7975.
