Skip to content

Conversation

stephanos
Copy link
Contributor

@stephanos stephanos commented Jul 8, 2025

READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.

What changed?

Adding a breakdown of stats per-priority levels to the two APIs that return task queue stats already.

Why?

To give users insights on a per priority level.

Breaking changes

N/A

Server PR

temporalio/temporal#8030

@stephanos stephanos force-pushed the tq-pri-stats branch 2 times, most recently from c052655 to 37c774d Compare July 9, 2025 15:33
Copy link
Contributor

@dnr dnr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems straightforward and reasonable

temporal.api.taskqueue.v1.TaskQueueStats stats = 5;

// Task queue stats breakdown by priority key. Only contains actively used priority keys.
// Only set if `report_stats` is set on the request.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is expensive enough to do this on top of other stats such that it'd deserve its own opt-in?

Copy link
Contributor Author

@stephanos stephanos Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No extra cost here, internally the total stats are actually just the aggregation of the per-priority stats.

// Only set if `report_stats` is set on the request.
// (-- api-linter: core::0140::prepositions=disabled
// aip.dev/not-precedent: "by" is used to clarify the keys and values. --)
map<int32, temporal.api.taskqueue.v1.TaskQueueStats> stats_by_priority = 6;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this contain all priority keys ever used and/or is there some max count and/or is it only priority keys used recently?

Copy link
Contributor Author

@stephanos stephanos Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's difficult to summarize that in a sentence or two; but generally speaking, it will contain all "recently" used priority keys plus the ones from any backlogs. So the first part is fairly transient and wouldn't survive a reboot; but the second part is durable. So it depends. (none of this is different from the regular stats we already have)

Copy link
Contributor Author

@stephanos stephanos Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best short description I could give: a priority key will have stats if ...

  • it was active within the last 30s (unless there was a server restart)
  • it has accumulated a backlog

Not sure I'd want to commit to the exact number there (30s) though as that's an internal detail. Hence the (admittedly) fuzzy "actively used" wording.

Copy link
Member

@cretz cretz Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info, I do like the new "activity used" wording and it makes sense to me to be vague. My only remaining concern is if we want to add an extreme max (or are we ok returning millions here?). But we don't have to decide that in the API part, all good.

Copy link
Contributor

@dnr dnr Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation, won't it be all the ones that have ever been used? Or do you filter them out?

(I would like to eventually "GC" subqueue ids that aren't in active use, but probably not soon.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you're right. Even priorities that have not accumulated a backlog but where actively used once will be returned, too (until a server reboot).

// Only set if `report_stats` is set on the request.
// (-- api-linter: core::0140::prepositions=disabled
// aip.dev/not-precedent: "by" is used to clarify the keys and values. --)
map<int32, temporal.api.taskqueue.v1.TaskQueueStats> stats_by_priority = 6;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way for me as a user to even see the keys that might make up this list (i.e. just see the priority keys in use) without necessarily asking for stats on them? Is it reasonable to want to describe things without asking for their stats?

Copy link
Contributor Author

@stephanos stephanos Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this stats field is really the only way to find the in-use priority keys across the task queue.

Describing just the priority keys has the same cost as asking for the stats (ie non-negligible, requiring opt-in). And at that point, what are you really gaining by adding yet another boolean flag and response field?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the only way to list priority keys is to also get there stats, there is no need for another field. Just can be a bit confusing to a user that sometimes they need to ask for stats just to check existence of something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think asking for the just set of priority keys that have been used (ever or recently) is that useful.. what's a use case you're thinking of?

Copy link
Member

@cretz cretz Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To know what priority keys exist, same as wanting to know what task queues exist (though I understand we struggle to provide that to users too). One such situation may be to describe the configuration of a priority key (assuming we have that which I think I saw). In general if we have a "thing" in Temporal, it is often worth being able to describe it (and bonus for being able to list all of them). Just not sure if "stats" makes sense as a "describe" equivalent. But seeing how priority key configuration is just config on the task queue, I can see it not being a top-level thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The priority key is just a number from 1 to 5. We have at most 5 of them, and we don't have per-priority-key configuration. It's not really a "thing".

(We did consider per-priority-key rate limits as configuration but it doesn't make much sense, we can address those needs a better way)

stephanos added a commit to temporalio/temporal that referenced this pull request Jul 15, 2025
## What changed?

More granualar task queue stats tracking by priority level.

## Why?

Preparation for temporalio/api#611 where we
intend to expose the priority-level stats to users.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

---------

Co-authored-by: David Reiss <dnr@dnr.im>
@stephanos stephanos merged commit 2c64a5d into temporalio:master Jul 22, 2025
5 checks passed
@stephanos stephanos deleted the tq-pri-stats branch July 22, 2025 20:06
stephanos added a commit to temporalio/temporal that referenced this pull request Jul 22, 2025
## What changed?

Adding a breakdown of stats per-priority levels to the two APIs that
return task queue stats already.

## Why?

New API is defined here: temporalio/api#611.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [x] added new functional test(s)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants