-
Notifications
You must be signed in to change notification settings - Fork 75
TaskQueue per-priority stats #611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c052655
to
37c774d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems straightforward and reasonable
temporal.api.taskqueue.v1.TaskQueueStats stats = 5; | ||
|
||
// Task queue stats breakdown by priority key. Only contains actively used priority keys. | ||
// Only set if `report_stats` is set on the request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is expensive enough to do this on top of other stats such that it'd deserve its own opt-in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No extra cost here, internally the total stats are actually just the aggregation of the per-priority stats.
// Only set if `report_stats` is set on the request. | ||
// (-- api-linter: core::0140::prepositions=disabled | ||
// aip.dev/not-precedent: "by" is used to clarify the keys and values. --) | ||
map<int32, temporal.api.taskqueue.v1.TaskQueueStats> stats_by_priority = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this contain all priority keys ever used and/or is there some max count and/or is it only priority keys used recently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's difficult to summarize that in a sentence or two; but generally speaking, it will contain all "recently" used priority keys plus the ones from any backlogs. So the first part is fairly transient and wouldn't survive a reboot; but the second part is durable. So it depends. (none of this is different from the regular stats we already have)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best short description I could give: a priority key will have stats if ...
- it was active within the last 30s (unless there was a server restart)
- it has accumulated a backlog
Not sure I'd want to commit to the exact number there (30s) though as that's an internal detail. Hence the (admittedly) fuzzy "actively used" wording.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the info, I do like the new "activity used" wording and it makes sense to me to be vague. My only remaining concern is if we want to add an extreme max (or are we ok returning millions here?). But we don't have to decide that in the API part, all good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current implementation, won't it be all the ones that have ever been used? Or do you filter them out?
(I would like to eventually "GC" subqueue ids that aren't in active use, but probably not soon.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you're right. Even priorities that have not accumulated a backlog but where actively used once will be returned, too (until a server reboot).
// Only set if `report_stats` is set on the request. | ||
// (-- api-linter: core::0140::prepositions=disabled | ||
// aip.dev/not-precedent: "by" is used to clarify the keys and values. --) | ||
map<int32, temporal.api.taskqueue.v1.TaskQueueStats> stats_by_priority = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way for me as a user to even see the keys that might make up this list (i.e. just see the priority keys in use) without necessarily asking for stats on them? Is it reasonable to want to describe things without asking for their stats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this stats field is really the only way to find the in-use priority keys across the task queue.
Describing just the priority keys has the same cost as asking for the stats (ie non-negligible, requiring opt-in). And at that point, what are you really gaining by adding yet another boolean flag and response field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the only way to list priority keys is to also get there stats, there is no need for another field. Just can be a bit confusing to a user that sometimes they need to ask for stats just to check existence of something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think asking for the just set of priority keys that have been used (ever or recently) is that useful.. what's a use case you're thinking of?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To know what priority keys exist, same as wanting to know what task queues exist (though I understand we struggle to provide that to users too). One such situation may be to describe the configuration of a priority key (assuming we have that which I think I saw). In general if we have a "thing" in Temporal, it is often worth being able to describe it (and bonus for being able to list all of them). Just not sure if "stats" makes sense as a "describe" equivalent. But seeing how priority key configuration is just config on the task queue, I can see it not being a top-level thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The priority key is just a number from 1 to 5. We have at most 5 of them, and we don't have per-priority-key configuration. It's not really a "thing".
(We did consider per-priority-key rate limits as configuration but it doesn't make much sense, we can address those needs a better way)
## What changed? More granualar task queue stats tracking by priority level. ## Why? Preparation for temporalio/api#611 where we intend to expose the priority-level stats to users. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) --------- Co-authored-by: David Reiss <dnr@dnr.im>
## What changed? Adding a breakdown of stats per-priority levels to the two APIs that return task queue stats already. ## Why? New API is defined here: temporalio/api#611. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [x] added new functional test(s)
READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.
What changed?
Adding a breakdown of stats per-priority levels to the two APIs that return task queue stats already.
Why?
To give users insights on a per priority level.
Breaking changes
N/A
Server PR
temporalio/temporal#8030