-
Notifications
You must be signed in to change notification settings - Fork 75
Add worker visibility API - RecordWorkerHeartbeat and ListWorkers #601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
temporal/api/worker/v1/message.proto
Outdated
|
||
// Worker identity, set by the client, may not be unique. | ||
// Usually host_name+(user group name)+process_id, but can be overwritten by the user. | ||
string worker_identity = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mentioned on previous PR, this is not specific to host just because it has a default that is, and therefore this should be moved to top-level worker info IMO. I also think the other two fields don't deserve their own message any more than SDK name/version do, but not a strong opinion there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think last time I either missread this or the context was different. I agree, worker_identity shouldn't be here. I will move it out. But I will keep the structure - unlike SDK I feel it will make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unlike SDK I feel it will make sense
Can you clarify how host name and process ID in its own object makes sense but SDK name and version does not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already did in previous PR - this structure can be extended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SDK structure could be extended too. I am not sure I understand why host uniquely deserves its own message just because it may be extended like any other. It's obviously not a big deal, just a strange inconsistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The probability is different. I think I've shared what I can on this topic, not much to add. Let's call it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the probability is different. I have noticed you have chosen not to put CPU and memory host info into host info, so I am suspecting you may not put other host info into host info? Also, how sure are you there is no more SDK info we may want in SDK info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hear your concerns, but I'm not convinced the removal is necessary based on the arguments presented. Keeping has value. Let's keep this one for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is hugely important either way, but if you're convinced this is worth keeping, I'd agree with Chad that putting the CPU/Mem in there makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm.
Pro: it is host related, not worker related. So should be in host struct since we have it.
Con: not all metrics in one place.
Well. Yes, Pro > Con for me in this case, I will move them.
//* SdkName | ||
//* SdkVersion | ||
//* StartTime | ||
//* LastHeartbeatTime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there concerns on updating this value so frequently in visibility store?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are going to update them (in batch, but still) for every heartbeat. I don't think we can avoid updating this info/skip updates. Question is - will it be indexable? This is something to decide in P2 (I think it shouldn't, but may be users will lead us in different direction).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 My only concern here is whether indexable. Do we want to remove this from the indexable item list here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we for sure want to be able to query by "last heartbeat time".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we? Or do we need to query by a certain state that may relate to last heartbeat time (e.g. staleness)?
will it be indexable? [...] I think it shouldn't [...] we for sure want to
I am a bit confused here. Doesn't it have to be indexable to query on it? So if I have 5 million workers, as part of this first version, do we expect users to be able to query for all last heartbeat times before a certain timestamp? How can we defer indexability to a later decision but list it as an indexable attribute here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we defer indelibility to a later decision
Easily - this is a two-way door decision, and right now it is a comment.
We are going to use "in-memory" approach for MVP.
Id we decide not to index it later - we will remove it from the comment (and from filtering list).
} | ||
|
||
message ListWorkersResponse { | ||
repeated temporal.api.worker.v1.WorkerInfo worker_info = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider making a wrapper message for this instead of returning the same worker info. There may be a time where we want server-set info and we'll regret we had nowhere to put it. Consider something like:
message WorkerDescription {
temporal.api.worker.v1.WorkerInfo info = 1;
}
Or WorkerListEntry
or anything. Basically some way for server to have room to add server-only things and not assume that the only thing we will ever return from list is the exact thing worker gives us. I expect some server-set info in here one day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just add server-only fields to WorkerInfo and mark them as such?
There's no shortage of tag numbers...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it gets a little confusing to only accept some fields from worker on the same message. If we're willing to make separate messages for trivial things like host info, we should be willing to create separate messages for worker-provided data vs non-worker-provided data. I think of it like spec vs status in k8s. Or like in schedules case, ScheduleListEntry
/ScheduleListInfo
that wraps ScheduleSpec
. I don't think it's very future proof to reuse the spec type as the list type with out some kind of wrapping. A single-field message just for list entries is mostly harmless but lets us grow as needed with the list response items.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do expect to have something else (coming from server) in this answer, besides just heartbeat info. Creating separate structure for further extension make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also will allow us to separate receiving heartbeats from providing this info to the users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keeping open
} | ||
|
||
message ListWorkersResponse { | ||
repeated temporal.api.worker.v1.WorkerInfo worker_info = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just add server-only fields to WorkerInfo and mark them as such?
There's no shortage of tag numbers...
temporal/api/worker/v1/message.proto
Outdated
// Number of slots used by the worker for specific tasks. | ||
int32 current_slots_used = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slots used by the worker for specific tasks.
@Sushisource - can you confirm that this is intentionally slots marked "used" and not meant to be slots reserved? Meaning you can't really get total slots on available + used (unless you also add active pollers, but even then there are races e.g. with eager activities/workflows).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think having it be used makes good sense. It's more useful than reserved. I don't think being able to add them together to get the total is really useful in any meaningful way. Remaining available is the useful thing (if known).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work on this. Nothing blocking, but a couple last things worth considering.
temporal/api/worker/v1/message.proto
Outdated
|
||
// Worker identity, set by the client, may not be unique. | ||
// Usually host_name+(user group name)+process_id, but can be overwritten by the user. | ||
string worker_identity = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is hugely important either way, but if you're convinced this is worth keeping, I'd agree with Chad that putting the CPU/Mem in there makes sense.
What changed?
What changed?
This is "for merge".
Here is a previous PR, where we discuss this at length: #599
Why?
Part of the worker visibility work. Workers should be able to report their status.
Users should be able to get information about workers.
Breaking changes
Yes - new API is introduce.
Server PR
Server PR