Skip to content

Conversation

ychebotarev
Copy link
Contributor

@ychebotarev ychebotarev commented Jun 4, 2025

What changed?
What changed?

  • 2 new APIs added:
    • WorkerHeartbeat
    • ListWorkerStatus
  • WorkerInfo information added to:
    • ShutDownWorker request
    • Poll(Activity|Nexus|Workflow)TaskQueue requests

This is "for merge".
Here is a previous PR, where we discuss this at length: #599

Why?
Part of the worker visibility work. Workers should be able to report their status.
Users should be able to get information about workers.

Breaking changes
Yes - new API is introduce.

Server PR
Server PR

@ychebotarev ychebotarev requested review from a team as code owners June 4, 2025 16:35

// Worker identity, set by the client, may not be unique.
// Usually host_name+(user group name)+process_id, but can be overwritten by the user.
string worker_identity = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned on previous PR, this is not specific to host just because it has a default that is, and therefore this should be moved to top-level worker info IMO. I also think the other two fields don't deserve their own message any more than SDK name/version do, but not a strong opinion there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think last time I either missread this or the context was different. I agree, worker_identity shouldn't be here. I will move it out. But I will keep the structure - unlike SDK I feel it will make sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlike SDK I feel it will make sense

Can you clarify how host name and process ID in its own object makes sense but SDK name and version does not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already did in previous PR - this structure can be extended.

Copy link
Member

@cretz cretz Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SDK structure could be extended too. I am not sure I understand why host uniquely deserves its own message just because it may be extended like any other. It's obviously not a big deal, just a strange inconsistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The probability is different. I think I've shared what I can on this topic, not much to add. Let's call it here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the probability is different. I have noticed you have chosen not to put CPU and memory host info into host info, so I am suspecting you may not put other host info into host info? Also, how sure are you there is no more SDK info we may want in SDK info?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear your concerns, but I'm not convinced the removal is necessary based on the arguments presented. Keeping has value. Let's keep this one for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is hugely important either way, but if you're convinced this is worth keeping, I'd agree with Chad that putting the CPU/Mem in there makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm.
Pro: it is host related, not worker related. So should be in host struct since we have it.
Con: not all metrics in one place.

Well. Yes, Pro > Con for me in this case, I will move them.

//* SdkName
//* SdkVersion
//* StartTime
//* LastHeartbeatTime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there concerns on updating this value so frequently in visibility store?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are going to update them (in batch, but still) for every heartbeat. I don't think we can avoid updating this info/skip updates. Question is - will it be indexable? This is something to decide in P2 (I think it shouldn't, but may be users will lead us in different direction).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 My only concern here is whether indexable. Do we want to remove this from the indexable item list here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we for sure want to be able to query by "last heartbeat time".

Copy link
Member

@cretz cretz Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we? Or do we need to query by a certain state that may relate to last heartbeat time (e.g. staleness)?

will it be indexable? [...] I think it shouldn't [...] we for sure want to

I am a bit confused here. Doesn't it have to be indexable to query on it? So if I have 5 million workers, as part of this first version, do we expect users to be able to query for all last heartbeat times before a certain timestamp? How can we defer indexability to a later decision but list it as an indexable attribute here?

Copy link
Contributor Author

@ychebotarev ychebotarev Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we defer indelibility to a later decision

Easily - this is a two-way door decision, and right now it is a comment.
We are going to use "in-memory" approach for MVP.
Id we decide not to index it later - we will remove it from the comment (and from filtering list).

}

message ListWorkersResponse {
repeated temporal.api.worker.v1.WorkerInfo worker_info = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making a wrapper message for this instead of returning the same worker info. There may be a time where we want server-set info and we'll regret we had nowhere to put it. Consider something like:

message WorkerDescription {
  temporal.api.worker.v1.WorkerInfo info = 1;
}

Or WorkerListEntry or anything. Basically some way for server to have room to add server-only things and not assume that the only thing we will ever return from list is the exact thing worker gives us. I expect some server-set info in here one day.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just add server-only fields to WorkerInfo and mark them as such?

There's no shortage of tag numbers...

Copy link
Member

@cretz cretz Jun 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it gets a little confusing to only accept some fields from worker on the same message. If we're willing to make separate messages for trivial things like host info, we should be willing to create separate messages for worker-provided data vs non-worker-provided data. I think of it like spec vs status in k8s. Or like in schedules case, ScheduleListEntry/ScheduleListInfo that wraps ScheduleSpec. I don't think it's very future proof to reuse the spec type as the list type with out some kind of wrapping. A single-field message just for list entries is mostly harmless but lets us grow as needed with the list response items.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do expect to have something else (coming from server) in this answer, besides just heartbeat info. Creating separate structure for further extension make sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also will allow us to separate receiving heartbeats from providing this info to the users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keeping open

}

message ListWorkersResponse {
repeated temporal.api.worker.v1.WorkerInfo worker_info = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just add server-only fields to WorkerInfo and mark them as such?

There's no shortage of tag numbers...

Comment on lines 29 to 30
// Number of slots used by the worker for specific tasks.
int32 current_slots_used = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slots used by the worker for specific tasks.

@Sushisource - can you confirm that this is intentionally slots marked "used" and not meant to be slots reserved? Meaning you can't really get total slots on available + used (unless you also add active pollers, but even then there are races e.g. with eager activities/workflows).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think having it be used makes good sense. It's more useful than reserved. I don't think being able to add them together to get the total is really useful in any meaningful way. Remaining available is the useful thing (if known).

@ychebotarev ychebotarev requested review from Sushisource, dnr and cretz June 4, 2025 22:00
@ychebotarev ychebotarev requested a review from cretz June 5, 2025 15:55
@ychebotarev ychebotarev changed the title Add worker visibility API - heartbeat and LisWorkers Add worker visibility API - heartbeat and ListWorkers Jun 5, 2025
@ychebotarev ychebotarev changed the title Add worker visibility API - heartbeat and ListWorkers Add worker visibility API - RecordWorkerHeartbeat and ListWorkers Jun 5, 2025
Copy link
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this. Nothing blocking, but a couple last things worth considering.


// Worker identity, set by the client, may not be unique.
// Usually host_name+(user group name)+process_id, but can be overwritten by the user.
string worker_identity = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is hugely important either way, but if you're convinced this is worth keeping, I'd agree with Chad that putting the CPU/Mem in there makes sense.

@ychebotarev ychebotarev merged commit 49f9286 into master Jun 5, 2025
5 checks passed
@ychebotarev ychebotarev deleted the y_hbt1 branch June 5, 2025 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants