SetWorkerDeploymentCurrentVersion: Versioning-3.1 #7154

Shivs11 · 2025-01-23T22:44:48Z

What changed?

title

Why?

versioning-3.1

How did you test it?

existing suite of tests + added functional tests

Potential risks

None, feature branch

Documentation

Is hotfix candidate?

No

Shivs11 · 2025-01-23T22:45:54Z

go.mod

@@ -7,7 +7,7 @@ retract (
 	v1.26.0 // Published accidentally.
 )

-replace go.temporal.io/api => go.temporal.io/api v1.43.2-0.20250121234928-f4e3b077daa6
+replace go.temporal.io/api => go.temporal.io/api v1.43.2-0.20250123223624-792f7c873972


pointing to a feature on api-go: ss-routing-info-in-worker-deployment
Corresponding API PR: temporalio/api#527

proto/internal/temporal/server/api/deployment/v1/message.proto

Shivs11 · 2025-01-23T22:48:36Z

service/frontend/workflow_handler.go

+			return &workflowservice.SetWorkerDeploymentCurrentVersionResponse{
+				PreviousVersion: request.Version.Value, // previousVersion is the version which was requested to be set as current
+			}, nil


idempotency

I think it's cleaner to just not return any error if the requested Current is already Current. No need to handle the error here.

Shivs11 · 2025-01-23T22:49:41Z

service/worker/workerdeployment/client.go

@@ -268,72 +267,14 @@ func (d *ClientImpl) DescribeWorkerDeployment(
 	return d.deploymentStateToDeploymentInfo(ctx, namespaceEntry, deploymentName, queryResponse.State)
 }

-func (d *ClientImpl) GetCurrentVersion(


I was in two minds about this - should we keep this method purely for testing purposes and to verify if local data changes all the way to the version workflow later were indeed happening? Removed it for now since it's not an API exposed on the frontend

Shivs11 · 2025-01-23T22:50:26Z

service/worker/workerdeployment/version_activities.go

+				Operation: &matchingservice.SyncDeploymentUserDataRequest_UpdateVersionData{
+					UpdateVersionData: syncData.Data,
+				},


Note: I have not added code when we shall remove a version from the task-queue - will be coming in later

proto/internal/temporal/server/api/persistence/v1/task_queues.proto

carlydf · 2025-01-24T03:52:50Z

service/frontend/workflow_handler.go

+		return nil, err
+	}
+
+	// TODO (Shivam) : Verify if sending in uuid.New() is correct


Whether uuid.New() is the correct thing to pass in for requestID depends on the purpose of requestID. Since you're generating the UUID instead of getting it in the user's request, if the user's first request times out / fails and they retry, the request id would be different for their second request.

Looking at handleSetCurrent (which is where the request ID ends up (via SetCurrentVersionArgs) it looks like you're not currently using the request ID for any de-duping.

It's possible that you don't need to de-dupe at all, since SetCurrent as a whole is idempotent. If someone calls SetCurrent(A) and A was already current, do you want to update the last_became_current timestamp and sync with all the task queues + version workflows just so they update their timestamp? If so, then there is maybe some value in de-duping requests by ID, but I would say that it's reasonable to not update the last_became_current time on the second request, not sync a new time to the task queues / versions, and return early. In that case, I think you can get rid of request ID altogether.

Thanks for this thought provoking comment - I was struggling to come up with this thinking because the code was initially written by David in versioning-3.0 (I think) , hence the TODO comment - Since setCurrent as a whole is going to be idempotent, I don't think syncing it again makes a lot of sense since we would have already synced task-queues if A was already current earlier and any new task-queues added are synced when registered.

Hence, I think it's okay to get rid of this altogether :)

carlydf · 2025-01-24T03:56:07Z

service/frontend/workflow_handler.go

+			return &workflowservice.SetWorkerDeploymentCurrentVersionResponse{
+				PreviousVersion: request.Version.Value, // previousVersion is the version which was requested to be set as current
+			}, nil


I think it's cleaner to just not return any error if the requested Current is already Current. No need to handle the error here.

service/worker/workerdeployment/client.go

carlydf · 2025-01-24T04:37:17Z

service/worker/workerdeployment/workflow.go

+	// tell new current that it's current
+	if _, err := d.syncVersion(ctx, args.Version, versionUpdateTime); err != nil {
+		return nil, err
+	}
+
+	if prevCurrentVersion != "" {
+		// tell previous current that it's no longer current
+		if _, err := d.syncVersion(ctx, prevCurrentVersion, versionUpdateTime); err != nil {
+			return nil, err
+		}
+	}


I thought the plan was to sync to the task queues, then commit the new current to the deployment WF, then sync the new current to the version WFs.

It seems here that you're syncing the new current to the previous current and new current version WFs, and then having those workflows sync their new state to the task queues in VersionWorkflowRunner.handleSyncState.

I think the whole // sync to task queues section of handleSyncState should be moved to handleSetCurrent, let me know if you disagree and we should discuss! My thinking is that our focus should be on how to make SetCurrent return quickly and reliably while guaranteeing:

DescribeDeployment will not return the new Current until SetCurrent has completed and has consistency with all the TQs

SetCurrent will not return success until it has consistency with all the TQs (it might time out or return error but not success)

After the task queues and the Deployment WF have the new current, then we send updates to the version workflows to tell them too. I'm not super opinionated about whether those should complete before we return success to the SetCurrent caller, but I think it's fine to return success as soon as the new current is persisted in the TQs and in the Deployment WF, since those are the sources of truth for workflow task dispatch and DescribeDeployment.CurrentVersion

cc @ShahabT in case I have the wrong understanding here

thanks for starting this discussion!
a couple of things:

a) I think syncing to the task-queues should happen inside each version workflow themselves. We need these operations to be atomic in nature and since these version workflows own the task-queues, syncing to the user-data should happen inside of handleSyncState rather than inside handleSetCurrent.

The chain of flow right now, as implemented, is as follows:

Sync task-queues of the "to-be-current" version.

Update local state of the "to-be-current" version after syncing is complete.

Sync task-queues of the "previously-current" version.

Update local state of the "previously-current" version after syncing is successfully done.

Update local state of the worker deployment workflow.

b) Another point, and perhaps the more important one, is the ordering of this chain of flow. I am fully for the motion of returning from setCurrent after the TQ's are consistent. If I have understood your reasoning correctly, you reason for (2), (3), (4) to happen after (5) (in the above listed order). I thought about that and here are my concerns:

After (1) and (5), if we send async write operations to the version-workflow (in the form of signals) and return to the client, it could lead to a state inconsistency scenario - consider an example if another client were to send an update request which gets processed before the initial scheduled signals - this would be pretty bad since our version local state then would go out of sync and things would break

Thus, after thinking about this, I think we should keep all our updates in a "serial" fashion - Yes, it will be at more latency but we should prefer keeping things consistent as much as we can

Let me know if I have gotten this wrong and I am happy to be corrected!

I'm convinced that we should sync to the version workflows sequentially because it would be hard to get out of a bad state if someone sends two different SetCurrent requests back-to-back while we are still eventually making the version workflows consistent with the already-returned first request.

service/worker/workerdeployment/workflow.go

Shivs11 · 2025-01-24T19:03:08Z

service/matching/matching_engine.go

+				if old.GetRoutingUpdateTime().AsTime().After(vd.GetRoutingUpdateTime().AsTime()) {
 					return nil, false, errUserDataUnmodified
 				}
+				// only update if the timestamp is more recent
+				deploymentData.Versions[idx] = vd


changed this to stop lint complaints - also easier to read imo

service/worker/workerdeployment/client.go

carlydf · 2025-01-24T19:20:49Z

service/worker/workerdeployment/client.go

-		return nil, nil, serviceerror.NewAlreadyExist(fmt.Sprintf("Build ID %q is already current for %q",
-			version, deploymentName))
+		res.PreviousVersion = version
+		return &res, nil
 	} else if failure != nil {
 		// TODO: is there an easy way to recover the original type here?


if you know it's an application failure, you could do failure.GetApplicationFailureInfo.GetType() to get the type

but I don't think you need to!

## What changed?  - title ## Why?  - versioning-3.1 ## How did you test it?  - existing suite of tests + added functional tests ## Potential risks  - None, feature branch ## Documentation  ## Is hotfix candidate?  - No

Shivs11 added 16 commits January 20, 2025 17:48

registering version inside of worker-deployment

44782de

DescribeWorkerDeployment: working prototype

aabb482

Modified Query to send back state

cc2608e

helper to convert state->info

7b8e310

lint

7b6b55a

go-mod pointing to latest commit on api-go

51a2b14

fix unit-tests

7b52125

Safeguard against out-of-bounds error in functional tests

3b61692

fixing out-of-bounds error

d753696

fixing breaking sdk-tests

26d4dc5

setCurrent: working prototype

bb51139

Merge branch 'versioning-3.1' into shivam/versioning-3.1-set-current

ca711e9

current work

66756b1

working prototpye of set-current

305aea6

Merge branch 'versioning-3.1' into shivam/versioning-3.1-set-current

272ecc7

setCurrent: draft-1

e5963dd

Shivs11 commented Jan 23, 2025

View reviewed changes

Shivs11 marked this pull request as ready for review January 23, 2025 22:52

Shivs11 requested a review from a team as a code owner January 23, 2025 22:52

carlydf reviewed Jan 24, 2025

View reviewed changes

Shivs11 added 3 commits January 24, 2025 11:14

addressed some comments + lint fixes

3ec4829

go.mod -> latest@versioning3.1

e89ec23

addressed comments

d0b8daf

Shivs11 commented Jan 24, 2025

View reviewed changes

stop misc-checks from complaining

879b0e1

carlydf reviewed Jan 24, 2025

View reviewed changes

service/worker/workerdeployment/client.go Show resolved Hide resolved

carlydf reviewed Jan 24, 2025

View reviewed changes

carlydf approved these changes Jan 24, 2025

View reviewed changes

removed request_id from SetCurrentVersionArgs

c0f7244

Shivs11 merged commit 7ff7876 into versioning-3.1 Jan 24, 2025
48 checks passed

Shivs11 deleted the shivam/versioning-3.1-set-current-copy branch January 24, 2025 19:42

SetWorkerDeploymentCurrentVersion: Versioning-3.1 #7154

SetWorkerDeploymentCurrentVersion: Versioning-3.1 #7154

Uh oh!

Conversation

Shivs11 commented Jan 23, 2025

What changed?

Why?

How did you test it?

Potential risks

Documentation

Is hotfix candidate?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!