-
Notifications
You must be signed in to change notification settings - Fork 1.1k
SetWorkerDeploymentCurrentVersion: Versioning-3.1 #7154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SetWorkerDeploymentCurrentVersion: Versioning-3.1 #7154
Conversation
go.mod
Outdated
@@ -7,7 +7,7 @@ retract ( | |||
v1.26.0 // Published accidentally. | |||
) | |||
|
|||
replace go.temporal.io/api => go.temporal.io/api v1.43.2-0.20250121234928-f4e3b077daa6 | |||
replace go.temporal.io/api => go.temporal.io/api v1.43.2-0.20250123223624-792f7c873972 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pointing to a feature on api-go: ss-routing-info-in-worker-deployment
Corresponding API PR: temporalio/api#527
service/frontend/workflow_handler.go
Outdated
return &workflowservice.SetWorkerDeploymentCurrentVersionResponse{ | ||
PreviousVersion: request.Version.Value, // previousVersion is the version which was requested to be set as current | ||
}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idempotency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's cleaner to just not return any error if the requested Current is already Current. No need to handle the error here.
@@ -268,72 +267,14 @@ func (d *ClientImpl) DescribeWorkerDeployment( | |||
return d.deploymentStateToDeploymentInfo(ctx, namespaceEntry, deploymentName, queryResponse.State) | |||
} | |||
|
|||
func (d *ClientImpl) GetCurrentVersion( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was in two minds about this - should we keep this method purely for testing purposes and to verify if local data changes all the way to the version workflow later were indeed happening? Removed it for now since it's not an API exposed on the frontend
Operation: &matchingservice.SyncDeploymentUserDataRequest_UpdateVersionData{ | ||
UpdateVersionData: syncData.Data, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I have not added code when we shall remove a version from the task-queue - will be coming in later
proto/internal/temporal/server/api/persistence/v1/task_queues.proto
Outdated
Show resolved
Hide resolved
service/frontend/workflow_handler.go
Outdated
return nil, err | ||
} | ||
|
||
// TODO (Shivam) : Verify if sending in uuid.New() is correct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether uuid.New() is the correct thing to pass in for requestID
depends on the purpose of requestID
. Since you're generating the UUID instead of getting it in the user's request, if the user's first request times out / fails and they retry, the request id would be different for their second request.
Looking at handleSetCurrent
(which is where the request ID ends up (via SetCurrentVersionArgs
) it looks like you're not currently using the request ID for any de-duping.
It's possible that you don't need to de-dupe at all, since SetCurrent
as a whole is idempotent. If someone calls SetCurrent(A)
and A
was already current, do you want to update the last_became_current
timestamp and sync with all the task queues + version workflows just so they update their timestamp? If so, then there is maybe some value in de-duping requests by ID, but I would say that it's reasonable to not update the last_became_current
time on the second request, not sync a new time to the task queues / versions, and return early. In that case, I think you can get rid of request ID altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this thought provoking comment - I was struggling to come up with this thinking because the code was initially written by David in versioning-3.0 (I think) , hence the TODO comment - Since setCurrent
as a whole is going to be idempotent, I don't think syncing it again makes a lot of sense since we would have already synced task-queues if A was already current earlier and any new task-queues added are synced when registered.
Hence, I think it's okay to get rid of this altogether :)
service/frontend/workflow_handler.go
Outdated
return &workflowservice.SetWorkerDeploymentCurrentVersionResponse{ | ||
PreviousVersion: request.Version.Value, // previousVersion is the version which was requested to be set as current | ||
}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's cleaner to just not return any error if the requested Current is already Current. No need to handle the error here.
// tell new current that it's current | ||
if _, err := d.syncVersion(ctx, args.Version, versionUpdateTime); err != nil { | ||
return nil, err | ||
} | ||
|
||
if prevCurrentVersion != "" { | ||
// tell previous current that it's no longer current | ||
if _, err := d.syncVersion(ctx, prevCurrentVersion, versionUpdateTime); err != nil { | ||
return nil, err | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the plan was to sync to the task queues, then commit the new current to the deployment WF, then sync the new current to the version WFs.
It seems here that you're syncing the new current to the previous current and new current version WFs, and then having those workflows sync their new state to the task queues in VersionWorkflowRunner.handleSyncState
.
I think the whole // sync to task queues
section of handleSyncState
should be moved to handleSetCurrent
, let me know if you disagree and we should discuss! My thinking is that our focus should be on how to make SetCurrent
return quickly and reliably while guaranteeing:
- DescribeDeployment will not return the new Current until SetCurrent has completed and has consistency with all the TQs
- SetCurrent will not return success until it has consistency with all the TQs (it might time out or return error but not success)
After the task queues and the Deployment WF have the new current, then we send updates to the version workflows to tell them too. I'm not super opinionated about whether those should complete before we return success to the SetCurrent caller, but I think it's fine to return success as soon as the new current is persisted in the TQs and in the Deployment WF, since those are the sources of truth for workflow task dispatch and DescribeDeployment.CurrentVersion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @ShahabT in case I have the wrong understanding here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for starting this discussion!
a couple of things:
a) I think syncing to the task-queues should happen inside each version workflow themselves. We need these operations to be atomic in nature and since these version workflows own the task-queues, syncing to the user-data should happen inside of handleSyncState
rather than inside handleSetCurrent
.
The chain of flow right now, as implemented, is as follows:
- Sync task-queues of the "to-be-current" version.
- Update local state of the "to-be-current" version after syncing is complete.
- Sync task-queues of the "previously-current" version.
- Update local state of the "previously-current" version after syncing is successfully done.
- Update local state of the
worker deployment
workflow.
b) Another point, and perhaps the more important one, is the ordering of this chain of flow. I am fully for the motion of returning from setCurrent after the TQ's are consistent. If I have understood your reasoning correctly, you reason for (2), (3), (4) to happen after (5) (in the above listed order). I thought about that and here are my concerns:
After (1) and (5), if we send async write operations to the version-workflow (in the form of signals) and return to the client, it could lead to a state inconsistency scenario - consider an example if another client were to send an update request which gets processed before the initial scheduled signals - this would be pretty bad since our version local state then would go out of sync and things would break
Thus, after thinking about this, I think we should keep all our updates in a "serial" fashion - Yes, it will be at more latency but we should prefer keeping things consistent as much as we can
Let me know if I have gotten this wrong and I am happy to be corrected!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm convinced that we should sync to the version workflows sequentially because it would be hard to get out of a bad state if someone sends two different SetCurrent requests back-to-back while we are still eventually making the version workflows consistent with the already-returned first request.
if old.GetRoutingUpdateTime().AsTime().After(vd.GetRoutingUpdateTime().AsTime()) { | ||
return nil, false, errUserDataUnmodified | ||
} | ||
// only update if the timestamp is more recent | ||
deploymentData.Versions[idx] = vd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed this to stop lint complaints - also easier to read imo
return nil, nil, serviceerror.NewAlreadyExist(fmt.Sprintf("Build ID %q is already current for %q", | ||
version, deploymentName)) | ||
res.PreviousVersion = version | ||
return &res, nil | ||
} else if failure != nil { | ||
// TODO: is there an easy way to recover the original type here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you know it's an application failure, you could do failure.GetApplicationFailureInfo.GetType()
to get the type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I don't think you need to!
## What changed? <!-- Describe what has changed in this PR --> - title ## Why? <!-- Tell your future self why have you made these changes --> - versioning-3.1 ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> - existing suite of tests + added functional tests ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> - None, feature branch ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) --> - No
## What changed? <!-- Describe what has changed in this PR --> - title ## Why? <!-- Tell your future self why have you made these changes --> - versioning-3.1 ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> - existing suite of tests + added functional tests ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> - None, feature branch ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) --> - No
## What changed? <!-- Describe what has changed in this PR --> - title ## Why? <!-- Tell your future self why have you made these changes --> - versioning-3.1 ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> - existing suite of tests + added functional tests ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> - None, feature branch ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) --> - No
## What changed? <!-- Describe what has changed in this PR --> - title ## Why? <!-- Tell your future self why have you made these changes --> - versioning-3.1 ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> - existing suite of tests + added functional tests ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> - None, feature branch ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) --> - No
## What changed? <!-- Describe what has changed in this PR --> - title ## Why? <!-- Tell your future self why have you made these changes --> - versioning-3.1 ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> - existing suite of tests + added functional tests ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> - None, feature branch ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) --> - No
## What changed? <!-- Describe what has changed in this PR --> - title ## Why? <!-- Tell your future self why have you made these changes --> - versioning-3.1 ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> - existing suite of tests + added functional tests ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> - None, feature branch ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) --> - No
## What changed? <!-- Describe what has changed in this PR --> - title ## Why? <!-- Tell your future self why have you made these changes --> - versioning-3.1 ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> - existing suite of tests + added functional tests ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> - None, feature branch ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) --> - No
## What changed? <!-- Describe what has changed in this PR --> - title ## Why? <!-- Tell your future self why have you made these changes --> - versioning-3.1 ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> - existing suite of tests + added functional tests ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> - None, feature branch ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) --> - No
What changed?
Why?
How did you test it?
Potential risks
Documentation
Is hotfix candidate?