Skip to content

Conversation

Shivs11
Copy link
Member

@Shivs11 Shivs11 commented Jan 23, 2025

What changed?

  • title

Why?

  • versioning-3.1

How did you test it?

  • existing suite of tests + added functional tests

Potential risks

  • None, feature branch

Documentation

Is hotfix candidate?

  • No

go.mod Outdated
@@ -7,7 +7,7 @@ retract (
v1.26.0 // Published accidentally.
)

replace go.temporal.io/api => go.temporal.io/api v1.43.2-0.20250121234928-f4e3b077daa6
replace go.temporal.io/api => go.temporal.io/api v1.43.2-0.20250123223624-792f7c873972
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pointing to a feature on api-go: ss-routing-info-in-worker-deployment
Corresponding API PR: temporalio/api#527

Comment on lines 3373 to 3375
return &workflowservice.SetWorkerDeploymentCurrentVersionResponse{
PreviousVersion: request.Version.Value, // previousVersion is the version which was requested to be set as current
}, nil
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idempotency

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's cleaner to just not return any error if the requested Current is already Current. No need to handle the error here.

@@ -268,72 +267,14 @@ func (d *ClientImpl) DescribeWorkerDeployment(
return d.deploymentStateToDeploymentInfo(ctx, namespaceEntry, deploymentName, queryResponse.State)
}

func (d *ClientImpl) GetCurrentVersion(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was in two minds about this - should we keep this method purely for testing purposes and to verify if local data changes all the way to the version workflow later were indeed happening? Removed it for now since it's not an API exposed on the frontend

Comment on lines +79 to +81
Operation: &matchingservice.SyncDeploymentUserDataRequest_UpdateVersionData{
UpdateVersionData: syncData.Data,
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I have not added code when we shall remove a version from the task-queue - will be coming in later

@Shivs11 Shivs11 marked this pull request as ready for review January 23, 2025 22:52
@Shivs11 Shivs11 requested a review from a team as a code owner January 23, 2025 22:52
return nil, err
}

// TODO (Shivam) : Verify if sending in uuid.New() is correct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether uuid.New() is the correct thing to pass in for requestID depends on the purpose of requestID. Since you're generating the UUID instead of getting it in the user's request, if the user's first request times out / fails and they retry, the request id would be different for their second request.

Looking at handleSetCurrent (which is where the request ID ends up (via SetCurrentVersionArgs) it looks like you're not currently using the request ID for any de-duping.

It's possible that you don't need to de-dupe at all, since SetCurrent as a whole is idempotent. If someone calls SetCurrent(A) and A was already current, do you want to update the last_became_current timestamp and sync with all the task queues + version workflows just so they update their timestamp? If so, then there is maybe some value in de-duping requests by ID, but I would say that it's reasonable to not update the last_became_current time on the second request, not sync a new time to the task queues / versions, and return early. In that case, I think you can get rid of request ID altogether.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this thought provoking comment - I was struggling to come up with this thinking because the code was initially written by David in versioning-3.0 (I think) , hence the TODO comment - Since setCurrent as a whole is going to be idempotent, I don't think syncing it again makes a lot of sense since we would have already synced task-queues if A was already current earlier and any new task-queues added are synced when registered.

Hence, I think it's okay to get rid of this altogether :)

Comment on lines 3373 to 3375
return &workflowservice.SetWorkerDeploymentCurrentVersionResponse{
PreviousVersion: request.Version.Value, // previousVersion is the version which was requested to be set as current
}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's cleaner to just not return any error if the requested Current is already Current. No need to handle the error here.

Comment on lines +141 to +151
// tell new current that it's current
if _, err := d.syncVersion(ctx, args.Version, versionUpdateTime); err != nil {
return nil, err
}

if prevCurrentVersion != "" {
// tell previous current that it's no longer current
if _, err := d.syncVersion(ctx, prevCurrentVersion, versionUpdateTime); err != nil {
return nil, err
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the plan was to sync to the task queues, then commit the new current to the deployment WF, then sync the new current to the version WFs.

It seems here that you're syncing the new current to the previous current and new current version WFs, and then having those workflows sync their new state to the task queues in VersionWorkflowRunner.handleSyncState.

I think the whole // sync to task queues section of handleSyncState should be moved to handleSetCurrent, let me know if you disagree and we should discuss! My thinking is that our focus should be on how to make SetCurrent return quickly and reliably while guaranteeing:

  • DescribeDeployment will not return the new Current until SetCurrent has completed and has consistency with all the TQs
  • SetCurrent will not return success until it has consistency with all the TQs (it might time out or return error but not success)

After the task queues and the Deployment WF have the new current, then we send updates to the version workflows to tell them too. I'm not super opinionated about whether those should complete before we return success to the SetCurrent caller, but I think it's fine to return success as soon as the new current is persisted in the TQs and in the Deployment WF, since those are the sources of truth for workflow task dispatch and DescribeDeployment.CurrentVersion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @ShahabT in case I have the wrong understanding here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for starting this discussion!
a couple of things:

a) I think syncing to the task-queues should happen inside each version workflow themselves. We need these operations to be atomic in nature and since these version workflows own the task-queues, syncing to the user-data should happen inside of handleSyncState rather than inside handleSetCurrent.

The chain of flow right now, as implemented, is as follows:

  1. Sync task-queues of the "to-be-current" version.
  2. Update local state of the "to-be-current" version after syncing is complete.
  3. Sync task-queues of the "previously-current" version.
  4. Update local state of the "previously-current" version after syncing is successfully done.
  5. Update local state of the worker deployment workflow.

b) Another point, and perhaps the more important one, is the ordering of this chain of flow. I am fully for the motion of returning from setCurrent after the TQ's are consistent. If I have understood your reasoning correctly, you reason for (2), (3), (4) to happen after (5) (in the above listed order). I thought about that and here are my concerns:

After (1) and (5), if we send async write operations to the version-workflow (in the form of signals) and return to the client, it could lead to a state inconsistency scenario - consider an example if another client were to send an update request which gets processed before the initial scheduled signals - this would be pretty bad since our version local state then would go out of sync and things would break

Thus, after thinking about this, I think we should keep all our updates in a "serial" fashion - Yes, it will be at more latency but we should prefer keeping things consistent as much as we can

Let me know if I have gotten this wrong and I am happy to be corrected!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm convinced that we should sync to the version workflows sequentially because it would be hard to get out of a bad state if someone sends two different SetCurrent requests back-to-back while we are still eventually making the version workflows consistent with the already-returned first request.

Comment on lines +1679 to +1683
if old.GetRoutingUpdateTime().AsTime().After(vd.GetRoutingUpdateTime().AsTime()) {
return nil, false, errUserDataUnmodified
}
// only update if the timestamp is more recent
deploymentData.Versions[idx] = vd
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed this to stop lint complaints - also easier to read imo

return nil, nil, serviceerror.NewAlreadyExist(fmt.Sprintf("Build ID %q is already current for %q",
version, deploymentName))
res.PreviousVersion = version
return &res, nil
} else if failure != nil {
// TODO: is there an easy way to recover the original type here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you know it's an application failure, you could do failure.GetApplicationFailureInfo.GetType() to get the type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I don't think you need to!

@Shivs11 Shivs11 merged commit 7ff7876 into versioning-3.1 Jan 24, 2025
48 checks passed
@Shivs11 Shivs11 deleted the shivam/versioning-3.1-set-current-copy branch January 24, 2025 19:42
ShahabT pushed a commit that referenced this pull request Feb 4, 2025
## What changed?
<!-- Describe what has changed in this PR -->
- title

## Why?
<!-- Tell your future self why have you made these changes -->
- versioning-3.1

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- existing suite of tests + added functional tests

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
- None, feature branch

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
- No
ShahabT pushed a commit that referenced this pull request Feb 4, 2025
## What changed?
<!-- Describe what has changed in this PR -->
- title

## Why?
<!-- Tell your future self why have you made these changes -->
- versioning-3.1

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- existing suite of tests + added functional tests

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
- None, feature branch

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
- No
ShahabT pushed a commit that referenced this pull request Feb 5, 2025
## What changed?
<!-- Describe what has changed in this PR -->
- title

## Why?
<!-- Tell your future self why have you made these changes -->
- versioning-3.1

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- existing suite of tests + added functional tests

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
- None, feature branch

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
- No
ShahabT pushed a commit that referenced this pull request Feb 5, 2025
## What changed?
<!-- Describe what has changed in this PR -->
- title

## Why?
<!-- Tell your future self why have you made these changes -->
- versioning-3.1

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- existing suite of tests + added functional tests

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
- None, feature branch

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
- No
ShahabT pushed a commit that referenced this pull request Feb 6, 2025
## What changed?
<!-- Describe what has changed in this PR -->
- title

## Why?
<!-- Tell your future self why have you made these changes -->
- versioning-3.1

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- existing suite of tests + added functional tests

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
- None, feature branch

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
- No
ShahabT pushed a commit that referenced this pull request Feb 6, 2025
## What changed?
<!-- Describe what has changed in this PR -->
- title

## Why?
<!-- Tell your future self why have you made these changes -->
- versioning-3.1

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- existing suite of tests + added functional tests

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
- None, feature branch

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
- No
ShahabT pushed a commit that referenced this pull request Feb 6, 2025
## What changed?
<!-- Describe what has changed in this PR -->
- title

## Why?
<!-- Tell your future self why have you made these changes -->
- versioning-3.1

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- existing suite of tests + added functional tests

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
- None, feature branch

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
- No
ShahabT pushed a commit that referenced this pull request Feb 6, 2025
## What changed?
<!-- Describe what has changed in this PR -->
- title

## Why?
<!-- Tell your future self why have you made these changes -->
- versioning-3.1

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- existing suite of tests + added functional tests

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
- None, feature branch

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
- No
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants