Skip to content

Conversation

Shivs11
Copy link
Member

@Shivs11 Shivs11 commented May 6, 2025

What changed?

  • Worker Deployment entity workflow to now CAN after every pending update and signal has been processed.

Why?

  • So that one can avoid NDE errors when server rollbacks to the previous version. This is because CAN'ing effectively reduces the history size as well as the probability of an NDE error!

How did you test it?

  • Current suite of tests are passing locally.
  • Updated unit tests as well.

Potential risks

  • NOTE: This is a breaking change and customers currently using our entity workflows will be affected with potential NDE's.

Documentation

Is hotfix candidate?

@@ -1603,7 +1603,7 @@ func (s *WorkerDeploymentSuite) verifyDescribeWorkerDeployment(
expectedResp *workflowservice.DescribeWorkerDeploymentResponse,
) {
// relaxed timestamp constraint since the tests make sync calls, which could theoretically take seconds.
maxDurationBetweenTimeStamps := 2 * time.Second
maxDurationBetweenTimeStamps := 5 * time.Second
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relaxing the timestamp checks a little more since now, after every update, we CAN and this is taking about a second or so.

I intend on shipping out a change very soon that shall un-revert this change and pass in stricter expected timestamps from the tests itself. Right now, in most tests, the expected value after an operation is conducted is passed as time.Now() which is not accurate.

@Shivs11 Shivs11 changed the title Ss/versioning auto restart pt1 versioning entity workflows: enabling auto-restart pt1 May 6, 2025
Comment on lines 918 to 926
func (d *WorkflowRunner) setStateChanged(ctx workflow.Context) {
if !d.stateChanged.value {
// Send the value on the channel without blocking.
workflow.Go(ctx, func(ctx workflow.Context) {
d.stateChanged.value = true
d.stateChanged.ch.Send(ctx, nil)
})
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worried our entity workflow could get stuck if I have not implemented the Send on channels correctly. The main requirement is to not block for the sends that happen after a state has changed. This is why I have implemented them to be in their own go-routine.

My initial attempt was to not have this snippet of code in a go-routine, but I soon realized that this was a blocking call and was causing timeouts to happen in most tests. I think this works, but feel free to correct me if I am wrong.

cc - @dnr

@Shivs11 Shivs11 marked this pull request as ready for review May 7, 2025 17:37
@Shivs11 Shivs11 requested a review from a team as a code owner May 7, 2025 17:37
c.Receive(ctx, nil)
forceCAN = true
d.forceCAN = true
d.signalHandler.processingSignals--
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's put these lines in defer in case some one later returns from the middle.

@ShahabT ShahabT merged commit 0f68f3f into main May 9, 2025
53 checks passed
@ShahabT ShahabT deleted the ss/versioning-auto-restart-pt1 branch May 9, 2025 17:32
josesa added a commit to josesa/temporal that referenced this pull request May 12, 2025
* main: (22 commits)
  Add host health metrics gauge (temporalio#7728)
  add rule expiration check (temporalio#7749)
  Add activity options to the pending activity info (temporalio#7727)
  Enable DLQ V2 for replication (temporalio#7746)
  chore: be smarter about when to use Stringer vs String (temporalio#7743)
  versioning entity workflows: enabling auto-restart pt1 (temporalio#7715)
  Refactor code generators (temporalio#7734)
  allow passive to generate replication tasks (temporalio#7713)
  Validate links in completion callbacks (temporalio#7726)
  CHASM: Engine Update/ReadComponent implementation (temporalio#7696)
  Enable transition history in dev env and tests (temporalio#7737)
  chore: Add Stringer tags (temporalio#7738)
  Add internal pod health check to DeepHealthCheck (temporalio#7709)
  Rename internal CHASM task processing interface (temporalio#7730)
  [Frontend] Log slow gRPC requests (temporalio#7718)
  Remove cap for dynamic config callback pool (temporalio#7723)
  Refactor updateworkflowoptions package (temporalio#7725)
  Remove a bunch of redundant utf-8 validation (temporalio#7720)
  [CHASM] Pure task processing - GetPureTasks, ExecutePureTasks (temporalio#7701)
  Send ActivityReset flag to the worker in heartbeat response (temporalio#7677)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants