Skip to content

Conversation

slinkydeveloper
Copy link
Contributor

Followup of #2891, part of #2890.

This implements the InvocationEpoch mechanism, as described in #2890.

@slinkydeveloper slinkydeveloper force-pushed the invocation-epoch branch 2 times, most recently from 4b6b994 to f6df49b Compare April 8, 2025 08:57
@slinkydeveloper slinkydeveloper force-pushed the invocation-epoch branch 2 times, most recently from 7515ab7 to 52dfc8a Compare April 22, 2025 17:21
Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for introducing the invocation epoch for enabling trimming and restarting of invocations @slinkydeveloper. The general idea makes sense to me. I think in the implementation there might be a problem wrt to handling errors in the invoker. Moreover, what's not fully clear to me is how the invocation epoch will work during a rolling upgrade where one node is running on an older version and does not know about the invocation epoch and the caller is using it. Would this pose a problem if the response is sent back to the caller with 0 as the invocation epoch and then potentially being ignored?

@slinkydeveloper
Copy link
Contributor Author

Moreover, what's not fully clear to me is how the invocation epoch will work during a rolling upgrade where one node is running on an older version and does not know about the invocation epoch and the caller is using it. Would this pose a problem if the response is sent back to the caller with 0 as the invocation epoch and then potentially being ignored?

It simply doesn't work. You can't use the new trim and restart feature unless you're 100% sure you're using the new restate version, that's the constraint.

@tillrohrmann
Copy link
Contributor

It simply doesn't work. You can't use the new trim and restart feature unless you're 100% sure you're using the new restate version, that's the constraint.

So should this feature by default disabled with the next version unless users explicitly enable it in the configuration which states that they should only use it if all their nodes have upgraded to the next version and they forfeit being able to go back to an earlier version?

@slinkydeveloper
Copy link
Contributor Author

So should this feature by default disabled with the next version unless users explicitly enable it in the configuration which states that they should only use it if all their nodes have upgraded to the next version and they forfeit being able to go back to an earlier version?

I guess we can, but we can also defer this decision to a later point IMO.

@tillrohrmann
Copy link
Contributor

I guess we can, but we can also defer this decision to a later point IMO.

If so, then please create a release blocker issue for the next release. I would, however, advise against postponing making a decision for too long because it will otherwise bite us when we have/want to create a quick release.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the PR @slinkydeveloper. I unresolved a few of my previous comments because it was not clear to me whether you rejected them or forgot to address them.

My main question is how things are supposed to work with Awakeables. It looks as if awakeables will always have the invocation epoch 0. However, once I time travel, I will be in an invocation epoch > 0. If this happens, then it seems that awakaeble completions are no longer accepted. Is this correct or did I overlook something?

I also wanted to ask what's the plan with how to enable/disable this feature to prevent users from shooting themselves in the foot if the time travel during a rolling upgrade. Are you gonna address in the PR where you introduce the time travel feature?

@@ -603,6 +614,7 @@ message OutboxMessage {
message OutboxServiceInvocationResponse {
InvocationId invocation_id = 1;
uint32 entry_index = 2;
uint32 caller_invocation_epoch = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to sort them in a way that everyone can follow. Your logical sorting might be different than mine.

@slinkydeveloper
Copy link
Contributor Author

slinkydeveloper commented Apr 28, 2025

I also wanted to ask what's the plan with how to enable/disable this feature to prevent users from shooting themselves in the foot if the time travel during a rolling upgrade. Are you gonna address in the PR where you introduce the time travel feature?

I'm not sure yet. For the time being let's move on this PR, i'm gonna check with @igalshilman about this. This intersects with how we wanna present it in the UI too (it would be bad if we show in the UI the button but then it fails because some config option is not enabled).

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bearing with me and addressing my comments @slinkydeveloper. The changes look good to me :-) +1 for merging.

@@ -603,6 +614,7 @@ message OutboxMessage {
message OutboxServiceInvocationResponse {
InvocationId invocation_id = 1;
uint32 entry_index = 2;
uint32 caller_invocation_epoch = 4;
Copy link
Contributor

@tillrohrmann tillrohrmann Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: To make it explicit why I believe this ordering is sub-optimal is because now users need to scan the whole list of fields to figure out what's the current max tag number when adding a new field.

@slinkydeveloper slinkydeveloper force-pushed the invocation-epoch branch 2 times, most recently from 36f7e11 to 2c31c3a Compare April 28, 2025 14:47
* Add current_invocation_epoch and completion_range_epoch_map, we need this state to support the algorithm described here restatedev#2890 (comment) . This two fields will be updated whenever a new trim happens.
* Add propagation of invocation_epoch in all the relevant places. This also includes a little reshape of the InvocationResponse. To support serde front-back compat, we use a proxy struct as done for ServiceInvocation. Also includes a lil test for that.
* Added the new fields support to partition-store
* Add InvocationEpoch awareness tests.
* Implement Invocation Epoch awareness and the fencing off algorithm. The core of the change is in InvocationStatusExt::should_accept_completion, everything else is wiring up and propagation of the InvocationEpoch
* Make Invoker InvocationEpoch aware:
  * Both events from PP and from internal tasks now require the InvocationEpoch. All events except Invoke and Abort will be ignored when epoch doesn't match
  * Handle the cases where Abort/Invoke are out of order (as described here restatedev#2890 (comment)). Added few tests for that.
Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1 for merging :-)

Some(f(tx, ism))
} else {
// If no state machine
trace!("No state machine found for selected server header");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does server header relates to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the header that tells us which SDK the user is using.

assert!(!is.should_accept_completion(0, 2));
// In all the other cases, I accept it.
assert!(is.should_accept_completion(0, 1));
assert!(is.should_accept_completion(1, 1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I don't fully understand is how completion id 1 will get the invocation epoch 1 given that the trim point is at completion id 2. But no need to change anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this specific case cannot happen.

@slinkydeveloper slinkydeveloper merged commit f6cbbb2 into restatedev:main Apr 29, 2025
27 checks passed
@slinkydeveloper slinkydeveloper deleted the invocation-epoch branch April 29, 2025 16:20
@github-actions github-actions bot locked and limited conversation to collaborators Apr 29, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants