-
Notifications
You must be signed in to change notification settings - Fork 94
Invocation epoch #2904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Invocation epoch #2904
Conversation
e02a3dd
to
c46d1b9
Compare
4b6b994
to
f6df49b
Compare
7515ab7
to
52dfc8a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for introducing the invocation epoch for enabling trimming and restarting of invocations @slinkydeveloper. The general idea makes sense to me. I think in the implementation there might be a problem wrt to handling errors in the invoker. Moreover, what's not fully clear to me is how the invocation epoch will work during a rolling upgrade where one node is running on an older version and does not know about the invocation epoch and the caller is using it. Would this pose a problem if the response is sent back to the caller with 0
as the invocation epoch and then potentially being ignored?
crates/worker/src/partition/state_machine/entries/attach_invocation_command.rs
Outdated
Show resolved
Hide resolved
crates/worker/src/partition/state_machine/entries/get_invocation_output_command.rs
Outdated
Show resolved
Hide resolved
crates/worker/src/partition/state_machine/invocation_status_ext.rs
Outdated
Show resolved
Hide resolved
crates/worker/src/partition/state_machine/lifecycle/notify_invocation_response.rs
Show resolved
Hide resolved
crates/worker/src/partition/state_machine/lifecycle/notify_sleep_completion.rs
Show resolved
Hide resolved
It simply doesn't work. You can't use the new trim and restart feature unless you're 100% sure you're using the new restate version, that's the constraint. |
So should this feature by default disabled with the next version unless users explicitly enable it in the configuration which states that they should only use it if all their nodes have upgraded to the next version and they forfeit being able to go back to an earlier version? |
I guess we can, but we can also defer this decision to a later point IMO. |
If so, then please create a release blocker issue for the next release. I would, however, advise against postponing making a decision for too long because it will otherwise bite us when we have/want to create a quick release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the PR @slinkydeveloper. I unresolved a few of my previous comments because it was not clear to me whether you rejected them or forgot to address them.
My main question is how things are supposed to work with Awakeables. It looks as if awakeables will always have the invocation epoch 0. However, once I time travel, I will be in an invocation epoch > 0. If this happens, then it seems that awakaeble completions are no longer accepted. Is this correct or did I overlook something?
I also wanted to ask what's the plan with how to enable/disable this feature to prevent users from shooting themselves in the foot if the time travel during a rolling upgrade. Are you gonna address in the PR where you introduce the time travel feature?
crates/worker/src/partition/state_machine/invocation_status_ext.rs
Outdated
Show resolved
Hide resolved
@@ -603,6 +614,7 @@ message OutboxMessage { | |||
message OutboxServiceInvocationResponse { | |||
InvocationId invocation_id = 1; | |||
uint32 entry_index = 2; | |||
uint32 caller_invocation_epoch = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to sort them in a way that everyone can follow. Your logical sorting might be different than mine.
crates/worker/src/partition/state_machine/invocation_status_ext.rs
Outdated
Show resolved
Hide resolved
crates/worker/src/partition/state_machine/lifecycle/notify_invocation_response.rs
Show resolved
Hide resolved
e047d8b
to
e47f9e1
Compare
I'm not sure yet. For the time being let's move on this PR, i'm gonna check with @igalshilman about this. This intersects with how we wanna present it in the UI too (it would be bad if we show in the UI the button but then it fails because some config option is not enabled). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bearing with me and addressing my comments @slinkydeveloper. The changes look good to me :-) +1 for merging.
crates/worker/src/partition/state_machine/lifecycle/notify_invocation_response.rs
Show resolved
Hide resolved
crates/worker/src/partition/state_machine/invocation_status_ext.rs
Outdated
Show resolved
Hide resolved
@@ -603,6 +614,7 @@ message OutboxMessage { | |||
message OutboxServiceInvocationResponse { | |||
InvocationId invocation_id = 1; | |||
uint32 entry_index = 2; | |||
uint32 caller_invocation_epoch = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: To make it explicit why I believe this ordering is sub-optimal is because now users need to scan the whole list of fields to figure out what's the current max tag number when adding a new field.
36f7e11
to
2c31c3a
Compare
2c31c3a
to
4e70add
Compare
* Add current_invocation_epoch and completion_range_epoch_map, we need this state to support the algorithm described here restatedev#2890 (comment) . This two fields will be updated whenever a new trim happens. * Add propagation of invocation_epoch in all the relevant places. This also includes a little reshape of the InvocationResponse. To support serde front-back compat, we use a proxy struct as done for ServiceInvocation. Also includes a lil test for that. * Added the new fields support to partition-store * Add InvocationEpoch awareness tests. * Implement Invocation Epoch awareness and the fencing off algorithm. The core of the change is in InvocationStatusExt::should_accept_completion, everything else is wiring up and propagation of the InvocationEpoch * Make Invoker InvocationEpoch aware: * Both events from PP and from internal tasks now require the InvocationEpoch. All events except Invoke and Abort will be ignored when epoch doesn't match * Handle the cases where Abort/Invoke are out of order (as described here restatedev#2890 (comment)). Added few tests for that.
4e70add
to
75bb667
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. +1 for merging :-)
Some(f(tx, ism)) | ||
} else { | ||
// If no state machine | ||
trace!("No state machine found for selected server header"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does server header relates to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the header that tells us which SDK the user is using.
assert!(!is.should_accept_completion(0, 2)); | ||
// In all the other cases, I accept it. | ||
assert!(is.should_accept_completion(0, 1)); | ||
assert!(is.should_accept_completion(1, 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I don't fully understand is how completion id 1 will get the invocation epoch 1 given that the trim point is at completion id 2. But no need to change anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed this specific case cannot happen.
Followup of #2891, part of #2890.
This implements the
InvocationEpoch
mechanism, as described in #2890.