[Core] Read-modify-write for global metadata #3164
Conversation
Force-pushed from cbba7d3 to 6907bb8.
Test Results: 7 files (+3), 7 suites (+3), 4m 58s ⏱️ (+3m 56s). Results for commit a932e8a; comparison against base commit d100b46. This pull request removes 16 tests and adds 54. Note that renamed tests count towards both.

♻️ This comment has been updated with latest results.
Force-pushed from 3332a02 to ed30410.
Thanks for adding `read_modify_write` to the `MetadataClientWrapper` @AhmedSoliman. The changes look good to me. I think it's a good improvement to also include other sources (nodes) to get the latest metadata from.

Reasoning about the retry behavior of the `read_modify_write` call has gotten a little harder because of the added layer of indirection when fetching updates. I guess this is hard to avoid, and it wasn't simple before either.

The one question I had was whether errors that happen in the interaction between the metadata store client and the store are retried indefinitely. If so, this slightly changes the semantics of some of the call sites (the admin API, cluster configuration updates, and the bifrost watchdog). Should we add higher-level timeouts to the affected call sites?
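For context, the pattern under discussion is a compare-and-set style read-modify-write loop. A minimal, self-contained sketch of that shape follows; all names below (`VersionedStore`, `put_if_version`, the string values) are illustrative stand-ins, not restate's actual API:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Illustrative in-memory store holding (value, version) per key.
struct VersionedStore {
    inner: Mutex<HashMap<String, (String, u64)>>,
}

#[derive(Debug)]
enum PutError {
    /// Another writer won the race; the caller should re-read and retry.
    VersionMismatch,
}

impl VersionedStore {
    fn get(&self, key: &str) -> Option<(String, u64)> {
        self.inner.lock().unwrap().get(key).cloned()
    }

    /// Write only if the stored version still matches `expected`.
    fn put_if_version(&self, key: &str, value: String, expected: u64) -> Result<(), PutError> {
        let mut map = self.inner.lock().unwrap();
        let entry = map.entry(key.to_string()).or_insert((String::new(), 0));
        if entry.1 != expected {
            return Err(PutError::VersionMismatch);
        }
        *entry = (value, expected + 1);
        Ok(())
    }
}

/// The read-modify-write loop. Note that this sketch retries only on
/// version conflicts; the question above is about how *store errors*
/// (network failures, etc.) are retried in the real implementation.
fn read_modify_write(
    store: &VersionedStore,
    key: &str,
    modify: impl Fn(Option<&str>) -> String,
) -> String {
    loop {
        let (current, version) = store
            .get(key)
            .map(|(value, version)| (Some(value), version))
            .unwrap_or((None, 0));
        let updated = modify(current.as_deref());
        match store.put_if_version(key, updated.clone(), version) {
            Ok(()) => return updated,
            Err(PutError::VersionMismatch) => continue, // raced; re-read and retry
        }
    }
}

fn main() {
    let store = VersionedStore { inner: Mutex::new(HashMap::new()) };
    let committed = read_modify_write(&store, "cluster-config", |current| {
        format!("{}+update", current.unwrap_or("v0"))
    });
    println!("committed: {committed}");
}
```

In this shape, version-conflict retries are naturally bounded by contention, while store-error retries, if looped over the whole operation, are exactly where an unbounded policy could stall a caller indefinitely.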
```rust
pub trait Extraction: GlobalMetadata {
    type Output;
```
Is there a need for extracting a different `Output` than `Self`?
Not really, but that's the only trick I could find to get the dynamic extraction to work.
So it wouldn't work to change it to `fn extract_as_global_metadata(v: MetadataContainer) -> Option<Arc<Self>>;` and remove the associated type?
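For reference, the two shapes being compared look roughly like this (a sketch with stand-in types; the `extract` method name and the single-variant `MetadataContainer` are my own illustrations, only the `extract_as_global_metadata` signature is quoted from the comment above):

```rust
use std::sync::Arc;

// Stand-ins for the real types.
struct NodesConfiguration;
trait GlobalMetadata {}
impl GlobalMetadata for NodesConfiguration {}

enum MetadataContainer {
    NodesConfiguration(Arc<NodesConfiguration>),
}

// Variant 1 (as in the PR): an associated `Output` type.
trait Extraction: GlobalMetadata {
    type Output;
    fn extract(container: MetadataContainer) -> Option<Arc<Self::Output>>;
}

// Variant 2 (as suggested): extract `Self` directly, no associated type.
trait ExtractionAlt: GlobalMetadata + Sized {
    fn extract_as_global_metadata(v: MetadataContainer) -> Option<Arc<Self>>;
}

impl ExtractionAlt for NodesConfiguration {
    fn extract_as_global_metadata(v: MetadataContainer) -> Option<Arc<Self>> {
        match v {
            MetadataContainer::NodesConfiguration(config) => Some(config),
        }
    }
}
```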
Force-pushed from e3c08df to 9a80ce4.
Force-pushed from 5a114a0 to 44f78c3.
Force-pushed from 83e8432 to 55201e5.
@tillrohrmann Thanks for taking a look and for the feedback. I've adjusted the approach a little to take the total retry duration into account and thereby indirectly limit the wait time. In all fairness, it's a tricky problem to solve, but at least with this change we don't end up being stuck forever.
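A minimal sketch of the idea of bounding retries by total elapsed time rather than by attempt count (the names here are illustrative, not restate's actual `RetryPolicy` API):

```rust
use std::time::{Duration, Instant};

/// Retry `op` with exponential backoff, giving up once the total elapsed
/// time (including the next planned sleep) would exceed `max_total`.
fn retry_with_total_budget<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    initial_delay: Duration,
    factor: f64,
    max_total: Duration,
) -> Result<T, E> {
    let start = Instant::now();
    let mut delay = initial_delay;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Stop retrying if the next sleep would blow the budget.
                if start.elapsed() + delay > max_total {
                    return Err(err);
                }
                std::thread::sleep(delay);
                delay = Duration::from_secs_f64(delay.as_secs_f64() * factor);
            }
        }
    }
}
```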
This commit introduces a new network protocol version with an automatic backward-compatibility layer and an accompanying modified networking API to provide support for the following (a sketch of the sort-code idea follows this list):

1. Native RPC at the protocol level with fabric-level service and service-shard routing. Routing to service shards is offered via an opaque `SortCode`, a u64 value that can be used to route messages to a specific shard of a service (e.g. `PartitionId`, `LogletId`, etc.).
2. Support for cancellation of enqueued egress RPC requests if the caller is no longer interested in the result.
3. Message payload deserialization offloaded from the network reactor to the call-site's thread, to reduce the network reactor's CPU usage and the effect of head-of-line blocking caused by expensive messages.
4. Adds the concept of a `Service` that can handle `rpc`, `unary`, and `watch` messages.
5. Improved ergonomics for using the message fabric to send RPC requests: no need to get access to `MessageRouterBuilder` anymore, which unlocks the ability to create arbitrary connections that are self-managed (not tied to the connection manager).
6. WIP: introduces the `Swimlane` concept to classify streams/connections into different swim lanes. This will provide isolation between fat data streams and low-latency metadata streams in the future.
7. WIP: support for "remote watches". Not fully implemented, but will be available in the future.
8. A variety of fixes and improvements to the existing code.
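As a rough illustration of the sort-code idea from point 1 (entirely hypothetical types; the real fabric's routing is more involved):

```rust
/// The fabric treats the sort code as an opaque u64 routing key.
type SortCode = u64;

#[derive(Clone, Copy)]
struct PartitionId(u16);

impl PartitionId {
    /// Any shard identifier that packs into a u64 can serve as a sort code.
    fn sort_code(self) -> SortCode {
        self.0 as u64
    }
}

/// A receiver-side dispatcher might map the opaque code onto a local shard.
fn route_to_shard(sort_code: SortCode, num_shards: u64) -> u64 {
    sort_code % num_shards
}

fn main() {
    let partition = PartitionId(7);
    println!("routed to shard {}", route_to_shard(partition.sort_code(), 16));
}
```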
Thanks for adding a timeout for the read-modify-write operation @AhmedSoliman. Nice tooling you've added for the `RetryPolicy` :-) LGTM, +1 for merging.
```rust
//-------------------------------------------------------------
// Put all arithmetic in f64 milliseconds for convenience
//-------------------------------------------------------------
let r = *factor as f64;
let d1_ms = next_delay.as_secs_f64() * 1_000.0; // d₁
let cap_ms = max_interval.map(|d| d.as_secs_f64() * 1_000.0); // M

//-------------------------------------------------------------
// How many future delays remain purely exponential (< cap)?
//-------------------------------------------------------------
let n_exp = match cap_ms {
    None => retries_left,       // no cap at all
    Some(m) if d1_ms >= m => 0, // already above / at the cap
    Some(m) => {
        // smallest j s.t. d₁·rʲ ≥ M  →  j = ceil(log_r(M/d₁))
        let ceil_j = ((m / d1_ms).ln() / r.ln()).ceil() as usize;
        retries_left.min(ceil_j)
    }
};

//-------------------------------------------------------------
// Geometric part (those still < cap)
//-------------------------------------------------------------
let geom_ms = if n_exp == 0 {
    0.0
} else {
    d1_ms * (r.powi(n_exp as i32) - 1.0) / (r - 1.0)
};

//-------------------------------------------------------------
// Flat tail at the cap, if any
//-------------------------------------------------------------
let cap_tail_ms = match cap_ms {
    Some(m) => (retries_left - n_exp) as f64 * m,
    None => 0.0,
};
```
Super nicely explained the math 🤩
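As a concrete sanity check of the formula (numbers chosen for illustration, not from the PR): with d₁ = 100 ms, r = 2, cap M = 1000 ms, and 6 retries left, the remaining delays are 100, 200, 400, 800, 1000, 1000 ms:

```rust
fn main() {
    let (d1_ms, r, m, retries_left) = (100.0_f64, 2.0_f64, 1000.0_f64, 6usize);
    // n_exp = min(retries_left, ceil(log_r(M / d1))) = ceil(log2(10)) = 4
    let n_exp = retries_left.min(((m / d1_ms).ln() / r.ln()).ceil() as usize);
    let geom_ms = d1_ms * (r.powi(n_exp as i32) - 1.0) / (r - 1.0); // 100+200+400+800 = 1500
    let cap_tail_ms = (retries_left - n_exp) as f64 * m; // 2 * 1000 = 2000
    assert_eq!(geom_ms + cap_tail_ms, 3500.0); // matches the delays summed by hand
    println!("total remaining delay: {} ms", geom_ms + cap_tail_ms);
}
```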
```rust
None => retries_left,       // no cap at all
Some(m) if d1_ms >= m => 0, // already above / at the cap
Some(m) => {
    // smallest j s.t. d₁·rʲ ≥ M  →  j = ceil(log_r(M/d₁))
```
I guess this works because if M/d1 is a power of r, then we simply count the last exponential step as the first tail step, right?
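Working through that boundary case with concrete numbers (mine, not from the PR): take d₁ = 100 ms, r = 2, M = 800 ms, so M/d₁ = 8 = 2³ exactly. Then j = ceil(log₂ 8) = 3, the geometric part sums 100 + 200 + 400 = 700 ms, and the fourth delay, d₁·r³ = 800 ms, is billed as the first tail step at the cap, which is the same 800 ms either way, so the totals agree.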
Network swimlanes categorize network connections into different groups based on their characteristics. This PR is a continuation of the work done in the previous PR, which introduced the concept of swimlanes. The goal is to enhance the system's network-management capabilities by allowing more granular control and monitoring of network connections. Most notable changes:

- We keep a single connection per swimlane instead of a pool of connections. This removes the configuration option `num-concurrent-connections`.
- Connection/stream window sizes are now configurable and are set to reasonable defaults that suit our workloads better than tonic/hyper's defaults.
- Removal of `outbound-queue-length` in networking in favor of relying on memory-level settings (stream window sizes).
- Introducing `swimlane` in the `Hello` message to communicate the swimlane to the peer during handshake.
- Gossip connections are automatically bidirectional.
Stack created with Sapling. Best reviewed with ReviewStack.