

@AhmedSoliman AhmedSoliman commented Apr 15, 2025

The new global metadata synchronization infrastructure is designed to
ensure that all nodes in the cluster have consistent metadata while reducing load over the metadata store.

This has significant implications for the overall performance and reliability of the system, especially when the metadata store is overloaded or naturally slow.
The new infrastructure removes the need for ad hoc metadata sync requests and limits staleness to a configurable value.
Additionally, the new infrastructure allows nodes to keep pulling metadata changes from the growing set of peers that report the new metadata version to them.

Other minor changes:

  • Removed the option to manually sync metadata when requesting to create a snapshot
  • On node startup, we wait for a full global metadata sync
  • Added initially loaded metadata versions to the My Node ID log message

Note that there can be certain scenarios in which an isolated node doesn't hear about a metadata change that happened in the cluster (e.g. a node that normally doesn't interact with the rest of the cluster). This is still covered by the metadata update interval, which will eventually fetch the metadata from the metadata store; in the future, this will be covered automatically by gossip messaging, which disseminates metadata version changes much more quickly.

// intentionally empty

Stack created with Sapling. Best reviewed with ReviewStack.


Test Results

 4 files -3     4 suites -3    58s ⏱️ -1m 50s
15 tests -37   15 ✅ -36   0 💤 -1   0 ❌ ±0
30 runs  -183  30 ✅ -180  0 💤 -3  0 ❌ ±0

Results for commit 3c16a91. ± Comparison against base commit fed9f84.

This pull request removes 52 tests and adds 15. Note that renamed tests count towards both.
dev.restate.sdktesting.tests.CallOrdering ‑ ordering(boolean[], Client)[1]
dev.restate.sdktesting.tests.CallOrdering ‑ ordering(boolean[], Client)[2]
dev.restate.sdktesting.tests.CallOrdering ‑ ordering(boolean[], Client)[3]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[1]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[2]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[3]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[1]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[2]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[3]
dev.restate.sdktesting.tests.Combinators ‑ awakeableOrTimeoutUsingAwakeableTimeoutCommand(Client)
…
dev.restate.sdktesting.tests.AwakeableIngressEndpoint ‑ completeWithFailure(Client)
dev.restate.sdktesting.tests.AwakeableIngressEndpoint ‑ completeWithSuccess(Client)
dev.restate.sdktesting.tests.FrontCompatibilityTest$NewVersion ‑ completeAwakeable(Client)
dev.restate.sdktesting.tests.FrontCompatibilityTest$NewVersion ‑ completeRetryableOperation(Client)
dev.restate.sdktesting.tests.FrontCompatibilityTest$NewVersion ‑ proxyCallShouldBeDone(Client)
dev.restate.sdktesting.tests.FrontCompatibilityTest$NewVersion ‑ proxyOneWayCallShouldBeDone(Client)
dev.restate.sdktesting.tests.FrontCompatibilityTest$OldVersion ‑ createAwakeable(Client)
dev.restate.sdktesting.tests.FrontCompatibilityTest$OldVersion ‑ startOneWayProxyCall(Client)
dev.restate.sdktesting.tests.FrontCompatibilityTest$OldVersion ‑ startProxyCall(Client)
dev.restate.sdktesting.tests.FrontCompatibilityTest$OldVersion ‑ startRetryableOperation(Client)
…


github-actions bot commented Apr 15, 2025

Test Results

  7 files ±0     7 suites ±0   4m 42s ⏱️ -6s
 54 tests ±0    52 ✅ -1    1 💤 ±0   0 ❌ ±0   1 🔥 +1
223 runs  ±0   219 ✅ -1    3 💤 ±0   0 ❌ ±0   1 🔥 +1

For more details on these errors, see this check.

Results for commit 255fbf4. ± Comparison against base commit 71758bc.

♻️ This comment has been updated with latest results.


@muhamadazmy muhamadazmy left a comment


Thank you so much @AhmedSoliman for this huge improvement. The changes look good to me :). I left a few comments/questions.

```diff
@@ -1205,7 +1203,10 @@ impl LogsController {
             WriteError::FailedPrecondition(err) => {
                 info!(%err, "Detected a concurrent modification to the log chain");
                 // Perhaps we already have a newer version, if not, fetch latest.
-                let _ = Metadata::current().sync(MetadataKind::Logs, TargetVersion::Version(previous_version.next())).await;
+                let metadata = Metadata::current();
+                metadata.notify_observed_version(MetadataKind::Logs, previous_version.next(), None);
```

Maybe in the future, FailedPrecondition should contain information about the current version, so that notify_observed_version can be called with the real current version.

```diff
@@ -262,9 +253,6 @@ struct AdminOptionsShadow {
     #[serde_as(as = "serde_with::DisplayFromStr")]
     heartbeat_interval: humantime::Duration,
 
-    #[serde_as(as = "serde_with::DisplayFromStr")]
-    metadata_sync_interval: humantime::Duration,
```

Would it make sense to leave this in AdminOptionsShadow so we can print a proper deprecation warning (that it has been removed or replaced by metadata_update_interval)?

```rust
use crate::{Version, Versioned, flexbuffers_storage_encode_decode};

/// A trait all metadata types managed by metadata manager.
pub trait GlobalMetadata: Versioned + StorageEncode + StorageDecode {
```

Nice

```rust
/// A trait all metadata types managed by metadata manager.
pub trait GlobalMetadata: Versioned + StorageEncode + StorageDecode {
    /// The key for this metadata type in metadata store
    const KEY: &'static str;
```

I like how the key is defined near the type; I always forget where the key constants are defined.

```rust
let metadata = Metadata::current();
let (logs_version, partition_table_version) = match join(
    metadata.wait_for_version(MetadataKind::Logs, restate_types::Version::MIN),
    metadata.wait_for_version(MetadataKind::PartitionTable, restate_types::Version::MIN),
```

Why not syncing the schema as well?

Contributor Author


Because Schema is not required to have Version=1 at provisioning time. The first version is written after the user registers a deployment. Note that the update_task for schema will continue polling the metadata server until the schema version is set anyway.

@AhmedSoliman AhmedSoliman force-pushed the pr3159 branch 3 times, most recently from 06c2882 to a434074 Compare April 16, 2025 14:44

@tillrohrmann tillrohrmann left a comment


Great work @AhmedSoliman 🦸‍♂️ Avoiding the excessive syncs, making the metadata update tasks responsive to newly observed versions, and allowing nodes to fetch from more than one peer are great improvements. The results speak for themselves :-) LGTM. +1 for merging.

```rust
// which peers did we ask already?
peers_attempted_for_this_version: HashSet<GenerationalNodeId>,
in_flight_peer_requests: JoinSet<Option<Arc<T>>>,
_marker: std::marker::PhantomData<T>,
```

The PhantomData does not seem to be needed.


```rust
let handle = TaskCenter::spawn_unmanaged(
    TaskKind::MetadataBackgroundSync,
    "metadata-update-{}",
```

Is a format!("metadata-update-{}", T::KIND) missing here?

```diff
@@ -279,12 +279,24 @@ pub struct CommonOptions {
 
     /// # Metadata update interval
     ///
+    /// The interval at which each node checks for metadata updates it has observed from different
+    /// nodes or other sources.
+    /// The idle time afterwhich the node will check for metadata updates from metadata store.
```

after which

@AhmedSoliman AhmedSoliman force-pushed the pr3159 branch 2 times, most recently from 67fb684 to 1351e67 Compare April 17, 2025 09:25
- Auto-recovery needs to be higher than http2 keep-alive timeout
- Shorter raft election timeout (~1-ish second)
@AhmedSoliman AhmedSoliman merged commit 255fbf4 into main Apr 22, 2025
53 of 55 checks passed
@AhmedSoliman AhmedSoliman deleted the pr3159 branch April 22, 2025 11:09
@github-actions github-actions bot locked and limited conversation to collaborators Apr 22, 2025