-
Notifications
You must be signed in to change notification settings - Fork 97
Heartbeat improvements and configurability #6666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Heartbeat improvements and configurability #6666
Conversation
📝 Walkthrough📝 WalkthroughWalkthroughThe pull request introduces modifications to the HOPR node's configuration and transport layer, specifically enhancing the heartbeat mechanism. Key changes include the addition of the Changes
Possibly related PRs
Suggested labels
Suggested reviewers
📜 Recent review detailsConfiguration used: .coderabbit.yaml ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (7)
🚧 Files skipped from review as they are similar to previous changes (6)
🔇 Additional comments (6)transport/network/src/heartbeat.rs (6)
The new imports are correctly placed and properly grouped with related imports. Also applies to: 31-34
The
The
The peer list shuffling is well-implemented:
The
Let's verify the ✅ Verification successfulConstant is properly defined with a reasonable value The constant 🏁 Scripts executedThe following scripts were executed for the analysis: Script: #!/bin/bash
# Description: Check the DEFAULT_MAX_PARALLEL_PINGS constant definition and value
# Search for the constant definition
rg -A 1 "DEFAULT_MAX_PARALLEL_PINGS\s*=\s*" transport/network/src/constants.rs
# Check if the constant is used consistently
rg "DEFAULT_MAX_PARALLEL_PINGS" --type rust
Length of output: 342 Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (7)
transport/network/src/constants.rs (1)
12-13
: Enhance documentation with rationale and considerationsWhile the basic purpose is documented, it would be helpful to include:
- Rationale for choosing this specific value
- Any system resource considerations
- Impact on network performance
Consider expanding the documentation like this:
-/// The maximum number of parallel ping operation the heartbeat performs +/// The maximum number of parallel ping operations the heartbeat performs. +/// This limit helps prevent resource exhaustion while maintaining efficient +/// network probing. The value is chosen based on [add rationale here]. +/// +/// Note: Higher values may increase CPU and network usage, while lower values +/// may increase the time needed to detect unresponsive peers.transport/network/src/config.rs (2)
165-166
: Remove or update the misleading documentation.The comment "Enforcing backwards compatibility with the 'no ignore behaviour' of the previous release" appears incorrect, as this change actually introduces a longer ignore timeframe (4 minutes) compared to the previous value (1 second).
Consider removing the comment or updating it to reflect the actual purpose of this change:
-/// Enforcing backwards compatibility with the "no ignore behaviour" of the previous release +/// Default ignore timeframe of 4 minutes for heartbeat mechanism
Line range hint
73-142
: Consider adding validation for ignore_timeframe.The
ignore_timeframe
field currently lacks validation checks. Consider adding validation to ensure:
- The value is not zero (which would effectively disable the ignore mechanism)
- The value is within reasonable bounds
Add validation in the
validate()
method:impl Validate for NetworkConfig { fn validate(&self) -> std::result::Result<(), validator::ValidationErrors> { let mut errors = validator::ValidationErrors::new(); + + if self.ignore_timeframe.as_secs() == 0 { + errors.add( + "ignore_timeframe", + validator::ValidationError::new("ignore_timeframe must be greater than 0"), + ); + } + + // Suggest maximum of 10 minutes to prevent excessive delays + if self.ignore_timeframe.as_secs() > 600 { + errors.add( + "ignore_timeframe", + validator::ValidationError::new("ignore_timeframe must not exceed 10 minutes"), + ); + } // ... rest of the validation ...transport/network/src/heartbeat.rs (2)
43-47
: Consider adding an upper bound validation for max_parallel_probesWhile the field is well-configured with non-negative validation, consider adding an upper bound to prevent excessive resource consumption. This could be important for system stability under high load.
#[validate(range(min = 0))] +#[validate(range(max = 100))] // Or another appropriate upper limit #[default(default_max_parallel_pings())] #[serde(default = "default_max_parallel_pings")] pub max_parallel_probes: usize,
178-183
: Add error handling for RNG initializationWhile using crypto-secure RNG for shuffling is good practice, consider adding error handling for RNG initialization failures.
-let mut rng = hopr_crypto_random::rng(); -peers.shuffle(&mut rng); +match hopr_crypto_random::rng() { + Ok(mut rng) => peers.shuffle(&mut rng), + Err(e) => { + error!("Failed to initialize RNG for peer shuffling: {}", e); + // Fallback to default order or return early + } +}hoprd/hoprd/example_cfg.yaml (1)
Line range hint
189-214
: Review interaction between heartbeat timing parametersThe new changes should be considered in context of existing timing parameters:
heartbeat.interval: 60s
- probe frequencyheartbeat.threshold: 60s
- renewal windowheartbeat.variance: 2s
- jitternetwork_options.quality_step: 0.1
- quality adjustmentnetwork_options.quality_offline_threshold: 0.0
- offline detectionConsider documenting:
- How
max_parallel_probes: 14
affects the effective probe rate given the 60s interval- Whether the 240s ignore timeframe aligns well with the 60s heartbeat interval and threshold
- If the quality adjustment step (0.1) is appropriate given the longer ignore timeframe
This would help operators understand the timing relationships and tune parameters appropriately for their deployment scenario.
transport/api/src/lib.rs (1)
299-299
: Consider consistent naming between configuration and implementationThe configuration uses
max_parallel_probes
while the implementation usesmax_parallel_pings
. Consider aligning these names for better code maintainability.- max_parallel_pings: self.cfg.heartbeat.max_parallel_probes, + max_parallel_pings: self.cfg.heartbeat.max_parallel_pings,
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (6)
hoprd/hoprd/example_cfg.yaml
(2 hunks)transport/api/src/lib.rs
(1 hunks)transport/network/Cargo.toml
(1 hunks)transport/network/src/config.rs
(2 hunks)transport/network/src/constants.rs
(1 hunks)transport/network/src/heartbeat.rs
(6 hunks)
🔇 Additional comments (11)
transport/network/src/constants.rs (1)
13-13
: Verify the choice of 14 parallel pings
The specific value of 14 for DEFAULT_MAX_PARALLEL_PINGS
needs justification:
- What factors influenced this choice?
- Was this value determined through performance testing?
- How does it balance resource usage vs. heartbeat effectiveness?
Let's check if there are any related configuration values or tests that might provide context:
transport/network/Cargo.toml (1)
25-25
: LGTM! Addition of rand crate aligns with randomized peer probing requirement.
The addition of the rand
dependency with workspace configuration is appropriate for implementing the randomized peer probing feature mentioned in the PR objectives.
Let's verify the workspace configuration:
✅ Verification successful
✓ Verified: rand dependency is properly configured in workspace
The rand
crate is correctly defined in the root Cargo.toml
with version "0.8.5" and is being used consistently across multiple workspace crates, including the transport/network
crate, through workspace inheritance. This configuration aligns with the project's dependency management practices.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify rand crate configuration in the workspace
# Expected: Find rand dependency in the workspace root's Cargo.toml
# Check workspace root Cargo.toml for rand configuration
rg -A 5 'rand.*=.*{.*version' ./Cargo.toml
# Verify other crates using rand from workspace
rg -l 'rand.*=.*{.*workspace.*=.*true' -g 'Cargo.toml'
Length of output: 334
Script:
#!/bin/bash
# Check workspace root Cargo.toml for rand configuration
rg "rand" ./Cargo.toml
# Verify other crates using rand from workspace
fd Cargo.toml --exec rg -l "rand.*workspace.*true" {}
Length of output: 823
transport/network/src/config.rs (2)
51-52
: LGTM! Default value update aligns with PR objectives.
The change to use duration_4_min
as the default value for ignore_timeframe
correctly implements the intended increase in default ignore timeout to 4 minutes.
51-52
: Verify integration with heartbeat logic.
Let's ensure this configuration change is properly integrated with the heartbeat mechanism.
Also applies to: 165-166
✅ Verification successful
Let me gather more information about the heartbeat configuration and its usage.
Let me check one more aspect to understand the heartbeat configuration better.
Configuration change is properly integrated with heartbeat logic
The ignore_timeframe
duration setting (4 minutes) is correctly integrated with the heartbeat mechanism:
- Used appropriately in peer status checks via
is_ignored()
method - Works alongside the heartbeat constants (60-second intervals and thresholds)
- Applied consistently for peer ignore status calculations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check how ignore_timeframe is used in heartbeat-related code
# Look for usage of ignore_timeframe in heartbeat-related files
echo "Checking usage of ignore_timeframe in heartbeat logic:"
rg -l "ignore_timeframe" | grep -i "heart"
# Look for any hardcoded timing values that might need adjustment
echo "Checking for potential hardcoded timing values that might need adjustment:"
rg "Duration::from_secs\((?:60|120|180|240|300)\)" | grep -i "heart"
Length of output: 650
Script:
#!/bin/bash
# Check the context around heartbeat configuration and duration_4_min
echo "Checking duration_4_min usage and context:"
rg -A 5 "duration_4_min" transport/network/src/config.rs
echo "\nChecking heartbeat configuration structure:"
ast-grep --pattern 'struct $_ {
$$$
heartbeat $$$
$$$
}'
echo "\nChecking heartbeat-related functions:"
rg -A 5 "heartbeat" transport/network/src/config.rs
Length of output: 930
Script:
#!/bin/bash
# Check for heartbeat-related constants and their usage
echo "Checking heartbeat constants and their usage:"
rg -A 3 "DEFAULT_HEARTBEAT" transport/network/src/constants.rs
echo "\nChecking files importing these constants:"
rg -l "use.*constants::.*DEFAULT_HEARTBEAT"
echo "\nChecking ignore_timeframe usage context:"
rg -B 2 -A 2 "ignore_timeframe" transport/network/src
Length of output: 2922
transport/network/src/heartbeat.rs (3)
9-9
: LGTM: Import changes align with new functionality
The new imports support the peer shuffling feature and parallel probing configuration.
Also applies to: 31-34
66-70
: LGTM: Default function follows established pattern
The implementation is consistent with other default functions in the file.
Line range hint 1-1
: Verify default ignore timeout implementation
The PR objective mentions increasing the default ignore timeout to 4 minutes, but this change is not visible in the current file.
✅ Verification successful
Default ignore timeout is correctly implemented
The default ignore timeout of 4 minutes is properly implemented in transport/network/src/config.rs
through the duration_4_min()
function which returns Duration::from_secs(4 * 60)
. This configuration is used throughout the network implementation for peer ignore timeframes.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for ignore timeout configuration
rg -l "ignore.*time(out|frame)" --type rust
Length of output: 160
Script:
#!/bin/bash
# Check the content of network config and network files for ignore timeout
rg "ignore.*time(out|frame)" transport/network/src/config.rs transport/network/src/network.rs -B 2 -A 2
Length of output: 1969
Script:
#!/bin/bash
# Check the default value implementation for ignore_timeframe
rg "duration_4_min" transport/network/src/config.rs -B 2 -A 4
Length of output: 481
hoprd/hoprd/example_cfg.yaml (2)
214-214
: Verify impact of increased ignore_timeframe
The ignore_timeframe
has been increased from 1 to 240 seconds (4 minutes) which aligns with PR objectives. This change affects how long a node is considered "ignored" after failed probes.
Let's analyze the interaction with other timing parameters:
Consider documenting:
- How this value interacts with
quality_step
(0.1) andquality_offline_threshold
(0.0) - The rationale behind 4 minutes - is it based on network analysis?
- Impact on peer discovery and network resilience
✅ Verification successful
Let me search for more specific information about how the ignore_timeframe parameter is used.
Verified: ignore_timeframe increase aligns with implementation
The 240 second ignore_timeframe is safely implemented:
- It works with the peer quality system where:
- Failed probes decrease quality by quality_step (0.1)
- Successful probes increase quality by quality_step (0.1)
- Peers below quality_offline_threshold (0.0) are considered offline
- The ignore_timeframe prevents repeated probe attempts to unresponsive peers
- The 4 minute duration provides adequate time for temporary network issues to resolve while not permanently ignoring peers
The configuration values work together appropriately - failed peers can recover quality over time while preventing excessive probing of offline nodes.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for code that uses ignore_timeframe to understand impact
rg -A 5 "ignore_timeframe|240" --type rust
# Look for related timing constants
rg "interval|threshold|timeout" --type rust
Length of output: 132351
Script:
#!/bin/bash
# Search for code that uses ignore_timeframe to understand its interaction with other timing parameters
rg -A 5 "ignore_timeframe" --type yaml --type rust
# Look for quality_step and quality_offline_threshold usage
rg "quality_step|quality_offline_threshold" --type yaml --type rust
Length of output: 6108
189-190
: Consider documenting the rationale for max_parallel_probes value
The new max_parallel_probes: 14
parameter enables parallel heartbeat probing, which aligns with the PR objectives. However, the choice of 14 as the default value needs justification:
- What factors influenced this specific number?
- Are there any resource constraints to consider?
- How does this value affect network load and node performance?
Let's check if this value is consistently used across the codebase:
Consider adding comments in the configuration file to explain:
- The purpose of this parameter
- Recommended ranges for different deployment scenarios
- Resource implications of increasing/decreasing this value
transport/api/src/lib.rs (2)
Line range hint 400-472
: Implementation looks robust and well-structured
The ping implementation includes:
- Proper access validation
- Self-ping prevention
- Timeout handling
- Comprehensive error handling
- Status tracking
299-299
: Verify the ignore timeout configuration
The PR mentions increasing the default ignore timeout to 4 minutes, but this setting is not visible in this file.
✅ Verification successful
Ignore timeout is correctly set to 4 minutes (240 seconds) in the example configuration
The PR's change to increase the default ignore timeout to 4 minutes is correctly implemented in hoprd/hoprd/example_cfg.yaml
where ignore_timeframe: 240
(240 seconds = 4 minutes) is set.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for ignore timeout configuration
rg -A 2 "ignore_timeframe|ignore_timeout"
Length of output: 1828
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Crate version bump missing
02db6ac
to
c76315f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
transport/network/src/config.rs (2)
51-52
: Document the rationale for the 4-minute timeout.While increasing the ignore timeframe aligns with the PR objectives, it would be helpful to document why 4 minutes was chosen as the optimal value. Consider adding a doc comment explaining the reasoning and any performance implications.
pub struct NetworkConfig { + /// Default ignore timeframe of 4 minutes was chosen because... + /// [Add explanation here] #[serde_as(as = "DurationSeconds<u64>")] #[serde(default = "duration_4_min")] #[default(duration_4_min())] pub ignore_timeframe: Duration,
51-52
: Consider adding validation for ignore_timeframe.The ignore_timeframe should probably be validated against max_delay to ensure consistent behavior.
Add this check to the validate() method:
impl Validate for NetworkConfig { fn validate(&self) -> std::result::Result<(), validator::ValidationErrors> { let mut errors = validator::ValidationErrors::new(); + + if self.ignore_timeframe > self.max_delay { + errors.add( + "ignore_timeframe", + validator::ValidationError::new("ignore_timeframe must not exceed max_delay"), + ); + }transport/network/src/heartbeat.rs (2)
43-47
: Add documentation for max_parallel_probes performance implicationsWhile the implementation is correct, please add documentation explaining:
- How this value affects system performance
- Recommended ranges for different deployment scenarios
- Impact on resource usage (network/CPU)
178-183
: Consider adding empty peer list handlingWhile the shuffling implementation is correct, consider adding a guard clause for empty peer lists to avoid unnecessary RNG operations:
let mut peers = self.external_api.get_peers(from_timestamp).await; +if peers.is_empty() { + debug!("No peers to probe in this round"); + return; +} // shuffle the peers to make sure that the order is different each heartbeat round let mut rng = hopr_crypto_random::rng(); peers.shuffle(&mut rng);
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (7)
hoprd/hoprd/example_cfg.yaml
(2 hunks)transport/api/Cargo.toml
(1 hunks)transport/api/src/lib.rs
(1 hunks)transport/network/Cargo.toml
(2 hunks)transport/network/src/config.rs
(2 hunks)transport/network/src/constants.rs
(1 hunks)transport/network/src/heartbeat.rs
(6 hunks)
✅ Files skipped from review due to trivial changes (1)
- transport/api/Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (4)
- hoprd/hoprd/example_cfg.yaml
- transport/api/src/lib.rs
- transport/network/Cargo.toml
- transport/network/src/constants.rs
🔇 Additional comments (4)
transport/network/src/config.rs (2)
164-165
: LGTM!
The implementation is clear and consistent with other duration functions in the file.
51-52
: Verify the impact of the increased timeout.
The significant increase in the ignore timeframe from 1 second to 4 minutes might affect existing tests and behaviors.
Let's check for potential impacts:
Also applies to: 164-165
transport/network/src/heartbeat.rs (2)
9-9
: LGTM: Appropriate imports for new functionality
The new imports support the peer shuffling feature and constant organization is clean.
Also applies to: 31-34
66-69
: LGTM: Consistent with existing default functions
The implementation follows the established pattern in the codebase.
c76315f
to
d0b8791
Compare
Improvements to the heartbeat logic:
Notes
Refs #6662