Fix the network heartbeat behavior to be more robust and work with /node/peers #6583
Conversation
📝 Walkthrough
The changes in this pull request involve updates to documentation and configuration files for the HOPR project.
Actionable comments posted: 6
🧹 Outside diff range comments (3)
hoprd/rest-api/src/node.rs (2)
Line range hint 182-191: Consider adding timeouts for network operations.
To improve network robustness, consider adding timeouts for the peer info and address resolution operations. This would prevent potential hangs when network peers are unresponsive.
use std::time::Duration;
use tokio::time::timeout;

// Example timeout duration
const PEER_INFO_TIMEOUT: Duration = Duration::from_secs(5);

// Wrap the lookup in a timeout; the outer Ok is the timeout result,
// the inner Ok(Some(..)) is the existing success path. Error propagation via ?
// can be kept by mapping the Elapsed error into the API error type instead.
if let Ok(Ok(Some(info))) = timeout(PEER_INFO_TIMEOUT, hopr.network_peer_info(&peer)).await {
    // ... existing code ...
}
Line range hint 193-204: Consider enhancing HeartbeatInfo for better diagnostics.
To support better network diagnostics, consider expanding the HeartbeatInfo structure to include additional metrics like:
- Last heartbeat attempt timestamp
- Average response time
- Failure reasons
This would help diagnose network issues mentioned in issue #6529.
#[derive(Debug, Default, Clone, Serialize, utoipa::ToSchema)]
pub(crate) struct HeartbeatInfo {
    sent: u64,
    success: u64,
+   last_attempt: Option<u128>,
+   avg_response_time: Option<u64>,
+   last_failure_reason: Option<String>,
}
README.md (1)
Line range hint 214-216: Fix typos and network name inconsistency.
There are a few issues in this section:
- "enabels" should be "enables"
- "doufur" should be "dufour" to match the network name used in other sections
-A basic reasonable setup is that uses a custom identity and enabels a REST API of the `hoprd` could look like:
+A basic reasonable setup that uses a custom identity and enables a REST API of the `hoprd` could look like:
-hoprd --identity /app/hoprd-db/.hopr-identity --password switzerland --init --announce --host "0.0.0.0:9091" --apiToken <MY_TOKEN> --network doufur
+hoprd --identity /app/hoprd-db/.hopr-identity --password switzerland --init --announce --host "0.0.0.0:9091" --apiToken <MY_TOKEN> --network dufour
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE
📒 Files selected for processing (10)
- README.md (1 hunks)
- hoprd/hoprd/example_cfg.yaml (1 hunks)
- hoprd/rest-api/src/node.rs (1 hunks)
- tests/barebone-lower-win-prob.cfg.yaml (1 hunks)
- tests/barebone.cfg.yaml (1 hunks)
- tests/default.cfg.yaml (1 hunks)
- tests/node.py (0 hunks)
- transport/api/src/lib.rs (1 hunks)
- transport/network/src/config.rs (2 hunks)
- transport/network/src/network.rs (4 hunks)
💤 Files with no reviewable changes (1)
- tests/node.py
🧰 Additional context used
🪛 LanguageTool
README.md
[style] ~198-~198: Consider a shorter alternative to avoid wordiness.
Context: ...owing environment variables can be used in order to tweak the node functionality: - `HOPRD...
(IN_ORDER_TO_PREMIUM)
🔇 Additional comments (15)
tests/barebone.cfg.yaml (2)
3-7: 🧹 Nitpick (assertive)
Consider adjusting heartbeat threshold for better stability.
While the frequent heartbeat interval (10) is good for testing, having the threshold equal to the interval (10) provides no buffer for temporary network instability. Consider increasing the threshold to provide more resilience against transient failures.
A common pattern is to set the threshold to 1.5-2x the interval. For example:
 heartbeat:
   interval: 10
-  threshold: 10
+  threshold: 15
   variance: 1
8-10: Verify if ignore_timeframe is sufficient for test stability.
The ignore_timeframe (5) being shorter than the heartbeat interval (10) could lead to premature reconnection attempts. This might contribute to the race conditions mentioned in issue #6529.
Run the following to check if other test configurations use different ratios:
✅ Verification successful
Let me analyze the test configurations further to understand the relationship between these timing parameters and their usage in the codebase.
The ignore_timeframe setting is appropriate for test configurations
The verification reveals that:
- Test configurations consistently use a 5-second ignore_timeframe with a 10-second heartbeat interval
- Production configurations (in hoprd/example_cfg.yaml and local-cluster) use a much larger 600-second ignore_timeframe
- The codebase implementation in transport/network/src/network.rs shows that ignore_timeframe is used as a cooldown period before reconnection attempts
The shorter ignore_timeframe in test configurations is intentional and appropriate because:
- Tests need faster feedback cycles than production
- The implementation properly handles the timing relationship, using saturating subtraction to prevent underflow
- The race conditions mentioned in issue #6529 are unrelated to this timing ratio, as the heartbeat and ignore mechanisms work independently
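To make the timing relationship concrete, here is a side-by-side sketch of the two profiles described above (values taken from the verification output; parent keys and surrounding options are omitted):

```yaml
# Test profile (tests/*.cfg.yaml): fast feedback cycles
heartbeat:
  interval: 10        # probe cycle of 10 s
ignore_timeframe: 5   # cooldown before a previously failing peer is contacted again
---
# Production-style profile (hoprd example config, local-cluster): conservative
ignore_timeframe: 600 # 10-minute cooldown before reconnection attempts
```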
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Compare heartbeat and ignore_timeframe settings across test configs
# Expected: Find configurations that might provide insights into stable ratios
echo "Searching for heartbeat and ignore_timeframe configurations in test files..."
rg -A 10 'heartbeat:|ignore_timeframe:' tests/
Length of output: 2477
Script:
#!/bin/bash
# Search for code that handles reconnection logic related to these timing parameters
ast-grep --pattern 'ignore_timeframe'
# Also search for any comments or code discussing these timing parameters
rg -g '!*.{yaml,yml}' -B 2 -A 2 'ignore_timeframe|heartbeat.*interval'
Length of output: 6284
tests/barebone-lower-win-prob.cfg.yaml (2)
8-10: 🧹 Nitpick (assertive)
Verify ignore_timeframe in relation to heartbeat cycle.
The ignore_timeframe: 5 setting is documented as being "shorter than heartbeat cycle" (10s), which aligns with addressing potential race conditions in peer announcements.
However, we should ensure this value:
- Is sufficient to prevent false negatives in peer connectivity
- Doesn't mask actual connectivity issues
Let's verify the impact on peer connectivity:
#!/bin/bash
# Description: Check for any hardcoded timeouts or related timing constants
# that might conflict with this ignore_timeframe
echo "Checking for related timing constants:"
rg -A 2 "timeout|delay|interval" "src/**/*.rs"
Consider documenting:
- The rationale behind choosing 5 seconds
- How this value interacts with the heartbeat interval
- Any minimum/maximum bounds for this value
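A possible shape for that documentation, placed directly in the config file (comment wording is illustrative, not taken from the repository):

```yaml
# Cooldown (in seconds) before a previously failing peer is contacted again.
# Deliberately shorter than the 10 s heartbeat interval so tests get fast
# reconnection attempts; raise it if transient failures start being masked.
ignore_timeframe: 5
```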
3-7: 🧹 Nitpick (assertive)
Verify heartbeat timing parameters against network latency.
The heartbeat configuration introduces frequent checks with:
- 10-second interval
- 10-second threshold
- 1-second variance
While these values might help address the E2E test failures by increasing responsiveness, we should verify if they're appropriate for CI environment network conditions.
Let's check other config files for consistency:
Consider adding comments explaining:
- Why these specific timing values were chosen
- How they relate to the CI environment requirements
- The relationship between interval and threshold
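For example (comment wording is illustrative only):

```yaml
heartbeat:
  interval: 10   # short probe cycle so CI runs detect peer-state changes quickly
  threshold: 10  # intentionally equal to the interval in test configs; see the note above
  variance: 1    # small jitter so probes from parallel test nodes do not synchronize
```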
✅ Verification successful
Heartbeat configuration values are consistent with other test configurations
The heartbeat settings in tests/barebone-lower-win-prob.cfg.yaml match the values used in other test configuration files:
- Multiple test configs use identical values (interval: 10, threshold: 10, variance: 1)
- These aggressive timing values appear to be specifically chosen for test environments
- The production example config uses more conservative values (interval: 60)
The chosen values are appropriate for test scenarios where we want quick failure detection and faster test execution. The consistency across test configurations suggests these are well-established values for the test environment.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check heartbeat settings across all config files
# Expected: Similar heartbeat settings in related test configs
echo "Checking heartbeat settings in all config files:"
rg -A 3 "heartbeat:" "tests/*.yaml" "example*.yaml"
Length of output: 286
Script:
#!/bin/bash
# Let's try with fd to find yaml files first and then search in them
echo "Finding all yaml/yml files:"
fd -e yaml -e yml
echo -e "\nChecking heartbeat settings in these files:"
fd -e yaml -e yml -x rg -A 3 "heartbeat:"
Length of output: 5807
tests/default.cfg.yaml (2)
3-7: 🧹 Nitpick (assertive)
Review heartbeat threshold configuration.
The heartbeat threshold being equal to the interval (both set to 10) might be too strict. If a single heartbeat is delayed or missed, it could immediately trigger unwanted peer disconnections. Consider increasing the threshold to allow for some network latency and jitter.
Suggested approach:
- Set threshold to at least 2-3 times the interval (e.g., threshold: 20 or 30)
- This provides better tolerance for temporary network issues while still detecting genuinely offline peers
8-10: 🧹 Nitpick (assertive)
Verify ignore_timeframe value relative to heartbeat cycle.
The ignore_timeframe (5) is set to half the heartbeat interval (10). This could lead to race conditions where legitimate network events are ignored if they occur just before a heartbeat check.
Let's check if other configuration files maintain consistent timing relationships:
Consider:
- Setting ignore_timeframe closer to the heartbeat interval (e.g., 8-9)
- Or adding documentation explaining why 5 is the optimal value for preventing the race conditions mentioned in issue #6529
transport/network/src/config.rs (1)
10-11: Verify the impact of reduced quality thresholds on network stability.
The significant reduction in quality thresholds (OFFLINE: 0.5→0.0, BAD_QUALITY: 0.2→0.1) makes the network more lenient in handling poor connections. While this may help with test stability, please verify that:
- These thresholds don't mask actual network issues
- The network remains stable with potentially more poor-quality connections
✅ Verification successful
Let's examine the usage of these thresholds to better understand their impact.
Let's examine the network quality handling logic more closely.
Based on the evidence gathered, I can now provide a final response about the quality threshold changes:
The reduced quality thresholds are safe and align with the network's behavior
The changes to quality thresholds (OFFLINE: 0.5→0.0, BAD_QUALITY: 0.2→0.1) are well-justified:
- The network code shows that connections are only closed when quality <= offline_threshold (0.0), meaning nodes maintain connections as long as there's any measurable quality
- The bad_quality threshold (0.1) is used for metrics and peer stats, not for connection decisions
- Test cases demonstrate the network remains healthy even with low-quality connections
- The thresholds are consistently used across the codebase with proper validation
These changes make the network more resilient by maintaining connections longer while still monitoring connection quality through metrics.
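A minimal sketch of the decision logic summarized above (function, enum, and parameter names are assumptions made for illustration; the actual implementation lives in transport/network/src/network.rs):

```rust
/// Illustrative-only reproduction of the threshold behavior described above.
#[derive(Debug, PartialEq)]
enum PeerAction {
    CloseConnection,
    MarkBadQuality,
    KeepHealthy,
}

fn classify_peer(quality: f64, offline_threshold: f64, bad_quality_threshold: f64) -> PeerAction {
    if quality <= offline_threshold {
        // With offline_threshold = 0.0, the connection is only closed once the
        // measured quality has dropped all the way to zero.
        PeerAction::CloseConnection
    } else if quality <= bad_quality_threshold {
        // With bad_quality_threshold = 0.1, this branch only influences metrics
        // and peer statistics, not whether the connection is kept.
        PeerAction::MarkBadQuality
    } else {
        PeerAction::KeepHealthy
    }
}
```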
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for test files that verify network stability
rg -l "network.*stability|quality.*threshold" --type rust
Length of output: 282
Script:
#!/bin/bash
# Check how these thresholds are used in network quality decisions
rg -A 5 "DEFAULT_NETWORK_.*QUALITY_THRESHOLD" --type rust
# Look for quality-related tests
rg -B 2 -A 5 "test.*quality" --type rust
# Search for network quality assessment logic
ast-grep --pattern 'fn $_(quality: $_, $$$) { $$$ }'
Length of output: 23546
Script:
#!/bin/bash
# Check how network quality affects connection decisions
rg -B 2 -A 5 "quality.*threshold" transport/network/src/network.rs
# Look for quality-related state handling
ast-grep --pattern 'fn $_(quality: f64, $$$) { $$$ }'
# Check for quality validation logic
rg -B 2 -A 5 "validate.*quality" --type rust
Length of output: 3237
hoprd/hoprd/example_cfg.yaml (2)
189-189: LGTM: Reasonable adjustment to bad quality threshold.
The reduction of quality_bad_threshold from 0.2 to 0.1 provides a more lenient threshold for marking node connectivity as "bad", which aligns with making the network behavior more robust while still maintaining reasonable quality standards.
191-191: Verify impact of aggressive offline threshold.
While setting quality_offline_threshold to 0.0 helps quickly identify truly offline nodes and may improve E2E test reliability, it's worth verifying that this aggressive threshold doesn't cause issues in production environments with temporary network instability.
The surrounding configuration provides safeguards:
- quality_step: 0.1 ensures gradual quality adjustments
- quality_avg_window_size: 25 provides a smoothing effect
- Backoff parameters allow for gradual reconnection attempts
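Putting the values quoted above together, the relevant entries in hoprd/hoprd/example_cfg.yaml look roughly like this (only the keys mentioned in this comment are shown; nesting is omitted):

```yaml
quality_bad_threshold: 0.1      # connectivity below this is reported as "bad"
quality_offline_threshold: 0.0  # only a fully dead link is treated as offline
quality_step: 0.1               # gradual quality adjustments
quality_avg_window_size: 25     # smoothing window for the quality average
```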
Let's verify the E2E test improvements with these new thresholds:
hoprd/rest-api/src/node.rs (1)
174-175: LGTM! Good improvement to code readability.
Extracting the quality calculation into a variable improves code clarity and prevents potential double calculation.
README.md (1)
Line range hint 201-206: Document recommended values for libp2p connection limits.
Given that this PR addresses network stability and E2E test failures (issue #6529), it would be helpful to document recommended values for:
- HOPR_INTERNAL_LIBP2P_MAX_CONCURRENTLY_DIALED_PEER_COUNT
- HOPR_INTERNAL_LIBP2P_MAX_NEGOTIATING_INBOUND_STREAM_COUNT
These limits could help prevent race conditions during peer connections.
Let's check if there are any recommended values in the codebase:
✅ Verification successful
Let me search for the default constants mentioned in the code:
Document default values for libp2p connection limits
The environment variables have default values set in the codebase:
- HOPR_INTERNAL_LIBP2P_MAX_CONCURRENTLY_DIALED_PEER_COUNT: defaults to 255
- HOPR_INTERNAL_LIBP2P_MAX_NEGOTIATING_INBOUND_STREAM_COUNT: defaults to 255
Consider documenting these default values in the README to help users understand the baseline configuration.
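A minimal addition to the README's environment-variable list could read as follows (the descriptions are suggested wording; only the defaults come from the verification above):

```markdown
- `HOPR_INTERNAL_LIBP2P_MAX_CONCURRENTLY_DIALED_PEER_COUNT` - limit on concurrently dialed peers (default: 255)
- `HOPR_INTERNAL_LIBP2P_MAX_NEGOTIATING_INBOUND_STREAM_COUNT` - limit on concurrently negotiating inbound streams (default: 255)
```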
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for default values or documentation of these variables
rg -A 3 "HOPR_INTERNAL_LIBP2P_MAX"
Length of output: 1204
Script:
#!/bin/bash
# Search for the default constants
rg "HOPR_SWARM_CONCURRENTLY" transport/p2p/src/constants.rs
Length of output: 215
transport/network/src/network.rs (3)
240-241: LGTM: Improved peer quality threshold handling
The simplified logic for peer quality thresholds makes the code more robust and less prone to race conditions. The changes align well with the PR objectives to fix network heartbeat behavior.
Also applies to: 254-258
292-296: LGTM: Well-implemented connected peers retrieval
The new connected_peers method effectively filters peers based on the quality threshold, supporting the PR's goal to fix the /node/peers endpoint. The implementation is concise and leverages existing functionality.
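As a rough sketch of that filtering pattern (the struct and field names below are assumptions made for this example; the real method works on the network's internal peer state):

```rust
/// Illustrative sketch: keep only peers whose observed quality clears the threshold.
struct PeerSnapshot {
    id: String,
    average_quality: f64,
}

fn connected_peers(peers: &[PeerSnapshot], quality_threshold: f64) -> Vec<&str> {
    peers
        .iter()
        .filter(|p| p.average_quality > quality_threshold)
        .map(|p| p.id.as_str())
        .collect()
}
```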
299-299: LGTM: Appropriate visibility restriction
Reducing the visibility of peer_filter to pub(crate) is a good practice as it minimizes the public API surface while maintaining necessary access within the crate.
transport/api/src/lib.rs (1)
1269-1269: LGTM!
The method simplification enhances readability and aligns with the updated network interaction logic.
Fix network behavior inconsistencies related to improperly ignoring network peers and to closing peer connections for offline peers.
/node/peers output fix
Notes
Fixes #6529
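For manually checking the corrected endpoint output against a running node, something along these lines can be used (assuming the default REST API port 3001 and token authentication; adjust host, port, and API version to the deployment):

```bash
curl -s -H "x-auth-token: <MY_TOKEN>" "http://localhost:3001/api/v3/node/peers" | jq .
```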