Fix network issues #6791
Conversation
📝 Walkthrough

This pull request introduces modifications to network configuration and error handling across multiple components of the HOPR protocol. The changes primarily focus on adjusting timeout settings, network management parameters, and error handling mechanisms. Key modifications include increasing the maximum number of parallel pings, shortening the heartbeat round timeout and the peer ignore timeframe, and surfacing more specific ping errors through the REST API.
Actionable comments posted: 0
🧹 Nitpick comments (2)
transport/api/src/network_notifier.rs (2)
64-64: Consider preserving detailed error information

The parameter type change to core_network::errors::Result improves error handling, but the conversion to Result<Duration, ()> loses error details. Consider propagating the specific error information to improve debugging capabilities.
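As a rough illustration (not the project's actual API; the error type below is a hypothetical stand-in for core_network::errors::NetworkingError), one way to preserve the detail would be to map the error into a descriptive value instead of discarding it:

```rust
use std::time::Duration;

// Hypothetical stand-in for core_network::errors::NetworkingError.
#[derive(Debug)]
enum NetworkingError {
    Timeout(u64),
    Other(String),
}

type NetResult<T> = Result<T, NetworkingError>;

// Instead of collapsing into Result<Duration, ()>, keep a description of the failure.
fn preserve_detail(result: NetResult<Duration>) -> Result<Duration, String> {
    result.map_err(|e| format!("ping failed: {:?}", e))
}

fn main() {
    let err: NetResult<Duration> = Err(NetworkingError::Timeout(30));
    println!("{:?}", preserve_detail(err)); // Err("ping failed: Timeout(30)")
}
```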
67-70: Simplify error conversion logic

The match expression can be simplified while maintaining the same functionality.

```diff
- let result = match &result {
-     Ok(duration) => Ok(*duration),
-     Err(_) => Err(()),
- };
+ let result = result.map(|d| *d).map_err(|_| ());
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- hoprd/hoprd/example_cfg.yaml (2 hunks)
- hoprd/rest-api/src/peers.rs (1 hunks)
- transport/api/src/network_notifier.rs (1 hunks)
- transport/network/src/config.rs (2 hunks)
- transport/network/src/constants.rs (1 hunks)
- transport/network/src/ping.rs (16 hunks)
- transport/protocol/src/heartbeat/config.rs (1 hunks)
- transport/protocol/src/lib.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: hopli / docker
- GitHub Check: hoprd / docker
- GitHub Check: Docs / Rust docs
- GitHub Check: tests-smoke-websocket
- GitHub Check: tests-unit
- GitHub Check: tests-smart-contracts
- GitHub Check: tests-unit-nightly
- GitHub Check: Linter
🔇 Additional comments (8)
transport/protocol/src/heartbeat/config.rs (1)
13-13: Verify if 6 seconds is sufficient for network traversal

The reduction in timeout from 15s to 6s is significant. While this could make the system more responsive, we should verify that 6s is sufficient for:
- Network traversal across different geographical regions
- NAT traversal scenarios
- High-latency network conditions
Let's analyze the historical ping latencies to validate this change:
✅ Verification successful
6 second timeout is reasonable for heartbeat rounds
The timeout is specifically for heartbeat rounds, not individual pings. Individual ping operations maintain their own 30-second timeout for handling worst-case network scenarios. The heartbeat timeout can be more aggressive as it's a continuous health check where timeouts are expected and handled gracefully.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for any logging of ping latencies or timeout errors
rg -A 2 "ping.*latency|timeout.*ping"
```

Length of output: 8049
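To make the two-level timeout described above concrete, here is a minimal sketch assuming tokio, with illustrative names and the values from the discussion (6s heartbeat round, 30s per ping); this is not the project's actual heartbeat code:

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

// Illustrative values taken from the review discussion.
const PING_TIMEOUT: Duration = Duration::from_secs(30); // per-ping bound
const HEARTBEAT_ROUND_TIMEOUT: Duration = Duration::from_secs(6); // per-round bound

// A fake ping that just sleeps; stands in for a real network round trip.
async fn ping_peer(latency: Duration) -> Duration {
    sleep(latency).await;
    latency
}

#[tokio::main]
async fn main() {
    let peers = vec![Duration::from_millis(50), Duration::from_millis(120)];

    // The whole heartbeat round is bounded by the shorter round timeout ...
    let round = timeout(HEARTBEAT_ROUND_TIMEOUT, async {
        let mut results = Vec::new();
        for latency in peers {
            // ... while each individual ping keeps its own, longer timeout.
            results.push(timeout(PING_TIMEOUT, ping_peer(latency)).await);
        }
        results
    })
    .await;

    println!("round finished within {:?}: {}", HEARTBEAT_ROUND_TIMEOUT, round.is_ok());
}
```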
transport/network/src/constants.rs (1)
13-13: Consider resource implications of increased parallel pings

The increase from 14 to 25 parallel pings could significantly impact system resources. Please verify:
- Memory usage under peak load
- Network bandwidth consumption
- System stability with increased concurrent operations
Let's check for any existing resource constraints or related issues:
✅ Verification successful
Parallel ping limit increase is properly implemented and safe
The increase in DEFAULT_MAX_PARALLEL_PINGS is safe based on:
- Implementation enforces the limit correctly via split_off mechanism
- Covered by test cases verifying parallel behavior
- Used for network health probing with proper timeout controls
- No evidence of resource issues in the codebase
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for memory/resource related issues with parallel pings
rg -A 3 "out.*of.*memory|resource.*exhausted|too.*many.*parallel"
```

Length of output: 84751
Script:
```bash
#!/bin/bash
# Search for ping-related implementations and configurations
rg -g '!vendor' -A 3 -B 3 'parallel.*ping|ping.*limit|max.*ping'

# Look for ping implementation files
fd -g '!vendor' -e rs ping

# Search for ping-related tests
fd -g '!vendor' -e rs test.*ping
```

Length of output: 5176
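To illustrate the split_off mechanism mentioned in the verification above, here is a hedged sketch of batching peers up to a parallel-ping limit; the names are illustrative, not the crate's actual code:

```rust
// Illustrative default matching the value discussed in this PR.
const DEFAULT_MAX_PARALLEL_PINGS: usize = 25;

// Take at most `max_parallel` peers for the current batch and keep the rest for later.
fn take_ping_batch(peers: &mut Vec<String>, max_parallel: usize) -> Vec<String> {
    if peers.len() <= max_parallel {
        std::mem::take(peers)
    } else {
        // split_off keeps the first `max_parallel` entries for this batch
        // and leaves the remainder queued in `peers`.
        let remainder = peers.split_off(max_parallel);
        std::mem::replace(peers, remainder)
    }
}

fn main() {
    let mut peers: Vec<String> = (0..40).map(|i| format!("peer-{i}")).collect();
    let batch = take_ping_batch(&mut peers, DEFAULT_MAX_PARALLEL_PINGS);
    println!("batch: {}, remaining: {}", batch.len(), peers.len()); // batch: 25, remaining: 15
}
```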
hoprd/rest-api/src/peers.rs (1)
142-144: LGTM! Error handling improvement

The addition of explicit timeout error handling improves the API's error reporting capabilities, which aligns well with the PR objectives. The use of StatusCode::UNPROCESSABLE_ENTITY is appropriate for this scenario.

transport/protocol/src/lib.rs (1)
373-373: Improved error logging format.

The change enhances error visibility by explicitly labeling the error variable in the structured logging format, making debugging easier.
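As a hedged sketch of what a labeled error field looks like, assuming the tracing and tracing-subscriber crates (the exact call site in transport/protocol/src/lib.rs may differ):

```rust
use tracing::error;

fn main() {
    tracing_subscriber::fmt::init();

    let error = std::io::Error::new(std::io::ErrorKind::TimedOut, "peer unreachable");

    // Before: the error value is interpolated anonymously into the message text.
    error!("message processing failed: {}", error);

    // After: the error is an explicitly labeled structured field,
    // so log processors can filter on it directly.
    error!(%error, "message processing failed");
}
```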
transport/network/src/ping.rs (2)
139-159: Enhanced error handling with NetworkingError.

The error handling has been improved by:
- Using specific NetworkingError types instead of generic errors
- Adding detailed error messages for different failure scenarios
- Properly categorizing timeout vs internal errors
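A hedged sketch of that categorization; the enum variants and helper below are hypothetical, not the actual core_network definitions:

```rust
use std::time::Duration;

// Hypothetical variants; the real core_network error type may differ.
#[derive(Debug)]
enum NetworkingError {
    Timeout(u64),
    Other(String),
}

// Map a raw ping outcome into a specific, descriptive error instead of a generic one.
fn classify_ping_result(
    outcome: Option<Result<Duration, String>>,
    timeout_secs: u64,
) -> Result<Duration, NetworkingError> {
    match outcome {
        Some(Ok(rtt)) => Ok(rtt),
        Some(Err(reason)) => Err(NetworkingError::Other(format!("ping failed: {reason}"))),
        // None models an elapsed deadline: report it as a timeout, not an internal error.
        None => Err(NetworkingError::Timeout(timeout_secs)),
    }
}

fn main() {
    println!("{:?}", classify_ping_result(None, 30));
    println!("{:?}", classify_ping_result(Some(Ok(Duration::from_millis(42))), 30));
}
```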
63-63: Verify impact of increased parallel pings.

The default max_parallel_pings has been increased from 14 to 25, which could impact system resources in larger networks.
Run the following script to check for any resource-related issues in the codebase:
✅ Verification successful
Parallel pings increase is safe and within system limits
The increase from 14 to 25 parallel pings is well within the established system limits, as the codebase already handles much higher concurrency:
- libp2p concurrent dial limit: 255
- Max concurrent streams: 1024
- Max negotiating inbound streams: 255
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check for resource management and performance considerations.

# Search for resource management patterns
rg -A 3 "max_parallel|concurrent|resource|memory|cpu"

# Look for performance-related comments or issues
rg -i "performance|optimization|bottleneck|resource"
```

Length of output: 74457
transport/network/src/config.rs (1)
51-52: Consider impact of reduced ignore timeframe.

The ignore_timeframe has been reduced from 4 minutes to 2 minutes. While this makes the system more responsive to network changes, it could lead to increased network traffic due to more frequent retries.
Run the following script to analyze the usage of ignore_timeframe in the codebase:
Also applies to: 164-165
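As a rough sketch of what this setting governs (hypothetical helper, not the config.rs implementation): a recently failed peer is skipped until ignore_timeframe has elapsed, so halving the value roughly halves how long such a peer stays excluded before being retried.

```rust
use std::time::Duration;

// Illustrative value matching the new 2-minute default discussed in this PR.
const IGNORE_TIMEFRAME: Duration = Duration::from_secs(120);

// Decide whether a previously failing peer is eligible to be probed again.
fn should_retry(since_last_failure: Option<Duration>) -> bool {
    match since_last_failure {
        // Never failed: always eligible.
        None => true,
        // Failed before: only retry once the ignore timeframe has elapsed.
        Some(elapsed) => elapsed >= IGNORE_TIMEFRAME,
    }
}

fn main() {
    println!("{}", should_retry(Some(Duration::from_secs(30))));  // false: still ignored
    println!("{}", should_retry(Some(Duration::from_secs(300)))); // true: timeframe elapsed
    println!("{}", should_retry(None));                           // true: never failed
}
```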
hoprd/hoprd/example_cfg.yaml (1)
204-204: Configuration aligns with code changes.

The configuration values have been updated to match the code changes:
- ignore_timeframe reduced to 120 seconds (matching the 2-minute default in config.rs)
- heartbeat timeout reduced to 6 seconds for faster network responsiveness
Also applies to: 227-227
Force-pushed from f93eca8 to 1a764a6
Actionable comments posted: 0
🧹 Nitpick comments (1)
hoprd/rest-api/src/lib.rs (1)
405-405: LGTM! Consider adding documentation.

The addition of the PingError(String) variant aligns well with the PR objectives to improve error output for ping operations. The implementation follows Rust best practices and integrates seamlessly with the existing error handling infrastructure. Consider adding documentation for the new variant to explain when this error occurs and what the String parameter represents. Example:

```diff
 enum ApiErrorStatus {
     // ...
+    /// Represents errors that occur during ping operations.
+    /// The String parameter contains detailed information about the specific ping failure.
     PingError(String),
     // ...
 }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- hoprd/rest-api/src/lib.rs (1 hunks)
- hoprd/rest-api/src/peers.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hoprd/rest-api/src/peers.rs
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: tests-smoke-websocket
- GitHub Check: tests-unit-nightly
- GitHub Check: tests-smart-contracts
- GitHub Check: tests-unit
- GitHub Check: hoprd / docker
- GitHub Check: hopli / docker
- GitHub Check: Docs / Rust docs
- GitHub Check: Add PR labels
- GitHub Check: Linter
Actionable comments posted: 0
🧹 Nitpick comments (1)
hoprd/hoprd/example_cfg.yaml (1)
Line range hint 180-227: Monitor network stability after deploying these aggressive timing changes.

The combined effect of these configuration changes (increased parallel probes, reduced ignore timeframe, and shorter timeout) represents a more aggressive network management strategy. While this should improve responsiveness in larger networks, it's important to:
- Monitor false disconnections in slower network segments
- Track resource utilization under peak load
- Observe reconnection patterns for potentially problematic nodes
Consider implementing gradual rollout or A/B testing to validate these changes in production.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- hoprd/hoprd/example_cfg.yaml (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: hoprd / docker
- GitHub Check: hopli / docker
- GitHub Check: tests-unit-nightly
- GitHub Check: Docs / Rust docs
- GitHub Check: tests-smoke-websocket
- GitHub Check: tests-unit
- GitHub Check: Linter
- GitHub Check: tests-smart-contracts
🔇 Additional comments (3)
hoprd/hoprd/example_cfg.yaml (3)
180-180: Consider network size and resource implications of increased parallel probes.

Increasing max_parallel_probes from 14 to 25 will help cover more peers in larger networks during each heartbeat round. However, this also means:
- Higher concurrent network connections
- Increased memory usage
- More CPU utilization for probe processing
Please verify resource utilization under peak load with this configuration:
✅ Verification successful
Change aligns with codebase defaults and implementation constraints
The increase to 25 parallel probes matches the default value defined in transport/network/src/constants.rs. The implementation includes validation to prevent negative values and shows no hard resource limits that would be impacted by this change.

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for resource monitoring or performance testing code
rg -A 5 "resource|performance|memory|cpu" --type rust
```

Length of output: 109558
Script:
```bash
#!/bin/bash
# Search for probe implementation and related code
rg -A 5 "parallel.*probe|probe.*parallel|max_parallel_probes" --type rust

# Search for probe-related tests and documentation
rg -A 5 "test.*probe|probe.*test" --type rust

# Search for any resource limits or constraints in probe code
rg -A 5 "limit.*probe|probe.*limit|max.*probe|probe.*max" --type rust
```

Length of output: 21042
227-227: Evaluate timeout reduction impact on slower network segments.

Decreasing the heartbeat timeout from 15s to 6s will speed up failure detection, but:
- May cause false negatives in networks with higher latency
- Could lead to unnecessary peer disconnections in congested conditions
Let's analyze the timeout handling implementation:
✅ Verification successful
Timeout reduction is safe due to layered quality assessment
The reduction of heartbeat timeout from 15s to 6s is safe because:
- Individual ping failures don't immediately impact peer quality (uses 25-sample moving average)
- Quality degradation is gradual (0.1 step per failure)
- Network has built-in resilience with backoff mechanisms
- Multiple successful pings are required to significantly change peer quality
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for timeout handling code
ast-grep --pattern 'timeout|deadline|expired'
```

Length of output: 186
Script:
```bash
#!/bin/bash
# Search for timeout configuration usage
rg -g '!vendor' -g '!target' -A 5 'timeout:\s+\d+'

# Search for heartbeat-related code
rg -g '!vendor' -g '!target' -A 10 'heartbeat|peer.*connection|peer.*timeout'

# Find all yaml config files for comparison
fd -e yaml -e yml --exec head -n 50
```

Length of output: 66036
Script:
```bash
#!/bin/bash
# Search for network quality configurations and thresholds
rg -g '!vendor' -g '!target' -A 5 'quality.*threshold|network.*quality|minimum.*quality'

# Look for any network-related configuration defaults
rg -g '!vendor' -g '!target' -A 5 'DEFAULT_NETWORK|network.*config'
```

Length of output: 34401
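To illustrate why a single timed-out round does not immediately demote a peer, here is a simplified sketch of step-based quality updates; the real Network logic additionally keeps the 25-sample window mentioned above, and all names here are illustrative:

```rust
// Simplified peer-quality tracker: each failed ping only nudges quality down by a step,
// so several consecutive failures are needed before a peer looks unhealthy.
const QUALITY_STEP: f64 = 0.1;

#[derive(Debug)]
struct PeerQuality {
    quality: f64, // in [0.0, 1.0]
}

impl PeerQuality {
    fn new() -> Self {
        Self { quality: 1.0 }
    }

    fn on_ping_success(&mut self) {
        self.quality = (self.quality + QUALITY_STEP).min(1.0);
    }

    fn on_ping_failure(&mut self) {
        self.quality = (self.quality - QUALITY_STEP).max(0.0);
    }
}

fn main() {
    let mut peer = PeerQuality::new();
    // One missed heartbeat round barely moves the needle ...
    peer.on_ping_failure();
    println!("after 1 failure: {:.1}", peer.quality); // 0.9
    // ... while only sustained failures degrade it significantly.
    for _ in 0..5 {
        peer.on_ping_failure();
    }
    println!("after 6 failures: {:.1}", peer.quality); // 0.4
}
```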
204-204: Verify impact of reduced ignore timeframe on network stability.

Reducing ignore_timeframe from 240s to 120s means nodes will be retried more frequently after failures. While this can help recover from temporary issues faster, consider:
- Could lead to more frequent connection attempts to genuinely problematic nodes
- May increase network chatter in unstable environments
Let's check if there are any existing retry limits or backoff mechanisms:
✅ Verification successful
Reduced ignore timeframe is safe due to existing protective measures
The change from 240s to 120s is safe because the system implements:
- Exponential backoff (exponent: 1.5) with minimum delay of 2.0s
- Quality tracking with 25-sample moving average window
- Proper peer status management with connection rate limiting
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for backoff or retry logic in the codebase
ast-grep --pattern 'backoff|retry|attempt'
```
Script:
```bash
#!/bin/bash
# Search for ignore_timeframe usage and related connection handling
rg -i "ignore_timeframe" -A 5 -B 5

# Look for connection handling patterns
ast-grep --pattern 'fn $_(connection|peer|node|retry|backoff)'

# Check for documentation
fd README.md --exec cat {}
```

Length of output: 6833
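For reference, a minimal sketch of the backoff shape described in the verification above (exponent 1.5, minimum delay 2.0s); the constants, cap, and function are illustrative, not the crate's actual definitions:

```rust
// Illustrative constants taken from the verification notes above.
const BACKOFF_EXPONENT: f64 = 1.5;
const MIN_DELAY_SECS: f64 = 2.0;
const MAX_DELAY_SECS: f64 = 300.0; // assumed cap, purely illustrative

// Delay before the nth retry: min_delay * backoff^n, capped at a maximum.
fn retry_delay_secs(failed_attempts: u32) -> f64 {
    (MIN_DELAY_SECS * BACKOFF_EXPONENT.powi(failed_attempts as i32)).min(MAX_DELAY_SECS)
}

fn main() {
    for attempt in 0..5 {
        println!("attempt {attempt}: wait {:.1}s", retry_delay_secs(attempt));
    }
    // prints roughly 2.0, 3.0, 4.5, 6.8, 10.1 seconds
}
```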
The latest dAppNode package for DAppNodePackage-Hopr has been built and uploaded with IPFS hash: /ipfs/QmSwz6agoirRMGi6Ja66f7wWqEajkxRJhUX19Z4SqXkbbY
The latest dAppNode package for DAppNodePackage-Hopr has been built and uploaded with IPFS hash: /ipfs/QmZNuTZiyxnnk1txCCCxjg3X6xDWDUDBLpH8XqwprNC91P
Crate version bumps missing
The latest dAppNode package for DAppNodePackage-Hopr has been built and uploaded with IPFS hash: /ipfs/QmSVp66j8mZz9LyLxaMR7BUJbYyg8CCHTVZ9UGvEAP8XH9
The latest dAppNode package for DAppNodePackage-Hopr has been built and uploaded with IPFS hash: /ipfs/QmRrZu7VaSABNkkFuFLqfuik5Dv52mmmsacrCj8Aau1ccp
(cherry picked from commit 81946c2)