Skip to content

Conversation

Teebor-Choka
Copy link
Contributor

@Teebor-Choka Teebor-Choka commented Jan 20, 2025

  • Improve the error output on ping operations
  • Change default config values for larger networks like dufour to heartbeat as many peers in a single heartbeat round as possible
  • Fix timeout and network error signaling in the hoprd API

Copy link
Contributor

coderabbitai bot commented Jan 20, 2025

📝 Walkthrough

Walkthrough

This pull request introduces modifications to network configuration and error handling across multiple components of the HOPR protocol. The changes primarily focus on adjusting timeout settings, network management parameters, and error handling mechanisms. Key modifications include increasing the max_parallel_probes from 14 to 25, reducing the ignore_timeframe from 240 to 120 seconds, and decreasing the heartbeat timeout from 15 to 6 seconds. Additionally, error handling in the ping operations has been refined, and a new error variant has been added to the API.

Changes

File Change Summary
hoprd/hoprd/example_cfg.yaml - Increased max_parallel_probes from 14 to 25
- Reduced ignore_timeframe from 240 to 120 seconds
- Decreased heartbeat timeout from 15 to 6 seconds
hoprd/rest-api/src/peers.rs - Added error handling for HoprLibError::TransportError with NetworkingError::Timeout and NetworkingError::PingerError
transport/api/src/network_notifier.rs - Updated on_finished_ping method signature to use reference-based result type
transport/network/src/config.rs - Changed default ignore_timeframe from 4 minutes to 2 minutes
transport/network/src/constants.rs - Increased DEFAULT_MAX_PARALLEL_PINGS from 14 to 25
transport/network/src/ping.rs - Introduced custom NetworkingError type
- Updated error handling and result types
- Modified ping configuration and result handling
transport/protocol/src/heartbeat/config.rs - Reduced default heartbeat timeout from 15 to 6 seconds
transport/protocol/src/lib.rs - Updated error logging format for message processing
hoprd/rest-api/src/lib.rs - Added PingError(String) variant to ApiErrorStatus enum
hoprd/rest-api/Cargo.toml - Version updated from "3.11.0" to "3.11.1"
- Description modified for clarity
transport/api/Cargo.toml - Version updated from "0.8.2" to "0.8.3"
transport/network/Cargo.toml - Version updated from "0.6.4" to "0.7.0"

Possibly related PRs

Suggested Labels

bug, status:in-progress

Suggested Reviewers

  • NumberFour8

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 999f7d2 and 49fc2ba.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • hoprd/rest-api/Cargo.toml (1 hunks)
  • transport/api/Cargo.toml (1 hunks)
  • transport/network/Cargo.toml (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • transport/api/Cargo.toml
  • transport/network/Cargo.toml
  • hoprd/rest-api/Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: tests-unit
  • GitHub Check: tests-unit-nightly
  • GitHub Check: tests-smart-contracts
  • GitHub Check: tests-smoke-websocket
  • GitHub Check: hopli / docker
  • GitHub Check: hoprd / docker
  • GitHub Check: Docs / Rust docs
  • GitHub Check: Linter

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@Teebor-Choka Teebor-Choka requested review from jeandemeusy and a team January 20, 2025 11:14
@Teebor-Choka Teebor-Choka self-assigned this Jan 20, 2025
@Teebor-Choka Teebor-Choka added this to the 2.2.0 milestone Jan 20, 2025
@ausias-armesto ausias-armesto modified the milestones: 2.2.0, 2.2.1 Jan 20, 2025
@Teebor-Choka Teebor-Choka marked this pull request as ready for review January 20, 2025 11:18
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
transport/api/src/network_notifier.rs (2)

64-64: Consider preserving detailed error information

The parameter type change to core_network::errors::Result improves error handling, but the conversion to Result<Duration, ()> loses error details.

Consider propagating the specific error information to improve debugging capabilities.


67-70: Simplify error conversion logic

The match expression can be simplified while maintaining the same functionality.

-        let result = match &result {
-            Ok(duration) => Ok(*duration),
-            Err(_) => Err(()),
-        };
+        let result = result.map(|d| *d).map_err(|_| ());
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9bb870 and f93eca8.

📒 Files selected for processing (8)
  • hoprd/hoprd/example_cfg.yaml (2 hunks)
  • hoprd/rest-api/src/peers.rs (1 hunks)
  • transport/api/src/network_notifier.rs (1 hunks)
  • transport/network/src/config.rs (2 hunks)
  • transport/network/src/constants.rs (1 hunks)
  • transport/network/src/ping.rs (16 hunks)
  • transport/protocol/src/heartbeat/config.rs (1 hunks)
  • transport/protocol/src/lib.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: hopli / docker
  • GitHub Check: hoprd / docker
  • GitHub Check: Docs / Rust docs
  • GitHub Check: tests-smoke-websocket
  • GitHub Check: tests-unit
  • GitHub Check: tests-smart-contracts
  • GitHub Check: tests-unit-nightly
  • GitHub Check: Linter
🔇 Additional comments (8)
transport/protocol/src/heartbeat/config.rs (1)

13-13: Verify if 6 seconds is sufficient for network traversal

The reduction in timeout from 15s to 6s is significant. While this could make the system more responsive, we should verify that 6s is sufficient for:

  1. Network traversal across different geographical regions
  2. NAT traversal scenarios
  3. High-latency network conditions

Let's analyze the historical ping latencies to validate this change:

✅ Verification successful

6 second timeout is reasonable for heartbeat rounds

The timeout is specifically for heartbeat rounds, not individual pings. Individual ping operations maintain their own 30-second timeout for handling worst-case network scenarios. The heartbeat timeout can be more aggressive as it's a continuous health check where timeouts are expected and handled gracefully.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any logging of ping latencies or timeout errors
rg -A 2 "ping.*latency|timeout.*ping" 

Length of output: 8049

transport/network/src/constants.rs (1)

13-13: Consider resource implications of increased parallel pings

The increase from 14 to 25 parallel pings could significantly impact system resources. Please verify:

  1. Memory usage under peak load
  2. Network bandwidth consumption
  3. System stability with increased concurrent operations

Let's check for any existing resource constraints or related issues:

✅ Verification successful

Parallel ping limit increase is properly implemented and safe

The increase in DEFAULT_MAX_PARALLEL_PINGS is safe based on:

  • Implementation enforces the limit correctly via split_off mechanism
  • Covered by test cases verifying parallel behavior
  • Used for network health probing with proper timeout controls
  • No evidence of resource issues in the codebase
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for memory/resource related issues with parallel pings
rg -A 3 "out.*of.*memory|resource.*exhausted|too.*many.*parallel"

Length of output: 84751


Script:

#!/bin/bash
# Search for ping-related implementations and configurations
rg -g '!vendor' -A 3 -B 3 'parallel.*ping|ping.*limit|max.*ping'

# Look for ping implementation files
fd -g '!vendor' -e rs ping

# Search for ping-related tests
fd -g '!vendor' -e rs test.*ping

Length of output: 5176

hoprd/rest-api/src/peers.rs (1)

142-144: LGTM! Error handling improvement

The addition of explicit timeout error handling improves the API's error reporting capabilities, which aligns well with the PR objectives. The use of StatusCode::UNPROCESSABLE_ENTITY is appropriate for this scenario.

transport/protocol/src/lib.rs (1)

373-373: Improved error logging format.

The change enhances error visibility by explicitly labeling the error variable in the structured logging format, making debugging easier.

transport/network/src/ping.rs (2)

139-159: Enhanced error handling with NetworkingError.

The error handling has been improved by:

  1. Using specific NetworkingError types instead of generic errors
  2. Adding detailed error messages for different failure scenarios
  3. Properly categorizing timeout vs internal errors

63-63: Verify impact of increased parallel pings.

The default max_parallel_pings has been increased from 14 to 25, which could impact system resources in larger networks.

Run the following script to check for any resource-related issues in the codebase:

✅ Verification successful

Parallel pings increase is safe and within system limits

The increase from 14 to 25 parallel pings is well within the established system limits, as the codebase already handles much higher concurrency:

  • libp2p concurrent dial limit: 255
  • Max concurrent streams: 1024
  • Max negotiating inbound streams: 255
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for resource management and performance considerations.

# Search for resource management patterns
rg -A 3 "max_parallel|concurrent|resource|memory|cpu"

# Look for performance-related comments or issues
rg -i "performance|optimization|bottleneck|resource"

Length of output: 74457

transport/network/src/config.rs (1)

51-52: Consider impact of reduced ignore timeframe.

The ignore_timeframe has been reduced from 4 minutes to 2 minutes. While this makes the system more responsive to network changes, it could lead to increased network traffic due to more frequent retries.

Run the following script to analyze the usage of ignore_timeframe in the codebase:

Also applies to: 164-165

hoprd/hoprd/example_cfg.yaml (1)

204-204: Configuration aligns with code changes.

The configuration values have been updated to match the code changes:

  1. ignore_timeframe reduced to 120 seconds (matching the 2-minute default in config.rs)
  2. heartbeat timeout reduced to 6 seconds for faster network responsiveness

Also applies to: 227-227

@Teebor-Choka Teebor-Choka force-pushed the kauki/bug/fix-network-related-issues branch from f93eca8 to 1a764a6 Compare January 20, 2025 11:22
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
hoprd/rest-api/src/lib.rs (1)

405-405: LGTM! Consider adding documentation.

The addition of PingError(String) variant aligns well with the PR objectives to improve error output for ping operations. The implementation follows Rust best practices and integrates seamlessly with the existing error handling infrastructure.

Consider adding documentation for the new variant to explain when this error occurs and what the String parameter represents. Example:

 enum ApiErrorStatus {
     // ...
+    /// Represents errors that occur during ping operations.
+    /// The String parameter contains detailed information about the specific ping failure.
     PingError(String),
     // ...
 }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f93eca8 and 1a764a6.

📒 Files selected for processing (2)
  • hoprd/rest-api/src/lib.rs (1 hunks)
  • hoprd/rest-api/src/peers.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • hoprd/rest-api/src/peers.rs
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: tests-smoke-websocket
  • GitHub Check: tests-unit-nightly
  • GitHub Check: tests-smart-contracts
  • GitHub Check: tests-unit
  • GitHub Check: hoprd / docker
  • GitHub Check: hopli / docker
  • GitHub Check: Docs / Rust docs
  • GitHub Check: Add PR labels
  • GitHub Check: Linter

@Teebor-Choka Teebor-Choka added the package:DAppNodePackage-Hopr Build dAppNode package for dufour label Jan 20, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
hoprd/hoprd/example_cfg.yaml (1)

Line range hint 180-227: Monitor network stability after deploying these aggressive timing changes.

The combined effect of these configuration changes (increased parallel probes, reduced ignore timeframe, and shorter timeout) represents a more aggressive network management strategy. While this should improve responsiveness in larger networks, it's important to:

  1. Monitor false disconnections in slower network segments
  2. Track resource utilization under peak load
  3. Observe reconnection patterns for potentially problematic nodes

Consider implementing gradual rollout or A/B testing to validate these changes in production.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a764a6 and 8693567.

📒 Files selected for processing (1)
  • hoprd/hoprd/example_cfg.yaml (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: hoprd / docker
  • GitHub Check: hopli / docker
  • GitHub Check: tests-unit-nightly
  • GitHub Check: Docs / Rust docs
  • GitHub Check: tests-smoke-websocket
  • GitHub Check: tests-unit
  • GitHub Check: Linter
  • GitHub Check: tests-smart-contracts
🔇 Additional comments (3)
hoprd/hoprd/example_cfg.yaml (3)

180-180: Consider network size and resource implications of increased parallel probes.

Increasing max_parallel_probes from 14 to 25 will help cover more peers in larger networks during each heartbeat round. However, this also means:

  • Higher concurrent network connections
  • Increased memory usage
  • More CPU utilization for probe processing

Please verify resource utilization under peak load with this configuration:

✅ Verification successful

Change aligns with codebase defaults and implementation constraints

The increase to 25 parallel probes matches the default value defined in transport/network/src/constants.rs. The implementation includes validation to prevent negative values and shows no hard resource limits that would be impacted by this change.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for resource monitoring or performance testing code
rg -A 5 "resource|performance|memory|cpu" --type rust

Length of output: 109558


Script:

#!/bin/bash
# Search for probe implementation and related code
rg -A 5 "parallel.*probe|probe.*parallel|max_parallel_probes" --type rust

# Search for probe-related tests and documentation
rg -A 5 "test.*probe|probe.*test" --type rust

# Search for any resource limits or constraints in probe code
rg -A 5 "limit.*probe|probe.*limit|max.*probe|probe.*max" --type rust

Length of output: 21042


227-227: Evaluate timeout reduction impact on slower network segments.

Decreasing the heartbeat timeout from 15s to 6s will speed up failure detection, but:

  • May cause false negatives in networks with higher latency
  • Could lead to unnecessary peer disconnections in congested conditions

Let's analyze the timeout handling implementation:

✅ Verification successful

Timeout reduction is safe due to layered quality assessment

The reduction of heartbeat timeout from 15s to 6s is safe because:

  • Individual ping failures don't immediately impact peer quality (uses 25-sample moving average)
  • Quality degradation is gradual (0.1 step per failure)
  • Network has built-in resilience with backoff mechanisms
  • Multiple successful pings are required to significantly change peer quality
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for timeout handling code
ast-grep --pattern 'timeout|deadline|expired'

Length of output: 186


Script:

#!/bin/bash
# Search for timeout configuration usage
rg -g '!vendor' -g '!target' -A 5 'timeout:\s+\d+' 

# Search for heartbeat-related code
rg -g '!vendor' -g '!target' -A 10 'heartbeat|peer.*connection|peer.*timeout'

# Find all yaml config files for comparison
fd -e yaml -e yml --exec head -n 50

Length of output: 66036


Script:

#!/bin/bash
# Search for network quality configurations and thresholds
rg -g '!vendor' -g '!target' -A 5 'quality.*threshold|network.*quality|minimum.*quality'

# Look for any network-related configuration defaults
rg -g '!vendor' -g '!target' -A 5 'DEFAULT_NETWORK|network.*config'

Length of output: 34401


204-204: Verify impact of reduced ignore timeframe on network stability.

Reducing ignore_timeframe from 240s to 120s means nodes will be retried more frequently after failures. While this can help recover from temporary issues faster, consider:

  • Could lead to more frequent connection attempts to genuinely problematic nodes
  • May increase network chatter in unstable environments

Let's check if there are any existing retry limits or backoff mechanisms:

✅ Verification successful

Reduced ignore timeframe is safe due to existing protective measures

The change from 240s to 120s is safe because the system implements:

  • Exponential backoff (exponent: 1.5) with minimum delay of 2.0s
  • Quality tracking with 25-sample moving average window
  • Proper peer status management with connection rate limiting
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for backoff or retry logic in the codebase
ast-grep --pattern 'backoff|retry|attempt' 

Length of output: 183


Script:

#!/bin/bash
# Search for ignore_timeframe usage and related connection handling
rg -i "ignore_timeframe" -A 5 -B 5

# Look for connection handling patterns
ast-grep --pattern 'fn $_(connection|peer|node|retry|backoff)' 

# Check for documentation
fd README.md --exec cat {}

Length of output: 6833

Copy link
Contributor

The latest dAppNode package for DAppNodePackage-Hopr has been built and uploaded with IPFS hash: /ipfs/QmSwz6agoirRMGi6Ja66f7wWqEajkxRJhUX19Z4SqXkbbY

Copy link
Contributor

The latest dAppNode package for DAppNodePackage-Hopr has been built and uploaded with IPFS hash: /ipfs/QmZNuTZiyxnnk1txCCCxjg3X6xDWDUDBLpH8XqwprNC91P

@Teebor-Choka Teebor-Choka enabled auto-merge (squash) January 20, 2025 17:24
@ausias-armesto ausias-armesto marked this pull request as draft January 20, 2025 18:03
auto-merge was automatically disabled January 20, 2025 18:03

Pull request was converted to draft

@ausias-armesto ausias-armesto marked this pull request as ready for review January 20, 2025 18:03
Copy link
Contributor

@tolbrino tolbrino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Crate version bumps missing

Copy link
Contributor

The latest dAppNode package for DAppNodePackage-Hopr has been built and uploaded with IPFS hash: /ipfs/QmSVp66j8mZz9LyLxaMR7BUJbYyg8CCHTVZ9UGvEAP8XH9

@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Jan 20, 2025
Copy link
Contributor

The latest dAppNode package for DAppNodePackage-Hopr has been built and uploaded with IPFS hash: /ipfs/QmRrZu7VaSABNkkFuFLqfuik5Dv52mmmsacrCj8Aau1ccp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2.2.0 if we ping a node that is currently offline, we get a 422 error with UNKNOWN_FAILURE instead of TIMEOUT
3 participants