
Fix the network heartbeat behavior to be more robust and work with /node/peers #6583


Merged

Conversation

Teebor-Choka
Contributor

Fixes an inconsistency in network behavior caused by improperly ignoring network peers and by closing peer connections for offline peers.

  • Remove overriding variables
  • Fix the network behavior
  • Fix the /node/peers output

Notes

Fixes #6529

@Teebor-Choka Teebor-Choka added this to the 2.2.0-rc.1 milestone Oct 30, 2024
@Teebor-Choka Teebor-Choka requested a review from a team October 30, 2024 11:51
@Teebor-Choka Teebor-Choka self-assigned this Oct 30, 2024
@Teebor-Choka Teebor-Choka marked this pull request as ready for review October 30, 2024 11:51

coderabbitai bot commented Oct 30, 2024

📝 Walkthrough


The changes in this pull request involve updates to documentation and configuration files for the HOPR project. The README.md file has been revised for clarity and expanded installation instructions. Configuration files, including example_cfg.yaml, tests/barebone-lower-win-prob.cfg.yaml, and others, have introduced new parameters related to heartbeat functionality and adjusted network quality thresholds. Additionally, modifications in Rust source files have refined peer management and session handling in the transport layer, enhancing the overall clarity and functionality of the codebase.

Changes

File Path Change Summary
README.md Updated introduction, revised "About" section, expanded installation instructions for Docker and Nix, added new environment variables, corrected example execution command, added "Using Docker Compose" section, and rephrased "Testnet accessibility" and "Migrating between releases" sections.
hoprd/hoprd/example_cfg.yaml Updated quality_bad_threshold from 0.2 to 0.1 and quality_offline_threshold from 0.5 to 0.0 in network_options.
hoprd/rest-api/src/node.rs Refactored quality check logic in peers function, enhancing readability and streamlining peer information handling.
tests/barebone-lower-win-prob.cfg.yaml Added heartbeat configuration with interval: 10, threshold: 10, variance: 1, and ignore_timeframe: 5 in network_options.
tests/barebone.cfg.yaml Added heartbeat configuration with interval: 10, threshold: 10, variance: 1, and ignore_timeframe: 5 in network_options.
tests/default.cfg.yaml Added heartbeat configuration with interval: 10, threshold: 10, variance: 1, and ignore_timeframe: 5 in network_options.
tests/node.py Removed environment variable assignments for heartbeat configuration from custom_env in setup method.
transport/api/src/lib.rs Updated network_connected_peers and network_peer_info methods to simplify logic and enhance error handling in session management.
transport/network/src/config.rs Updated DEFAULT_NETWORK_OFFLINE_QUALITY_THRESHOLD from 0.5 to 0.0 and DEFAULT_NETWORK_BAD_QUALITY_THRESHOLD from 0.2 to 0.1, and modified validation logic in NetworkConfig.
transport/network/src/network.rs Simplified initialization logic in new method, updated conditions in update method, added connected_peers method, and changed visibility of peer_filter method.

Assessment against linked issues

Objective Addressed Explanation
Ensure E2E tests pass consistently in CI (#6529)
Address potential race conditions in on-announcement ping (#6529)

Suggested labels

bug, status:in-progress

Suggested reviewers

  • Teebor-Choka
  • NumberFour8


@Teebor-Choka Teebor-Choka force-pushed the kauki/tests/fix-overriding-variables-in-the-node-setup branch from ecec8a0 to 8bad068 on October 30, 2024 12:12

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Outside diff range comments (3)
hoprd/rest-api/src/node.rs (2)

Line range hint 182-191: Consider adding timeouts for network operations.

To improve network robustness, consider adding timeouts for the peer info and address resolution operations. This would prevent potential hangs when network peers are unresponsive.

use std::time::Duration;
use tokio::time::timeout;

// Example timeout duration
const PEER_INFO_TIMEOUT: Duration = Duration::from_secs(5);

// Abort the lookup instead of hanging on an unresponsive peer; a timeout
// (`Err(Elapsed)`) or an absent peer simply skips the block. (Note that
// `.transpose()` does not type-check on the `Result<_, Elapsed>` returned
// by `timeout`, so the result is matched directly.)
if let Ok(Some(info)) = timeout(PEER_INFO_TIMEOUT, hopr.network_peer_info(&peer)).await {
    // ... existing code ...
}

Line range hint 193-204: Consider enhancing HeartbeatInfo for better diagnostics.

To support better network diagnostics, consider expanding the HeartbeatInfo structure to include additional metrics like:

  • Last heartbeat attempt timestamp
  • Average response time
  • Failure reasons

This would help diagnose network issues mentioned in issue #6529.

 #[derive(Debug, Default, Clone, Serialize, utoipa::ToSchema)]
 pub(crate) struct HeartbeatInfo {
     sent: u64,
     success: u64,
+    last_attempt: Option<u128>,
+    avg_response_time: Option<u64>,
+    last_failure_reason: Option<String>,
 }
README.md (1)

Line range hint 214-216: Fix typos and network name inconsistency.

There are a few issues in this section:

  1. "enabels" should be "enables"
  2. "doufur" should be "dufour" to match the network name used in other sections
-A basic reasonable setup is that uses a custom identity and enabels a REST API of the `hoprd` could look like:
+A basic reasonable setup that uses a custom identity and enables a REST API of the `hoprd` could look like:

-hoprd --identity /app/hoprd-db/.hopr-identity --password switzerland --init --announce --host "0.0.0.0:9091" --apiToken <MY_TOKEN> --network doufur
+hoprd --identity /app/hoprd-db/.hopr-identity --password switzerland --init --announce --host "0.0.0.0:9091" --apiToken <MY_TOKEN> --network dufour
🧰 Tools
🪛 LanguageTool

[style] ~198-~198: Consider a shorter alternative to avoid wordiness.
Context: ...owing environment variables can be used in order to tweak the node functionality: - `HOPRD...

(IN_ORDER_TO_PREMIUM)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: ASSERTIVE

📥 Commits

Files that changed from the base of the PR and between 9430c2f and 8bad068.

📒 Files selected for processing (10)
  • README.md (1 hunks)
  • hoprd/hoprd/example_cfg.yaml (1 hunks)
  • hoprd/rest-api/src/node.rs (1 hunks)
  • tests/barebone-lower-win-prob.cfg.yaml (1 hunks)
  • tests/barebone.cfg.yaml (1 hunks)
  • tests/default.cfg.yaml (1 hunks)
  • tests/node.py (0 hunks)
  • transport/api/src/lib.rs (1 hunks)
  • transport/network/src/config.rs (2 hunks)
  • transport/network/src/network.rs (4 hunks)
💤 Files with no reviewable changes (1)
  • tests/node.py

🔇 Additional comments (15)
tests/barebone.cfg.yaml (2)

3-7: 🧹 Nitpick (assertive)

Consider adjusting heartbeat threshold for better stability.

While the frequent heartbeat interval (10) is good for testing, having the threshold equal to the interval (10) provides no buffer for temporary network instability. Consider increasing the threshold to provide more resilience against transient failures.

A common pattern is to set the threshold to 1.5-2x the interval. For example:

  heartbeat:
    interval: 10
-   threshold: 10
+   threshold: 15
    variance: 1

8-10: Verify if ignore_timeframe is sufficient for test stability.

The ignore_timeframe (5) being shorter than the heartbeat interval (10) could lead to premature reconnection attempts. This might contribute to the race conditions mentioned in issue #6529.

Run the following to check if other test configurations use different ratios:

✅ Verification successful

Let me analyze the test configurations further to understand the relationship between these timing parameters and their usage in the codebase.


The ignore_timeframe setting is appropriate for test configurations

The verification reveals that:

  • Test configurations consistently use a 5-second ignore_timeframe with a 10-second heartbeat interval
  • Production configurations (in hoprd/example_cfg.yaml and local-cluster) use a much larger 600-second ignore_timeframe
  • The codebase implementation in transport/network/src/network.rs shows that ignore_timeframe is used as a cooldown period before reconnection attempts

The shorter ignore_timeframe in test configurations is intentional and appropriate because:

  1. Tests need faster feedback cycles than production
  2. The implementation properly handles the timing relationship, using saturating subtraction to prevent underflow
  3. The race conditions mentioned in issue #6529 are unrelated to this timing ratio, as the heartbeat and ignore mechanisms work independently
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Compare heartbeat and ignore_timeframe settings across test configs
# Expected: Find configurations that might provide insights into stable ratios

echo "Searching for heartbeat and ignore_timeframe configurations in test files..."
rg -A 10 'heartbeat:|ignore_timeframe:' tests/

Length of output: 2477


Script:

#!/bin/bash
# Search for code that handles reconnection logic related to these timing parameters
ast-grep --pattern 'ignore_timeframe' 

# Also search for any comments or code discussing these timing parameters
rg -g '!*.{yaml,yml}' -B 2 -A 2 'ignore_timeframe|heartbeat.*interval'

Length of output: 6284
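The saturating-subtraction cooldown described in the verification above can be sketched as follows (function and parameter names are hypothetical; the actual logic lives in transport/network/src/network.rs):

```python
# Hypothetical sketch of the ignore-timeframe cooldown: a peer placed on the
# ignore list is skipped until `ignore_timeframe` seconds have elapsed.
# Saturating subtraction keeps the elapsed time non-negative even if the
# clock moves backwards.

def is_still_ignored(ignored_at: float, now: float, ignore_timeframe: float) -> bool:
    elapsed = max(0.0, now - ignored_at)  # saturating subtraction
    return elapsed < ignore_timeframe

# With the test configuration (ignore_timeframe: 5), a peer ignored at t=0
# is still skipped at t=3 but eligible for reconnection again at t=6.
```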

tests/barebone-lower-win-prob.cfg.yaml (2)

8-10: 🧹 Nitpick (assertive)

Verify ignore_timeframe in relation to heartbeat cycle.

The ignore_timeframe: 5 setting is documented as being "shorter than heartbeat cycle" (10s), which aligns with addressing potential race conditions in peer announcements.

However, we should ensure this value:

  1. Is sufficient to prevent false negatives in peer connectivity
  2. Doesn't mask actual connectivity issues

Let's verify the impact on peer connectivity:

#!/bin/bash
# Description: Check for any hardcoded timeouts or related timing constants
# that might conflict with this ignore_timeframe

echo "Checking for related timing constants:"
rg -A 2 "timeout|delay|interval" -g '*.rs' src/

Consider documenting:

  1. The rationale behind choosing 5 seconds
  2. How this value interacts with the heartbeat interval
  3. Any minimum/maximum bounds for this value

3-7: 🧹 Nitpick (assertive)

Verify heartbeat timing parameters against network latency.

The heartbeat configuration introduces frequent checks with:

  • 10-second interval
  • 10-second threshold
  • 1-second variance

While these values might help address the E2E test failures by increasing responsiveness, we should verify if they're appropriate for CI environment network conditions.

Let's check other config files for consistency:

Consider adding comments explaining:

  1. Why these specific timing values were chosen
  2. How they relate to the CI environment requirements
  3. The relationship between interval and threshold
✅ Verification successful

Heartbeat configuration values are consistent with other test configurations

The heartbeat settings in tests/barebone-lower-win-prob.cfg.yaml match the values used in other test configuration files:

  • Multiple test configs use identical values (interval: 10, threshold: 10, variance: 1)
  • These aggressive timing values appear to be specifically chosen for test environments
  • The production example config uses more conservative values (interval: 60)

The chosen values are appropriate for test scenarios where we want quick failure detection and faster test execution. The consistency across test configurations suggests these are well-established values for the test environment.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check heartbeat settings across all config files
# Expected: Similar heartbeat settings in related test configs

echo "Checking heartbeat settings in all config files:"
rg -A 3 "heartbeat:" "tests/*.yaml" "example*.yaml"

Length of output: 286


Script:

#!/bin/bash
# Let's try with fd to find yaml files first and then search in them
echo "Finding all yaml/yml files:"
fd -e yaml -e yml

echo -e "\nChecking heartbeat settings in these files:"
fd -e yaml -e yml -x rg -A 3 "heartbeat:"

Length of output: 5807

tests/default.cfg.yaml (2)

3-7: 🧹 Nitpick (assertive)

Review heartbeat threshold configuration.

The heartbeat threshold being equal to the interval (both set to 10) might be too strict. If a single heartbeat is delayed or missed, it could immediately trigger unwanted peer disconnections. Consider increasing the threshold to allow for some network latency and jitter.

Suggested approach:

  • Set threshold to at least 2-3 times the interval (e.g., threshold: 20 or 30)
  • This provides better tolerance for temporary network issues while still detecting genuinely offline peers
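Applied to this file, the suggestion above would look like (the threshold value is illustrative):

```yaml
heartbeat:
  interval: 10
  threshold: 30   # 3x the interval: tolerates a couple of delayed heartbeats
  variance: 1
```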

8-10: 🧹 Nitpick (assertive)

Verify ignore_timeframe value relative to heartbeat cycle.

The ignore_timeframe (5) is set to half the heartbeat interval (10). This could lead to race conditions where legitimate network events are ignored if they occur just before a heartbeat check.

Let's check if other configuration files maintain consistent timing relationships:

Consider:

  • Setting ignore_timeframe closer to the heartbeat interval (e.g., 8-9)
  • Or adding documentation explaining why 5 is the optimal value for preventing the race conditions mentioned in issue #6529
transport/network/src/config.rs (1)

10-11: Verify the impact of reduced quality thresholds on network stability.

The significant reduction in quality thresholds (OFFLINE: 0.5→0.0, BAD_QUALITY: 0.2→0.1) makes the network more lenient in handling poor connections. While this may help with test stability, please verify that:

  1. These thresholds don't mask actual network issues
  2. The network remains stable with potentially more poor-quality connections
✅ Verification successful

Let's examine the usage of these thresholds to better understand their impact.


Let's examine the network quality handling logic more closely.


Based on the evidence gathered, I can now provide a final response about the quality threshold changes:

The reduced quality thresholds are safe and align with the network's behavior

The changes to quality thresholds (OFFLINE: 0.5→0.0, BAD_QUALITY: 0.2→0.1) are well-justified:

  1. The network code shows that connections are only closed when quality <= offline_threshold (0.0), meaning nodes maintain connections as long as there's any measurable quality
  2. The bad_quality threshold (0.1) is used for metrics and peer stats, not for connection decisions
  3. Test cases demonstrate the network remains healthy even with low-quality connections
  4. The thresholds are consistently used across the codebase with proper validation

These changes make the network more resilient by maintaining connections longer while still monitoring connection quality through metrics.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test files that verify network stability
rg -l "network.*stability|quality.*threshold" --type rust

Length of output: 282


Script:

#!/bin/bash
# Check how these thresholds are used in network quality decisions
rg -A 5 "DEFAULT_NETWORK_.*QUALITY_THRESHOLD" --type rust

# Look for quality-related tests
rg -B 2 -A 5 "test.*quality" --type rust

# Search for network quality assessment logic
ast-grep --pattern 'fn $_(quality: $_, $$$) { $$$ }'

Length of output: 23546


Script:

#!/bin/bash
# Check how network quality affects connection decisions
rg -B 2 -A 5 "quality.*threshold" transport/network/src/network.rs

# Look for quality-related state handling
ast-grep --pattern 'fn $_(quality: f64, $$$) { $$$ }'

# Check for quality validation logic
rg -B 2 -A 5 "validate.*quality" --type rust

Length of output: 3237
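The decision logic summarized in the verification above can be sketched as a simplified model (not the actual Rust implementation):

```python
# Simplified model of how the two thresholds are used: the offline threshold
# gates connection closure, while the bad-quality threshold only affects
# metrics and peer stats.
OFFLINE_THRESHOLD = 0.0
BAD_QUALITY_THRESHOLD = 0.1

def peer_action(quality: float) -> str:
    if quality <= OFFLINE_THRESHOLD:
        return "close-connection"   # only truly dead peers are dropped
    if quality <= BAD_QUALITY_THRESHOLD:
        return "flag-bad-quality"   # reported in metrics, connection kept
    return "keep"
```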

hoprd/hoprd/example_cfg.yaml (2)

189-189: LGTM: Reasonable adjustment to bad quality threshold.

The reduction of quality_bad_threshold from 0.2 to 0.1 provides a more lenient threshold for marking node connectivity as "bad", which aligns with making the network behavior more robust while still maintaining reasonable quality standards.


191-191: Verify impact of aggressive offline threshold.

While setting quality_offline_threshold to 0.0 helps quickly identify truly offline nodes and may improve E2E test reliability, it's worth verifying that this aggressive threshold doesn't cause issues in production environments with temporary network instability.

The surrounding configuration provides safeguards:

  • quality_step: 0.1 ensures gradual quality adjustments
  • quality_avg_window_size: 25 provides smoothing effect
  • Backoff parameters allow for gradual reconnection attempts
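A rough sketch of how the step and averaging-window safeguards interact (class and method names are hypothetical; the actual behavior is defined in the transport/network crate):

```python
from collections import deque

# Hypothetical model: quality moves by at most `step` per heartbeat result,
# and an averaging window smooths out short bursts of failures.
class QualityModel:
    def __init__(self, step: float = 0.1, window: int = 25):
        self.step = step
        self.quality = 0.0
        self.samples: deque = deque(maxlen=window)

    def record(self, success: bool) -> None:
        delta = self.step if success else -self.step
        self.quality = min(1.0, max(0.0, self.quality + delta))
        self.samples.append(self.quality)

    def smoothed(self) -> float:
        return sum(self.samples) / len(self.samples)
```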

Let's verify the E2E test improvements with these new thresholds:

hoprd/rest-api/src/node.rs (1)

174-175: LGTM! Good improvement to code readability.

Extracting the quality calculation into a variable improves code clarity and prevents potential double calculation.

README.md (1)

Line range hint 201-206: Document recommended values for libp2p connection limits.

Given that this PR addresses network stability and E2E test failures (issue #6529), it would be helpful to document recommended values for:

  • HOPR_INTERNAL_LIBP2P_MAX_CONCURRENTLY_DIALED_PEER_COUNT
  • HOPR_INTERNAL_LIBP2P_MAX_NEGOTIATING_INBOUND_STREAM_COUNT

These limits could help prevent race conditions during peer connections.

Let's check if there are any recommended values in the codebase:

✅ Verification successful

Let me search for the default constants mentioned in the code:


Document default values for libp2p connection limits

The environment variables have default values set in the codebase:

  • HOPR_INTERNAL_LIBP2P_MAX_CONCURRENTLY_DIALED_PEER_COUNT: defaults to 255
  • HOPR_INTERNAL_LIBP2P_MAX_NEGOTIATING_INBOUND_STREAM_COUNT: defaults to 255

Consider documenting these default values in the README to help users understand the baseline configuration.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for default values or documentation of these variables
rg -A 3 "HOPR_INTERNAL_LIBP2P_MAX"

Length of output: 1204


Script:

#!/bin/bash
# Search for the default constants
rg "HOPR_SWARM_CONCURRENTLY" transport/p2p/src/constants.rs

Length of output: 215


transport/network/src/network.rs (3)

240-241: LGTM: Improved peer quality threshold handling

The simplified logic for peer quality thresholds makes the code more robust and less prone to race conditions. The changes align well with the PR objectives to fix network heartbeat behavior.

Also applies to: 254-258


292-296: LGTM: Well-implemented connected peers retrieval

The new connected_peers method effectively filters peers based on the quality threshold, supporting the PR's goal to fix the /node/peers endpoint. The implementation is concise and leverages existing functionality.


299-299: LGTM: Appropriate visibility restriction

Reducing the visibility of peer_filter to pub(crate) is a good practice as it minimizes the public API surface while maintaining necessary access within the crate.

transport/api/src/lib.rs (1)

1269-1269: LGTM!

The method simplification enhances readability and aligns with the updated network interaction logic.

@Teebor-Choka Teebor-Choka merged commit 8e68cec into master Oct 30, 2024
26 checks passed
@Teebor-Choka Teebor-Choka deleted the kauki/tests/fix-overriding-variables-in-the-node-setup branch October 30, 2024 13:08

Successfully merging this pull request may close these issues.

Ping in E2E tests failing consistently
3 participants