
Conversation

Contributor

@tolbrino tolbrino commented Nov 26, 2024

The indexer hot path crashed while processing logs. The hot path has been improved to perform fewer DB operations and to be more performant.

Moreover, fast-sync resume has been added. Previously, if the node crashed during fast-sync, it would end up with an incomplete index and never complete it.

Lastly, the HTTP client has been configured to accept brotli and zstd content encodings, drastically reducing the amount of transmitted data, and to keep TCP connections open for 30 seconds.
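For illustration, a minimal sketch of such a client configuration using reqwest (assuming the `brotli` and `zstd` features are enabled; this is not the exact builder code from this PR):

```rust
use std::time::Duration;

// Hypothetical helper; the actual builder lives in chain/rpc/src/client.rs.
fn build_rpc_http_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Keep idle TCP connections alive so subsequent RPC calls can reuse them.
        .tcp_keepalive(Some(Duration::from_secs(30)))
        // Accept compressed responses to cut down on transferred bytes.
        .brotli(true)
        .zstd(true)
        .build()
}
```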

Fixes #6665
Fixes #6664

Contributor

coderabbitai bot commented Nov 26, 2024

📝 Walkthrough

Walkthrough

The changes in this pull request include updates to documentation in README.md, version increments in multiple Cargo.toml files, and modifications to several Rust source files. Key updates involve enhancements to the fast sync process in the block.rs file, improved error handling and logging in the indexer.rs file, and changes to method signatures in the logs.rs files. These updates aim to clarify operational logic, improve functionality, and enhance error reporting across the HOPR project.

Changes

| File | Change Summary |
| --- | --- |
| README.md | Updated "Using Fast Sync" section to clarify behavior regarding index data during fast sync. |
| chain/indexer/Cargo.toml | Version updated from 0.5.3 to 0.6.0. |
| chain/indexer/src/block.rs | Enhanced `Indexer` struct methods for error handling, added parameters for checksum fetching, refined fast sync logic. |
| chain/rpc/Cargo.toml | Version updated from 0.3.1 to 0.3.2. |
| chain/rpc/src/indexer.rs | Improved error handling and logging in `stream_logs` and `try_stream_logs` methods. |
| db/api/Cargo.toml | Version updated from 0.5.2 to 0.5.3. |
| db/api/src/logs.rs | Added `processed` parameter to `get_logs_block_numbers`, changed return type of `update_logs_checksums`. |
| db/sql/Cargo.toml | Version updated from 0.8.3 to 0.8.4. |
| db/sql/src/logs.rs | Updated `get_logs_block_numbers` and `update_logs_checksums` methods to include new parameters and return types. |
| hoprd/hoprd/src/main.rs | Modified Tokio runtime configuration and logging setup; enhanced signal handling. |

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Using fast sync results in an incomplete index (#6665) | ✅ | |
| Nodes using the same RPC endpoint end up with different index data (#6664) | ❌ | No changes related to RPC endpoint behavior. |

Possibly related PRs

Suggested labels

bug, improvement, crate:hoprd-api

Suggested reviewers

  • NumberFour8


@tolbrino tolbrino changed the title from "Tb/20241125 stack size 2" to "Fix tokio stack overflow crash" on Nov 26, 2024
@github-actions github-actions bot added the "dependencies" label (Pull requests that update a dependency file) on Nov 26, 2024
@tolbrino tolbrino force-pushed the tb/20241125-stack-size-2 branch from 5eee8a1 to 390ff64 on November 26, 2024 19:30
@tolbrino tolbrino marked this pull request as ready for review November 26, 2024 19:30
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (9)
hoprd/hoprd/src/main.rs (1)

163-166: LGTM! Consider adding monitoring and documentation.

The change from thread stack size to max blocking threads configuration should help resolve the stack overflow crash by providing more threads for blocking operations. The increase to 1024 threads (from default 512) aligns with the throughput requirements discussed in the past review comments.

Recommendations:

  1. Monitor thread usage and system resources to ensure the increased thread count doesn't cause resource pressure.
  2. Consider adding a comment explaining why 1024 was chosen as the optimal value.

Add a comment explaining the rationale:

```diff
 let res = tokio::runtime::Builder::new_multi_thread()
     .enable_all()
+    // Increased from default 512 to handle more concurrent blocking operations
+    // and prevent stack overflow crashes in the indexer hot-path
     .max_blocking_threads(1024)
     .build()
```
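For context, `max_blocking_threads` bounds the pool used by `tokio::task::spawn_blocking` (which also backs blocking file and DB operations). A minimal, self-contained illustration of the builder setting follows; it is not the actual hoprd startup code:

```rust
fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        // Raise the cap on the blocking-thread pool (Tokio's default is 512).
        .max_blocking_threads(1024)
        .build()
        .expect("failed to build runtime");

    rt.block_on(async {
        // Work offloaded via spawn_blocking runs on the pool bounded above.
        let sum = tokio::task::spawn_blocking(|| (1u64..=1_000).sum::<u64>())
            .await
            .expect("blocking task panicked");
        println!("sum = {sum}");
    });
}
```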
chain/rpc/src/indexer.rs (5)

Line range hint 73-83: Enhance error context when fetching logs fails

The current error logging provides the block range but could include additional context such as the specific error code or message to aid in debugging.

Consider updating the error log to include the error code:

```diff
 error!(
     from = ?subrange.get_from_block(),
     to = ?subrange.get_to_block(),
+    error_code = ?e.code(),
     error = %e,
     "failed to fetch logs in block subrange"
 );
```

Line range hint 101-114: Avoid panicking when the start block number exceeds the latest block number

Panicking within the try_stream_logs method can cause the entire application to crash unexpectedly. It's recommended to handle this scenario by returning a structured error so that the calling code can decide how to handle it gracefully.

Replace the panic with an error return:

- panic!("indexer start block number {from_block} is greater than the chain latest block number {latest_block} (diff {past_diff}) =>
- possible causes: chain reorg, RPC provider out of sync, corrupted DB =>
- possible solutions: change the RPC provider, reinitialize the DB");
+ return Err(RpcError::IndexerStartBlockTooHigh {
+     from_block,
+     latest_block,
+     past_diff,
+     message: "Possible causes include chain reorg, RPC provider out of sync, or corrupted DB. Possible solutions: change the RPC provider or reinitialize the DB.".to_string(),
+ });

Ensure that RpcError includes a variant IndexerStartBlockTooHigh to accommodate this error.


Line range hint 155-167: Handle unrecoverable RPC errors without panicking

Using panic! after exceeding MAX_LOOP_FAILURES can abruptly terminate the application. Instead, return an error to allow higher-level code to handle the situation appropriately.

Modify the code to return an error:

- panic!("!!! Cannot advance the chain indexing due to unrecoverable RPC errors.
-
- The RPC provider does not seem to be working correctly.
-
- The last encountered error was: {e}");
+ return Err(RpcError::UnrecoverableRpcError {
+     message: "Cannot advance the chain indexing due to unrecoverable RPC errors. The RPC provider may not be working correctly.".to_string(),
+     last_error: Box::new(e),
+ });

Update RpcError to include an UnrecoverableRpcError variant if it doesn't already exist.
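Taken together, the two suggested variants could look roughly like the following, assuming `RpcError` is (or can be made) a `thiserror` enum; the field names, types, and messages are illustrative and not the crate's actual definitions:

```rust
use thiserror::Error;

#[derive(Debug, Error)]
pub enum RpcError {
    // ...existing variants elided...
    #[error("indexer start block {from_block} exceeds the latest chain block {latest_block} (diff {past_diff}): {message}")]
    IndexerStartBlockTooHigh {
        from_block: u64,
        latest_block: u64,
        past_diff: u64,
        message: String,
    },

    #[error("cannot advance chain indexing due to unrecoverable RPC errors: {message}; last error: {last_error}")]
    UnrecoverableRpcError {
        message: String,
        // Kept as a plain string here to stay independent of the concrete provider error type.
        last_error: String,
    },
}
```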


Line range hint 200-202: Check for potential integer overflow when calculating delay

In the line where the delay is calculated:

```rust
futures_timer::Delay::new(past_diff as u32 * self.cfg.expected_block_time / 3).await;
```

Casting past_diff to u32 might cause an overflow if past_diff exceeds u32::MAX. This can happen if the block difference is large.

Consider using u64 for the calculation to prevent overflow:

```diff
- futures_timer::Delay::new(past_diff as u32 * self.cfg.expected_block_time / 3).await;
+ futures_timer::Delay::new(Duration::from_secs(past_diff * self.cfg.expected_block_time.as_secs() / 3)).await;
```
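Alternatively, if staying with `futures_timer`, the overflow can be avoided without switching to whole-second arithmetic by saturating the multiplication. A sketch, assuming `past_diff` is a `u64` as above:

```rust
use std::time::Duration;

// Hypothetical helper mirroring the delay computation above.
fn indexing_delay(past_diff: u64, expected_block_time: Duration) -> Duration {
    // Cap the factor and saturate the multiplication so large block gaps cannot overflow.
    expected_block_time.saturating_mul(u32::try_from(past_diff).unwrap_or(u32::MAX)) / 3
}
```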

Line range hint 235-237: Use tokio::time::sleep instead of futures_timer::Delay

The futures_timer crate is no longer maintained. It's recommended to use tokio::time::sleep for compatibility and maintenance purposes.

Update the delay implementation:

```diff
- futures_timer::Delay::new(self.cfg.expected_block_time).await;
+ tokio::time::sleep(self.cfg.expected_block_time).await;
```

Ensure that the tokio features are enabled and that the rest of the codebase is compatible with tokio.

chain/indexer/src/block.rs (3)

341-346: Refactor to eliminate the boolean flag fetch_checksum_from_db

Passing a boolean flag like fetch_checksum_from_db to control function behavior can make the code harder to read and maintain. Consider refactoring by splitting process_block into two separate functions or using a strategy pattern to handle the different behaviors. This would improve code clarity and maintainability.
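One way to express this without a bare boolean is a small enum, so call sites state intent explicitly; the names below are hypothetical and not taken from the PR:

```rust
/// Hypothetical replacement for the `fetch_checksum_from_db` flag.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ChecksumSource {
    /// Read the last checksum from the logs DB (fast-sync resume path).
    Database,
    /// Use the checksum carried along with the in-memory log batch.
    CurrentBatch,
}

fn describe(source: ChecksumSource) -> &'static str {
    match source {
        ChecksumSource::Database => "fetching the last checksum from the database",
        ChecksumSource::CurrentBatch => "using the checksum from the current batch",
    }
}
```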


361-370: Simplify checksum handling to reduce complexity

The conditional logic based on fetch_checksum_from_db adds complexity to the checksum handling. Consider unifying the approach to obtaining last_log_checksum, possibly always fetching it from the database or restructuring the code to avoid the need for the flag altogether. This would simplify the code and reduce potential errors.


744-867: Refactor the test function to remove code duplication

The test_indexer_fast_sync_full_with_resume function contains duplicated code for "Run 1" and "Run 2". Extract common setup and assertions into helper functions or utilize test parameterization to reduce duplication. This refactoring will enhance readability and make the tests easier to maintain.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 6ef8549 and 390ff64.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • README.md (1 hunks)
  • chain/indexer/Cargo.toml (1 hunks)
  • chain/indexer/src/block.rs (9 hunks)
  • chain/rpc/Cargo.toml (1 hunks)
  • chain/rpc/src/indexer.rs (1 hunks)
  • db/api/Cargo.toml (1 hunks)
  • db/api/src/logs.rs (4 hunks)
  • db/sql/Cargo.toml (1 hunks)
  • db/sql/src/logs.rs (5 hunks)
  • hoprd/hoprd/src/main.rs (1 hunks)
✅ Files skipped from review due to trivial changes (5)
  • README.md
  • chain/indexer/Cargo.toml
  • chain/rpc/Cargo.toml
  • db/api/Cargo.toml
  • db/sql/Cargo.toml
🔇 Additional comments (9)
db/api/src/logs.rs (3)

3-3: LGTM: Import addition is appropriate

The new Hash import supports the updated return type of update_logs_checksums.


137-138: Verify all call sites of update_logs_checksums

The return type change from Result<()> to Result<Hash> is a breaking change that improves functionality by providing the computed hash. This aligns with the PR objectives for better index consistency tracking.

Let's verify that all call sites have been updated to handle the new return type:

✅ Verification successful

Let me analyze the results. The search shows:

  1. All calls to update_logs_checksums() are properly handling the returned Result<Hash> by:

    • Using is_ok() in assertions (5 occurrences in chain/indexer/src/block.rs)
    • Properly unwrapping and using the hash value in one case: Ok(last_log_checksum) => { ... }
    • Using unwrap() in test code (db/sql/src/logs.rs)
  2. The implementation in db/sql/src/logs.rs correctly returns Result<Hash> and contains the complete logic for computing and returning the hash.

All call sites properly handle the new return type

The return type change from Result<()> to Result<Hash> has been properly implemented and all call sites have been updated to handle the new return type. Most callers only check for success/failure which is backward compatible, while one caller actually uses the returned hash value.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Search for all calls to update_logs_checksums
rg -A 2 "update_logs_checksums\(\)"

# Search for implementations to ensure they return Hash
ast-grep --pattern 'impl $_ for $_ {
  $$$
  async fn update_logs_checksums(&self) -> Result<Hash> {
    $$$
  }
}'
```

Length of output: 23387


Line range hint 79-88: Verify implementations of the new processed parameter

The addition of the processed filter parameter is well-documented and aligns with the PR objectives for improving fast-sync functionality.

Let's verify that all implementations handle this new parameter correctly:

✅ Verification successful

Implementation of new processed parameter is correctly handled

The verification shows that the processed parameter is properly implemented and used throughout the codebase:

  • The trait definition in db/api/src/logs.rs declares the parameter
  • The implementation in db/sql/src/logs.rs handles the parameter correctly
  • The parameter is used appropriately in chain/indexer/src/block.rs for filtering processed/unprocessed logs
  • Test cases in db/sql/src/logs.rs verify the functionality with various combinations of the parameter:
    • Filtering for processed logs (Some(true))
    • Filtering for unprocessed logs (Some(false))
    • No filtering (None)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Search for trait implementations to ensure they've been updated
ast-grep --pattern 'impl $_ for $_ {
  $$$
  async fn get_logs_block_numbers(
    $$$
    processed: Option<bool>
  ) $$$
}'
```

Length of output: 126


Script:

```shell
#!/bin/bash
# Let's try a different approach to find implementations
# First, find the trait definition
echo "=== Searching for trait definition ==="
rg "trait.*get_logs_block_numbers" -A 5

echo -e "\n=== Searching for implementations ==="
# Search for impl blocks containing get_logs_block_numbers
rg "impl.*\{" -A 20 | rg -A 20 "get_logs_block_numbers"
```

Length of output: 275


Script:

```shell
#!/bin/bash
# Let's try a simpler search approach to find any mentions of get_logs_block_numbers
echo "=== Searching for get_logs_block_numbers mentions ==="
rg "get_logs_block_numbers" -B 2 -A 10

echo -e "\n=== Searching for files containing get_logs_block_numbers ==="
fd . -t f -x grep -l "get_logs_block_numbers" {}
```

Length of output: 10578

chain/rpc/src/indexer.rs (1)

Line range hint 223-230: Ensure logs are ordered before processing

The comment mentions that logs should be ordered by block number, and a panic occurs if they are not. Relying on external data being perfectly ordered can be risky.

Check if the logs returned by prov_clone.get_logs(&subrange).await are guaranteed to be ordered. If not, consider sorting them before processing.

db/sql/src/logs.rs (3)

706-744: Enhance test coverage in test_get_logs_block_numbers

The addition of log_3 and corresponding assertions improves test coverage by checking the behavior of get_logs_block_numbers with different processed statuses. This ensures that the method correctly filters block numbers based on the processed parameter in various scenarios.


173-187: Addition of processed filter in get_logs_block_numbers

The introduction of the processed: Option<bool> parameter enhances the functionality by allowing filtering of block numbers based on their processed status. Ensure that all calls to get_logs_block_numbers are updated to accommodate the new parameter and that the default behavior remains consistent when processed is None.

Run the following script to verify all calls to get_logs_block_numbers handle the processed parameter:

✅ Verification successful

All calls to get_logs_block_numbers correctly handle the new processed parameter

Based on the search results, all calls to get_logs_block_numbers have been updated to handle the new processed parameter:

  • Test cases in db/sql/src/logs.rs demonstrate proper usage with None, Some(true), and Some(false)
  • Production usage in chain/indexer/src/block.rs shows consistent handling with both Some(false) and None
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Description: Find all calls to `get_logs_block_numbers` and check if they include the `processed` parameter.

rg 'get_logs_block_numbers\(' --type rust -A 2
```

Length of output: 3677


Line range hint 288-341: Update return type of update_logs_checksums to Result<Hash>

Changing the return type of update_logs_checksums to Result<Hash> allows the method to return the last computed checksum, which can be useful for verification and logging purposes. Please ensure that all callers of this function are updated to handle the new return type appropriately to prevent any unintended runtime errors.

Run the following script to find all usages of update_logs_checksums and verify they handle the new return type:

chain/indexer/src/block.rs (2)

4-4: Logging imports correctly updated

The addition of the tracing macros (debug, error, info, trace) is appropriate for enhanced logging functionality throughout the indexer. Good job incorporating detailed logging.


149-154: Verify processing of unprocessed logs during fast sync continuation

When fast sync is enabled and the index database is not empty, the code processes existing unprocessed logs. Ensure that this logic handles unprocessed logs correctly without causing data inconsistency or duplication. It might be beneficial to verify that already processed logs are not reprocessed.
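As a rough illustration of the resume check being discussed, the decision can be reduced to asking the logs DB whether any unprocessed logs remain. The trait below is a hypothetical stand-in for the relevant slice of the DB API (it only mirrors the `get_logs_block_numbers(from, to, processed)` signature used elsewhere in this PR) and assumes the `async-trait` and `anyhow` crates:

```rust
#[async_trait::async_trait]
trait LogsBlockNumbers {
    async fn get_logs_block_numbers(
        &self,
        from: Option<u64>,
        to: Option<u64>,
        processed: Option<bool>,
    ) -> anyhow::Result<Vec<u64>>;
}

/// Fast sync can resume iff some logs were fetched but never processed.
async fn should_resume_fast_sync(db: &impl LogsBlockNumbers) -> anyhow::Result<bool> {
    Ok(!db.get_logs_block_numbers(None, None, Some(false)).await?.is_empty())
}
```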

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
chain/rpc/src/client.rs (2)

489-490: Consider using debug log level instead of info in SurfRequestor::new

The log statement info!(?cfg, "creating surf client"); provides useful information during debugging but may be too verbose at the info level in production environments. Consider changing it to debug! to reduce log verbosity during normal operation.


556-557: Consider using debug log level instead of info in ReqwestRequestor::new

The log statement info!(?cfg, "creating reqwest client"); provides useful debugging information but might be too verbose at the info level for production use. Changing it to debug! can help keep the logs cleaner while still providing necessary details during development.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 390ff64 and c4373d9.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • chain/rpc/Cargo.toml (2 hunks)
  • chain/rpc/src/client.rs (3 hunks)
  • hoprd/hoprd/src/main.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • chain/rpc/Cargo.toml
  • hoprd/hoprd/src/main.rs
🔇 Additional comments (2)
chain/rpc/src/client.rs (2)

562-564: Verify the necessity of setting tcp_keepalive to 30 seconds

The .tcp_keepalive(Some(Duration::from_secs(30))) configuration sets the TCP keep-alive interval to 30 seconds. Ensure that this value aligns with your network environment and does not conflict with server expectations or cause unintended side effects.


567-568: Ensure server compatibility with zstd and brotli encodings

Enabling .zstd(true) and .brotli(true) allows the client to handle responses compressed with these encodings. Verify that the RPC server supports zstd and brotli compression to prevent issues with response processing.

@tolbrino tolbrino added this to the 2.2.0-rc.1 milestone Nov 27, 2024
@tolbrino tolbrino changed the title from "Fix tokio stack overflow crash" to "Fix tokio stack overflow crash during indexing" on Nov 27, 2024
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
db/sql/src/logs.rs (1)

173-187: Improved query performance in get_logs_block_numbers

The changes to use LogStatus table directly with DISTINCT clause and optional processing status filter improve database performance by:

  • Reducing JOIN operations by querying LogStatus directly
  • Preventing duplicate block numbers
  • Adding flexibility to filter by processing status

Consider adding an index on (processed, block_number) columns if not already present, to further optimize this query pattern.
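If such an index is indeed missing, a sea-orm-migration sketch along these lines could add it; the table and column identifiers below are assumptions based on the discussion above, not the project's actual schema names:

```rust
use sea_orm_migration::prelude::*;

#[derive(DeriveMigrationName)]
pub struct Migration;

#[async_trait::async_trait]
impl MigrationTrait for Migration {
    async fn up(&self, manager: &SchemaManager) -> Result<(), DbErr> {
        manager
            .create_index(
                Index::create()
                    .name("idx_log_status_processed_block_number")
                    .table(LogStatus::Table)
                    .col(LogStatus::Processed)
                    .col(LogStatus::BlockNumber)
                    .to_owned(),
            )
            .await
    }

    async fn down(&self, manager: &SchemaManager) -> Result<(), DbErr> {
        manager
            .drop_index(
                Index::drop()
                    .name("idx_log_status_processed_block_number")
                    .table(LogStatus::Table)
                    .to_owned(),
            )
            .await
    }
}

// Assumed identifiers; the real schema may use different names.
#[derive(DeriveIden)]
enum LogStatus {
    Table,
    Processed,
    BlockNumber,
}
```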

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c4373d9 and 26aa3cb.

📒 Files selected for processing (3)
  • README.md (1 hunks)
  • chain/rpc/Cargo.toml (2 hunks)
  • db/sql/src/logs.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • README.md
  • chain/rpc/Cargo.toml
🧰 Additional context used
📓 Learnings (1)
db/sql/src/logs.rs (1)
Learnt from: NumberFour8
PR: hoprnet/hoprnet#6142
File: hopr/hopr-lib/tests/chain_integration_tests.rs:40-56
Timestamp: 2024-11-12T10:56:01.095Z
Learning: The use of `unwrap` is acceptable in test scenarios within the `hoprnet` project, as it helps directly identify issues when tests fail.
🔇 Additional comments (3)
db/sql/src/logs.rs (3)

Line range hint 288-341: Enhanced checksum tracking capability

The change to return the last computed checksum improves the function's utility by allowing callers to track the chain of checksums, which is valuable for verifying data integrity during indexing.


335-338: Improved error logging with structured fields

The error logging now uses structured fields and provides clearer context about the failed operation, which will help with debugging issues in production.
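As a generic illustration of the pattern (not the project's exact log call), structured fields with `tracing` look like this and make the failure context filterable downstream; the field names are examples only:

```rust
use tracing::error;

// Illustrative helper: field names here are examples, not the actual ones used in logs.rs.
fn log_checksum_failure(block_number: u64, log_index: u64, err: &dyn std::error::Error) {
    error!(block_number, log_index, error = %err, "failed to update log checksum");
}
```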


Line range hint 688-744: Comprehensive test coverage for new functionality

The new test cases thoroughly cover:

  • Filtering by processed/unprocessed status
  • Various combinations of block number ranges
  • Edge cases and result ordering

@tolbrino tolbrino enabled auto-merge November 27, 2024 18:01
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
chain/indexer/src/block.rs (2)

149-154: Consider adding more detailed logging for fast sync resume

The implementation correctly handles fast sync resume, but could benefit from additional logging to help diagnose issues.

-info!("Fast sync is enabled, but the index database is not empty. Fast sync will continue on existing unprocessed logs.");
+info!(
+    unprocessed_logs = ?log_block_numbers.len(),
+    "Fast sync is enabled, but the index database is not empty. Fast sync will continue on existing unprocessed logs."
+);

738-860: LGTM! Consider adding content verification

The test implementation thoroughly covers fast sync resume functionality. Consider adding assertions to verify the content of processed logs, not just their count.

```diff
 assert_eq!(db.get_logs_block_numbers(None, None, Some(true)).await?.len(), 4);
 assert_eq!(db.get_logs_block_numbers(None, None, Some(false)).await?.len(), 0);
+
+// Verify the content of processed logs
+let processed_logs = db.get_logs(None, None).await?;
+assert!(processed_logs.iter().all(|log| log.processed == Some(true)), "All logs should be processed");
+assert!(processed_logs.iter().any(|log| log.block_number == 3), "Should contain log from block 3");
+assert!(processed_logs.iter().any(|log| log.block_number == 4), "Should contain log from block 4");
```
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 26aa3cb and 516b66b.

📒 Files selected for processing (1)
  • chain/indexer/src/block.rs (8 hunks)
🧰 Additional context used
📓 Learnings (1)
chain/indexer/src/block.rs (1)
Learnt from: tolbrino
PR: hoprnet/hoprnet#6675
File: chain/indexer/src/block.rs:319-320
Timestamp: 2024-11-27T17:22:40.429Z
Learning: In the Rust file `chain/indexer/src/block.rs`, within the `process_block_by_id` function, if there is a block number mismatch in logs retrieved from the database, we use `panic!` to handle this unrecoverable error.

@Teebor-Choka Teebor-Choka merged commit f1d908c into master Nov 27, 2024
30 checks passed
@Teebor-Choka Teebor-Choka deleted the tb/20241125-stack-size-2 branch November 27, 2024 22:34
@coderabbitai coderabbitai bot mentioned this pull request Feb 25, 2025
Development

Successfully merging this pull request may close these issues.

  • Using fast sync in Rotsee leads to an incomplete index
  • RPC sync in rotsee leads to different index results
3 participants