Fix tokio stack overflow crash during indexing #6675
Conversation
This reverts commit f03c140. (Force-pushed from 5eee8a1 to 390ff64.)
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (9)
hoprd/hoprd/src/main.rs (1)

163-166: LGTM! Consider adding monitoring and documentation.

The change from thread stack size to max blocking threads configuration should help resolve the stack overflow crash by providing more threads for blocking operations. The increase to 1024 threads (from the default 512) aligns with the throughput requirements discussed in past review comments.

Recommendations:
- Monitor thread usage and system resources to ensure the increased thread count doesn't cause resource pressure.
- Consider adding a comment explaining why 1024 was chosen as the optimal value.

Add a comment explaining the rationale:
```diff
 let res = tokio::runtime::Builder::new_multi_thread()
     .enable_all()
+    // Increased from default 512 to handle more concurrent blocking operations
+    // and prevent stack overflow crashes in the indexer hot-path
     .max_blocking_threads(1024)
     .build()
```

chain/rpc/src/indexer.rs (5)
Line range hint 73-83: Enhance error context when fetching logs fails

The current error logging provides the block range but could include additional context, such as the specific error code or message, to aid in debugging.

Consider updating the error log to include the error code:

```diff
 error!(
     from = ?subrange.get_from_block(),
     to = ?subrange.get_to_block(),
+    error_code = ?e.code(),
     error = %e,
     "failed to fetch logs in block subrange"
 );
```
Line range hint 101-114: Avoid panicking when the start block number exceeds the latest block number

Panicking within the `try_stream_logs` method can cause the entire application to crash unexpectedly. It's recommended to handle this scenario by returning a structured error so that the calling code can decide how to handle it gracefully.

Replace the panic with an error return:

```diff
-panic!("indexer start block number {from_block} is greater than the chain latest block number {latest_block} (diff {past_diff}) =>
-    possible causes: chain reorg, RPC provider out of sync, corrupted DB =>
-    possible solutions: change the RPC provider, reinitialize the DB");
+return Err(RpcError::IndexerStartBlockTooHigh {
+    from_block,
+    latest_block,
+    past_diff,
+    message: "Possible causes include chain reorg, RPC provider out of sync, or corrupted DB. Possible solutions: change the RPC provider or reinitialize the DB.".to_string(),
+});
```

Ensure that `RpcError` includes a variant `IndexerStartBlockTooHigh` to accommodate this error.
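For illustration, a minimal sketch of how such a variant could look, assuming the crate defines `RpcError` with `thiserror` (the actual error type and its fields may differ):

```rust
use thiserror::Error;

#[derive(Error, Debug)]
pub enum RpcError {
    // ... existing variants elided ...

    /// Hypothetical variant for an indexer start block ahead of the chain head.
    #[error("indexer start block {from_block} is greater than the chain latest block {latest_block} (diff {past_diff}): {message}")]
    IndexerStartBlockTooHigh {
        from_block: u64,
        latest_block: u64,
        past_diff: u64,
        message: String,
    },
}
```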
Line range hint 155-167: Handle unrecoverable RPC errors without panicking

Using `panic!` after exceeding `MAX_LOOP_FAILURES` can abruptly terminate the application. Instead, return an error to allow higher-level code to handle the situation appropriately.

Modify the code to return an error:

```diff
-panic!("!!! Cannot advance the chain indexing due to unrecoverable RPC errors.
-    - The RPC provider does not seem to be working correctly.
-    - The last encountered error was: {e}");
+return Err(RpcError::UnrecoverableRpcError {
+    message: "Cannot advance the chain indexing due to unrecoverable RPC errors. The RPC provider may not be working correctly.".to_string(),
+    last_error: Box::new(e),
+});
```

Update `RpcError` to include an `UnrecoverableRpcError` variant if it doesn't already exist.
Line range hint 200-202: Check for potential integer overflow when calculating delay

In the line where the delay is calculated:

```rust
futures_timer::Delay::new(past_diff as u32 * self.cfg.expected_block_time / 3).await;
```

Casting `past_diff` to `u32` might cause an overflow if `past_diff` exceeds `u32::MAX`. This can happen if the block difference is large.

Consider using `u64` for the calculation to prevent overflow:

```diff
-futures_timer::Delay::new(past_diff as u32 * self.cfg.expected_block_time / 3).await;
+futures_timer::Delay::new(Duration::from_secs(past_diff * self.cfg.expected_block_time.as_secs() / 3)).await;
```
Line range hint 235-237: Use `tokio::time::sleep` instead of `futures_timer::Delay`

The `futures_timer` crate is no longer maintained. It's recommended to use `tokio::time::sleep` for compatibility and maintenance purposes.

Update the delay implementation:

```diff
-futures_timer::Delay::new(self.cfg.expected_block_time).await;
+tokio::time::sleep(self.cfg.expected_block_time).await;
```

Ensure that the `tokio` features are enabled and that the rest of the codebase is compatible with `tokio`.

chain/indexer/src/block.rs (3)
341-346: Refactor to eliminate the boolean flag `fetch_checksum_from_db`

Passing a boolean flag like `fetch_checksum_from_db` to control function behavior can make the code harder to read and maintain. Consider refactoring by splitting `process_block` into two separate functions or using a strategy pattern to handle the different behaviors, as in the sketch below. This would improve code clarity and maintainability.
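A self-contained sketch of the split (names and types here are stand-ins, not the actual `process_block` signature):

```rust
// Stand-ins for the real types in chain/indexer/src/block.rs.
struct Checksum(u64);

fn last_checksum_from_db() -> Checksum {
    // Placeholder for the actual database lookup.
    Checksum(42)
}

fn process_block_inner(block: u64, checksum: Checksum) {
    println!("processing block {block} with checksum {}", checksum.0);
}

// Instead of `process_block(block, fetch_checksum_from_db: bool, ...)`,
// expose two intention-revealing entry points sharing one core:
fn process_block_with_checksum(block: u64, checksum: Checksum) {
    process_block_inner(block, checksum);
}

fn process_block_fetching_checksum(block: u64) {
    process_block_inner(block, last_checksum_from_db());
}

fn main() {
    process_block_with_checksum(1, Checksum(7));
    process_block_fetching_checksum(2);
}
```

Call sites then state their intent directly instead of threading a boolean through the hot path.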
361-370: Simplify checksum handling to reduce complexity

The conditional logic based on `fetch_checksum_from_db` adds complexity to the checksum handling. Consider unifying the approach to obtaining `last_log_checksum`, possibly always fetching it from the database, or restructuring the code to avoid the need for the flag altogether. This would simplify the code and reduce potential errors.
744-867: Refactor the test function to remove code duplication

The `test_indexer_fast_sync_full_with_resume` function contains duplicated code for "Run 1" and "Run 2". Extract common setup and assertions into helper functions, or use test parameterization, to reduce duplication. This refactoring will enhance readability and make the tests easier to maintain.
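One possible shape for the extraction, as a hedged sketch with stand-in state rather than the real indexer and DB fixtures:

```rust
// Stand-in for the test state; the real test drives the indexer and a DB.
struct SyncState {
    processed: usize,
    unprocessed: usize,
}

fn run_fast_sync(state: &mut SyncState) {
    // Placeholder for spinning up the indexer and letting fast-sync finish.
    state.processed += state.unprocessed;
    state.unprocessed = 0;
}

fn assert_fully_synced(state: &SyncState, expected_processed: usize) {
    assert_eq!(state.processed, expected_processed);
    assert_eq!(state.unprocessed, 0);
}

#[test]
fn test_indexer_fast_sync_full_with_resume_sketch() {
    let mut state = SyncState { processed: 0, unprocessed: 4 };
    run_fast_sync(&mut state); // Run 1
    assert_fully_synced(&state, 4);

    state.unprocessed = 2; // new logs arrive before Run 2
    run_fast_sync(&mut state); // Run 2 reuses the same helpers
    assert_fully_synced(&state, 6);
}
```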
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by `!**/*.lock`
📒 Files selected for processing (10)
- README.md (1 hunks)
- chain/indexer/Cargo.toml (1 hunks)
- chain/indexer/src/block.rs (9 hunks)
- chain/rpc/Cargo.toml (1 hunks)
- chain/rpc/src/indexer.rs (1 hunks)
- db/api/Cargo.toml (1 hunks)
- db/api/src/logs.rs (4 hunks)
- db/sql/Cargo.toml (1 hunks)
- db/sql/src/logs.rs (5 hunks)
- hoprd/hoprd/src/main.rs (1 hunks)
✅ Files skipped from review due to trivial changes (5)
- README.md
- chain/indexer/Cargo.toml
- chain/rpc/Cargo.toml
- db/api/Cargo.toml
- db/sql/Cargo.toml
🔇 Additional comments (9)
db/api/src/logs.rs (3)
3-3: LGTM: Import addition is appropriate

The new `Hash` import supports the updated return type of `update_logs_checksums`.
137-138: Verify all call sites of `update_logs_checksums`

The return type change from `Result<()>` to `Result<Hash>` is a breaking change that improves functionality by providing the computed hash. This aligns with the PR objectives for better index consistency tracking.

Let's verify that all call sites have been updated to handle the new return type:
✅ Verification successful

Let me analyze the results. The search shows:

- All calls to `update_logs_checksums()` properly handle the returned `Result<Hash>` by:
  - using `is_ok()` in assertions (5 occurrences in `chain/indexer/src/block.rs`),
  - properly unwrapping and using the hash value in one case: `Ok(last_log_checksum) => { ... }`,
  - using `unwrap()` in test code (`db/sql/src/logs.rs`).
- The implementation in `db/sql/src/logs.rs` correctly returns `Result<Hash>` and contains the complete logic for computing and returning the hash.

All call sites properly handle the new return type

The return type change from `Result<()>` to `Result<Hash>` has been properly implemented and all call sites have been updated to handle the new return type. Most callers only check for success/failure, which is backward compatible, while one caller actually uses the returned hash value.
🏁 Scripts executed

The following scripts were executed for the analysis:

```bash
#!/bin/bash
# Search for all calls to update_logs_checksums
rg -A 2 "update_logs_checksums\(\)"

# Search for implementations to ensure they return Hash
ast-grep --pattern 'impl $_ for $_ {
  $$$
  async fn update_logs_checksums(&self) -> Result<Hash> {
    $$$
  }
}'
```

Length of output: 23387
Line range hint 79-88: Verify implementations of the new `processed` parameter

The addition of the `processed` filter parameter is well-documented and aligns with the PR objectives for improving fast-sync functionality.

Let's verify that all implementations handle this new parameter correctly:
✅ Verification successful

Implementation of the new `processed` parameter is correctly handled

The verification shows that the `processed` parameter is properly implemented and used throughout the codebase:

- The trait definition in `db/api/src/logs.rs` declares the parameter.
- The implementation in `db/sql/src/logs.rs` handles the parameter correctly.
- The parameter is used appropriately in `chain/indexer/src/block.rs` for filtering processed/unprocessed logs.
- Test cases in `db/sql/src/logs.rs` verify the functionality with various combinations of the parameter:
  - filtering for processed logs (`Some(true)`),
  - filtering for unprocessed logs (`Some(false)`),
  - no filtering (`None`).
🏁 Scripts executed

The following scripts were executed for the analysis:

```bash
#!/bin/bash
# Search for trait implementations to ensure they've been updated
ast-grep --pattern 'impl $_ for $_ {
  $$$
  async fn get_logs_block_numbers(
    $$$
    processed: Option<bool>
  ) $$$
}'
```

Length of output: 126

```bash
#!/bin/bash
# Let's try a different approach to find implementations
# First, find the trait definition
echo "=== Searching for trait definition ==="
rg "trait.*get_logs_block_numbers" -A 5

echo -e "\n=== Searching for implementations ==="
# Search for impl blocks containing get_logs_block_numbers
rg "impl.*\{" -A 20 | rg -A 20 "get_logs_block_numbers"
```

Length of output: 275

```bash
#!/bin/bash
# Let's try a simpler search approach to find any mentions of get_logs_block_numbers
echo "=== Searching for get_logs_block_numbers mentions ==="
rg "get_logs_block_numbers" -B 2 -A 10

echo -e "\n=== Searching for files containing get_logs_block_numbers ==="
fd . -t f -x grep -l "get_logs_block_numbers" {}
```

Length of output: 10578
chain/rpc/src/indexer.rs (1)

Line range hint 223-230: Ensure logs are ordered before processing

The comment mentions that logs should be ordered by block number, and a panic occurs if they are not. Relying on external data being perfectly ordered can be risky.

Check whether the logs returned by `prov_clone.get_logs(&subrange).await` are guaranteed to be ordered. If not, consider sorting them before processing, as in the sketch below.
db/sql/src/logs.rs (3)

706-744: Enhance test coverage in `test_get_logs_block_numbers`

The addition of `log_3` and the corresponding assertions improves test coverage by checking the behavior of `get_logs_block_numbers` with different `processed` statuses. This ensures that the method correctly filters block numbers based on the `processed` parameter in various scenarios.
173-187: Addition of `processed` filter in `get_logs_block_numbers`

The introduction of the `processed: Option<bool>` parameter enhances the functionality by allowing filtering of block numbers based on their processed status. Ensure that all calls to `get_logs_block_numbers` are updated to accommodate the new parameter and that the default behavior remains consistent when `processed` is `None`.

Run the following script to verify all calls to `get_logs_block_numbers` handle the `processed` parameter:
✅ Verification successful

All calls to `get_logs_block_numbers` correctly handle the new `processed` parameter

Based on the search results, all calls to `get_logs_block_numbers` have been updated to handle the new `processed` parameter:

- Test cases in `db/sql/src/logs.rs` demonstrate proper usage with `None`, `Some(true)`, and `Some(false)`.
- Production usage in `chain/indexer/src/block.rs` shows consistent handling with both `Some(false)` and `None`.
🏁 Scripts executed

The following scripts were executed for the analysis:

```bash
#!/bin/bash
# Description: Find all calls to `get_logs_block_numbers` and check if they include the `processed` parameter.
rg 'get_logs_block_numbers\(' --type rust -A 2
```

Length of output: 3677
Line range hint 288-341: Update return type of `update_logs_checksums` to `Result<Hash>`

Changing the return type of `update_logs_checksums` to `Result<Hash>` allows the method to return the last computed checksum, which can be useful for verification and logging purposes. Please ensure that all callers of this function are updated to handle the new return type appropriately to prevent any unintended runtime errors.

Run the following script to find all usages of `update_logs_checksums` and verify they handle the new return type:
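The script itself is not reproduced above; a plausible search, mirroring the earlier verification scripts, would be:

```bash
#!/bin/bash
# Illustrative reconstruction: find all call sites of update_logs_checksums
# and show how the returned Result<Hash> is used.
rg 'update_logs_checksums\(' --type rust -A 2
```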
chain/indexer/src/block.rs (2)

4-4: Logging imports correctly updated

The addition of the `tracing` macros (`debug`, `error`, `info`, `trace`) is appropriate for the enhanced logging functionality throughout the indexer. Good job incorporating detailed logging.
149-154: Verify processing of unprocessed logs during fast sync continuation

When fast sync is enabled and the index database is not empty, the code processes existing unprocessed logs. Ensure that this logic handles unprocessed logs correctly without causing data inconsistency or duplication. It might be beneficial to verify that already processed logs are not reprocessed, as illustrated in the sketch below.
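As a hedged, self-contained illustration of the property to verify — only unprocessed block numbers are replayed on resume, mirroring the `processed: Option<bool>` filter (stand-in data structure, not the real DB API):

```rust
use std::collections::BTreeMap;

// Stand-in for the LogStatus rows; the real code queries the database with
// get_logs_block_numbers(.., Some(false)) instead of filtering in memory.
fn blocks_to_resume(status: &BTreeMap<u64, bool>) -> Vec<u64> {
    status
        .iter()
        .filter(|(_, &processed)| !processed)
        .map(|(&block, _)| block)
        .collect()
}

fn main() {
    let mut status = BTreeMap::new();
    status.insert(1, true);  // already processed: must not be replayed
    status.insert(2, false); // crashed before processing: resume here
    status.insert(3, false);
    assert_eq!(blocks_to_resume(&status), vec![2, 3]);
}
```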
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
chain/rpc/src/client.rs (2)

489-490: Consider using `debug` log level instead of `info` in `SurfRequestor::new`

The log statement `info!(?cfg, "creating surf client");` provides useful information during debugging but may be too verbose at the `info` level in production environments. Consider changing it to `debug!` to reduce log verbosity during normal operation.
556-557: Consider using `debug` log level instead of `info` in `ReqwestRequestor::new`

The log statement `info!(?cfg, "creating reqwest client");` provides useful debugging information but might be too verbose at the `info` level for production use. Changing it to `debug!` can help keep the logs cleaner while still providing the necessary details during development.
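For illustration, the suggested change in `ReqwestRequestor::new` (and analogously in `SurfRequestor::new`) amounts to:

```diff
-info!(?cfg, "creating reqwest client");
+debug!(?cfg, "creating reqwest client");
```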
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by `!**/*.lock`
📒 Files selected for processing (3)
- chain/rpc/Cargo.toml (2 hunks)
- chain/rpc/src/client.rs (3 hunks)
- hoprd/hoprd/src/main.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- chain/rpc/Cargo.toml
- hoprd/hoprd/src/main.rs
🔇 Additional comments (2)
chain/rpc/src/client.rs (2)
562-564: Verify the necessity of setting `tcp_keepalive` to 30 seconds

The `.tcp_keepalive(Some(Duration::from_secs(30)))` configuration sets the TCP keep-alive interval to 30 seconds. Ensure that this value aligns with your network environment and does not conflict with server expectations or cause unintended side effects.
567-568: Ensure server compatibility with `zstd` and `brotli` encodings

Enabling `.zstd(true)` and `.brotli(true)` allows the client to handle responses compressed with these encodings. Verify that the RPC server supports `zstd` and `brotli` compression to prevent issues with response processing; a sketch of the combined configuration follows.
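Taken together, a minimal sketch of the client configuration under discussion, assuming reqwest with its `brotli` and `zstd` cargo features enabled (builder details beyond the calls quoted above are assumptions):

```rust
use std::time::Duration;

fn build_rpc_http_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Keep idle TCP connections alive so successive RPC calls can reuse them.
        .tcp_keepalive(Some(Duration::from_secs(30)))
        // Decode zstd- and brotli-compressed responses when the server supports them.
        .zstd(true)
        .brotli(true)
        .build()
}

fn main() -> reqwest::Result<()> {
    let _client = build_rpc_http_client()?;
    Ok(())
}
```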
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
db/sql/src/logs.rs (1)

173-187: Improved query performance in `get_logs_block_numbers`

The changes to query the `LogStatus` table directly with a DISTINCT clause and an optional processing-status filter improve database performance by:

- reducing JOIN operations by querying `LogStatus` directly,
- preventing duplicate block numbers,
- adding flexibility to filter by processing status.

Consider adding an index on the (processed, block_number) columns, if not already present, to further optimize this query pattern; see the sketch after this list.
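A hedged sketch of such an index built with sea-query (assuming the crate is available and that these table/column identifiers match the real `LogStatus` schema, which they may not):

```rust
use sea_query::{Alias, Index, SqliteQueryBuilder};

fn main() {
    // Hypothetical identifiers; align with the actual LogStatus table mapping.
    let create_index = Index::create()
        .if_not_exists()
        .name("idx_log_status_processed_block_number")
        .table(Alias::new("log_status"))
        .col(Alias::new("processed"))
        .col(Alias::new("block_number"))
        .to_string(SqliteQueryBuilder);
    println!("{create_index}");
}
```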
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (3)
- README.md (1 hunks)
- chain/rpc/Cargo.toml (2 hunks)
- db/sql/src/logs.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- README.md
- chain/rpc/Cargo.toml
🧰 Additional context used
📓 Learnings (1)
db/sql/src/logs.rs (1)
Learnt from: NumberFour8
PR: hoprnet/hoprnet#6142
File: hopr/hopr-lib/tests/chain_integration_tests.rs:40-56
Timestamp: 2024-11-12T10:56:01.095Z
Learning: The use of `unwrap` is acceptable in test scenarios within the `hoprnet` project, as it helps directly identify issues when tests fail.
🔇 Additional comments (3)
db/sql/src/logs.rs (3)
Line range hint 288-341: Enhanced checksum tracking capability

The change to return the last computed checksum improves the function's utility by allowing callers to track the chain of checksums, which is valuable for verifying data integrity during indexing.
335-338: Improved error logging with structured fields

The error logging now uses structured fields and provides clearer context about the failed operation, which will help with debugging issues in production.
Line range hint 688-744: Comprehensive test coverage for new functionality

The new test cases thoroughly cover:

- filtering by processed/unprocessed status,
- various combinations of block number ranges,
- edge cases and result ordering.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
chain/indexer/src/block.rs (2)

149-154: Consider adding more detailed logging for fast sync resume

The implementation correctly handles fast sync resume but could benefit from additional logging to help diagnose issues.

```diff
-info!("Fast sync is enabled, but the index database is not empty. Fast sync will continue on existing unprocessed logs.");
+info!(
+    unprocessed_logs = ?log_block_numbers.len(),
+    "Fast sync is enabled, but the index database is not empty. Fast sync will continue on existing unprocessed logs."
+);
```
738-860: LGTM! Consider adding content verification

The test implementation thoroughly covers fast sync resume functionality. Consider adding assertions to verify the content of processed logs, not just their count.

```diff
 assert_eq!(db.get_logs_block_numbers(None, None, Some(true)).await?.len(), 4);
 assert_eq!(db.get_logs_block_numbers(None, None, Some(false)).await?.len(), 0);
+
+// Verify the content of processed logs
+let processed_logs = db.get_logs(None, None).await?;
+assert!(processed_logs.iter().all(|log| log.processed == Some(true)), "All logs should be processed");
+assert!(processed_logs.iter().any(|log| log.block_number == 3), "Should contain log from block 3");
+assert!(processed_logs.iter().any(|log| log.block_number == 4), "Should contain log from block 4");
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (1)
- chain/indexer/src/block.rs (8 hunks)
🧰 Additional context used
📓 Learnings (1)
chain/indexer/src/block.rs (1)
Learnt from: tolbrino
PR: hoprnet/hoprnet#6675
File: chain/indexer/src/block.rs:319-320
Timestamp: 2024-11-27T17:22:40.429Z
Learning: In the Rust file `chain/indexer/src/block.rs`, within the `process_block_by_id` function, if there is a block number mismatch in logs retrieved from the database, we use `panic!` to handle this unrecoverable error.
The indexer hot-path crashed during the processing of logs. The hot-path has been improved to perform fewer DB operations and to be more performant.

Moreover, fast-sync resume has been added. Previously, if the node crashed during fast-sync, it would end up with an incomplete index and never complete it.

Lastly, the HTTP client has been configured to accept brotli and zstd content encodings, drastically reducing the amount of transmitted data, and to keep TCP connections open for 30 seconds.
Fixes #6665
Fixes #6664