
Conversation

mzumsande
Contributor

@mzumsande mzumsande commented Mar 31, 2025

As described in #32179, pindexLastCommonBlock is updated later than necessary
in master.
In the case of a linear chain with no forks, it can be moved forward at the beginning of
FindNextBlocksToDownload, so that the updated value can be used to better estimate nWindowEnd.
This helps the node request all blocks from its peers within the correct 1024-block window and avoids peers being incorrectly marked as stallers.
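
For context, the end of the download window is derived directly from pindexLastCommonBlock. A minimal sketch, simplified from FindNextBlocks in net_processing.cpp (where BLOCK_DOWNLOAD_WINDOW is 1024):

// Simplified from FindNextBlocks in net_processing.cpp: the window of blocks
// that may be requested from a peer ends BLOCK_DOWNLOAD_WINDOW (1024) blocks
// past pindexLastCommonBlock. A stale pindexLastCommonBlock therefore shrinks
// the effective window and can make a well-behaved peer look like a staller.
int nWindowEnd = state->pindexLastCommonBlock->nHeight + BLOCK_DOWNLOAD_WINDOW;
int nMaxHeight = std::min<int>(state->pindexBestKnownBlock->nHeight, nWindowEnd + 1);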

The second commit removes a special case introduced in 49d569c that is no longer necessary, because the more general logic also covers it.

I also changed p2p_ibd_stalling.py to cover the situation after a stall has been resolved, making sure that no additional peers are then incorrectly marked as stallers.

Fixes #32179

@DrahtBot
Contributor

DrahtBot commented Mar 31, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32180.

Reviews

See the guideline for information on the review process.

Type         Reviewers
ACK          furszy, stringintech
Concept ACK  yuvicc
Stale ACK    vasild

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

No conflicts as of last run.

@mzumsande mzumsande marked this pull request as draft March 31, 2025 22:40
@DrahtBot DrahtBot added the P2P label Mar 31, 2025
@mzumsande mzumsande marked this pull request as ready for review April 2, 2025 21:15
@mzumsande mzumsande force-pushed the 202403_ibd_lastcommonblock branch from bb5aa0a to 205e651 Compare April 7, 2025 21:56
Contributor

@stringintech stringintech left a comment


Concept ACK.

Comment on lines 1374 to 1408
// If our tip has advanced beyond pindexLastCommonBlock, move it ahead to the tip. We don't need to download any blocks in between, and skipping ahead here
// allows us to determine nWindowEnd better.
if (m_chainman.ActiveHeight() > state->pindexLastCommonBlock->nHeight && state->pindexBestKnownBlock->GetAncestor(m_chainman.ActiveHeight()) == m_chainman.ActiveTip()) {
state->pindexLastCommonBlock = m_chainman.ActiveTip();
}

// If the peer reorganized, our previous pindexLastCommonBlock may not be an ancestor
// of its current tip anymore. Go back enough to fix that.
state->pindexLastCommonBlock = LastCommonAncestor(state->pindexLastCommonBlock, state->pindexBestKnownBlock);
Contributor

@stringintech stringintech Apr 10, 2025


Could there be an issue if both nodes reorganize to the same chain, but the last common block is still on another fork?

If our new tip height is less than the outdated pindexLastCommonBlock height (like when reorging to a shorter chain with more work), we'd skip the optimization when we shouldn't.

Since LastCommonAncestor() brings back the pindexLastCommonBlock on the peer's best chain, maybe the new code should come after this call?

Not sure if this edge case matters in practice, but thought I'd mention it.

Contributor Author


Good observation!
I also think that this edge case shouldn't matter much.
It's absolutely essential that pindexLastCommonBlock is an ancestor of pindexBestKnownBlock when calling FindNextBlocks, which is why I originally left that check last. But the added check can't violate that invariant, since it only sets pindexLastCommonBlock to the tip if that tip is an ancestor of pindexBestKnownBlock.
So I've changed the order.
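
For illustration, the reordered logic would look roughly like this (a sketch assuming the final ordering described above; the identifiers are the ones from the diff under review):

// First walk pindexLastCommonBlock back onto the peer's chain (handles reorgs).
state->pindexLastCommonBlock = LastCommonAncestor(state->pindexLastCommonBlock, state->pindexBestKnownBlock);

// Advancing to our tip afterwards preserves the ancestor invariant, because it
// only happens when the tip is an ancestor of pindexBestKnownBlock.
if (m_chainman.ActiveHeight() > state->pindexLastCommonBlock->nHeight &&
    state->pindexBestKnownBlock->GetAncestor(m_chainman.ActiveHeight()) == m_chainman.ActiveTip()) {
    state->pindexLastCommonBlock = m_chainman.ActiveTip();
}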

@yuvicc
Contributor

yuvicc commented Apr 11, 2025

Concept ACK

Thanks for this PR.

I have made some notes regarding this PR and will add a guide for testing it very soon.

@mzumsande mzumsande force-pushed the 202403_ibd_lastcommonblock branch from 205e651 to c69ee2d Compare April 11, 2025 19:46
Contributor

@vasild vasild left a comment


ACK c69ee2d

The code changes look reasonable. I tested that p2p_ibd_stalling.py fails without commit 1d8504c (p2p: During block download, adjust pindexLastCommonBlock right away). It failed 7 times and passed 1 time. Is it expected to pass sometimes?
I also checked that the code inside the modified ifs is executed during p2p_ibd_stalling.py.

Comment on lines +1380 to +1416
// If our tip has advanced beyond pindexLastCommonBlock, move it ahead to the tip. We don't need to download any blocks in between, and skipping ahead here
// allows us to determine nWindowEnd better.
if (m_chainman.ActiveHeight() > state->pindexLastCommonBlock->nHeight && state->pindexBestKnownBlock->GetAncestor(m_chainman.ActiveHeight()) == m_chainman.ActiveTip()) {
state->pindexLastCommonBlock = m_chainman.ActiveTip();
}
Contributor


This change assumes their tip is higher than our tip. What if their tip is lower? Maybe worth advancing pindexLastCommonBlock in that case as well. Something like this:

// If our tip has advanced beyond pindexLastCommonBlock, move it ahead to the tip. We don't need to
// download any blocks in between, and skipping ahead here allows us to determine nWindowEnd better.
const auto our_tip{m_chainman.ActiveTip()};
const auto peer_best_known{state->pindexBestKnownBlock};
if (our_tip->nHeight > state->pindexLastCommonBlock->nHeight) {
    if (peer_best_known->nHeight >= our_tip->nHeight) {
        if (peer_best_known->GetAncestor(our_tip->nHeight) == our_tip) {
            state->pindexLastCommonBlock = our_tip;
        }
    } else {
        if (our_tip->GetAncestor(peer_best_known->nHeight) == peer_best_known) {
            state->pindexLastCommonBlock = peer_best_known;
        }
        // FindNextBlocks() would be a noop below, so return here.
        return;
    }
}

Contributor Author


Then we would have returned earlier because there are no blocks from this peer that would help us with our IBD:

bitcoin/src/net_processing.cpp, lines 1384 to 1387 in 80e6ad9:

if (state->pindexBestKnownBlock == nullptr ||
    state->pindexBestKnownBlock->nChainWork < m_chainman.ActiveChain().Tip()->nChainWork ||
    state->pindexBestKnownBlock->nChainWork < m_chainman.MinimumChainWork()) {
    // This peer has nothing interesting.
    return;
}

(second condition)

Contributor


Does it make sense to advance pindexLastCommonBlock in that case? I am just curious, it is definitely not the aim of this PR.

Contributor Author

@mzumsande mzumsande Apr 28, 2025


I'd say probably not, because pindexLastCommonBlock isn't really used for anything but assigning blocks during IBD.
If we will never need any blocks from this peer (until it sends us more headers, at which point pindexLastCommonBlock gets updated anyway; I'll ignore weird invalidateblock scenarios here), I don't see what benefit advancing it would have.

Comment on lines 69 to 70
stall_block = blocks[0].sha256
second_stall = blocks[500].sha256 # another block we don't provide immediately
Contributor


nit: could use an array here:

stall_blocks = [blocks[0].sha256, blocks[500].sha256]

which would make the following change possible:

- peers.append(node.add_outbound_p2p_connection(P2PStaller([stall_block, second_stall]), p2p_idx=id, connection_type="outbound-full-relay"))
+ peers.append(node.add_outbound_p2p_connection(P2PStaller(stall_blocks), p2p_idx=id, connection_type="outbound-full-relay"))

- self.wait_until(lambda: sum(len(peer['inflight']) for peer in node.getpeerinfo()) == 2)
+ self.wait_until(lambda: sum(len(peer['inflight']) for peer in node.getpeerinfo()) == len(stall_blocks))

Contributor Author


done

self.log.info("Check that all outstanding blocks get connected")
self.wait_until(lambda: node.getblockcount() == NUM_BLOCKS)
self.log.info("Check that all outstanding blocks up to the second stall block get connected")
self.wait_until(lambda: node.getblockcount() == 500)
Contributor


nit: could use a variable second_stall_block_index = 500 or something like that to avoid repeating the number 500.

Contributor Author


changed to use one var for each index, plus one array.

@DrahtBot DrahtBot requested a review from stringintech April 25, 2025 12:24
@mzumsande mzumsande force-pushed the 202403_ibd_lastcommonblock branch from c69ee2d to b7ff6a6 Compare April 28, 2025 19:56
@mzumsande
Contributor Author

mzumsande commented Apr 28, 2025

c69ee2d to b7ff6a6: addressed feedback

Is it expected to pass sometimes?

It wasn't supposed to, but I have tracked it down now. It's a bit involved; I hope the following explanation makes sense:

The initial situation is that there are 2 remaining connected peers, the node has just synced blocks up to height 500, but pindexLastCommonBlock is still at 0. The second stall block at height 501 is already requested from one peer, let's say peer0.

If SendMessages is invoked for peer0 first, it will request no blocks, update pindexLastCommonBlock to 500, and won't mark anyone as a staller, because the missing block is already being requested from that same peer.
If peer1 were processed after that, peer0 would be marked as a staller.
But if peer0 is processed twice in a row (which can happen due to the randomization in ThreadMessageHandler), we could request block 1025 from it.
If peer1 is processed after that (its pindexLastCommonBlock is still at 0), all 1024+1 blocks in its window will already have been requested, so we never get to this point because we always continue here.
I think this could be fixed by exchanging the order of this and this code block. I'm not sure this needs fixing though, because the current patch should avoid the problem as well...
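
For reference, the staller-marking condition at play here is roughly the following (simplified from FindNextBlocks; vBlocks collects the blocks to request and waitingfor is the peer the first in-flight block is awaited from):

// Simplified from FindNextBlocks: a staller is only flagged when the window is
// exhausted, nothing could be fetched, and the first in-flight block is awaited
// from a different peer - which is why the processing order of peer0 and peer1
// above decides whether anyone gets marked.
if (pindex->nHeight > nWindowEnd) {
    // We reached the end of the allowed window.
    if (vBlocks.size() == 0 && waitingfor != peer.m_id) {
        // We would be able to fetch if the window were one block larger.
        if (nodeStaller) *nodeStaller = waitingfor;
    }
    break;
}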

Contributor

@vasild vasild left a comment


ACK b7ff6a6

But if peer0 is processed twice in a row (which can happen due to the randomization in ThreadMessageHandler), we could request block 1025 from it

I confirm this; I made it reproducible:

  • git show 855b1fc2b9 |git apply -R
  • observe that p2p_ibd_stalling.py fails (mostly).
  • apply the patch below to mess with the order in which nodes are processed
  • observe that p2p_ibd_stalling.py succeeds
[patch] change order in which nodes are processed
--- i/src/net.cpp
+++ w/src/net.cpp
@@ -3029,12 +3029,14 @@ void CConnman::OpenNetworkConnection(const CAddress& addrConnect, bool fCountFai
         pnode->ConnectedThroughNetwork(),
         GetNodeCount(ConnectionDirection::Out));
 }
 
 Mutex NetEventsInterface::g_msgproc_mutex;
 
+static size_t seq{0};
+
 void CConnman::ThreadMessageHandler()
 {
     LOCK(NetEventsInterface::g_msgproc_mutex);
 
     while (!flagInterruptMsgProc)
     {
@@ -3043,13 +3045,22 @@ void CConnman::ThreadMessageHandler()
         {
             // Randomize the order in which we process messages from/to our peers.
             // This prevents attacks in which an attacker exploits having multiple
             // consecutive connections in the m_nodes list.
             const NodesSnapshot snap{*this, /*shuffle=*/true};
 
-            for (CNode* pnode : snap.Nodes()) {
+            auto nodes = snap.Nodes();
+            std::ranges::sort(nodes, [](CNode* a, CNode* b) {
+                return seq % 3 == 0 ? a->GetId() > b->GetId() : a->GetId() < b->GetId();
+            });
+
+            ++seq;
+
+            LogInfo("===== begin");
+            for (CNode* pnode : nodes) {
+                LogInfo("===== node id: %d, disconnect: %d", pnode->GetId(), pnode->fDisconnect);
                 if (pnode->fDisconnect)
                     continue;
 
                 // Receive messages
                 bool fMoreNodeWork = m_msgproc->ProcessMessages(pnode, flagInterruptMsgProc);
                 fMoreWork |= (fMoreNodeWork && !pnode->fPauseSend);
@@ -3058,12 +3069,13 @@ void CConnman::ThreadMessageHandler()
                 // Send messages
                 m_msgproc->SendMessages(pnode);
 
                 if (flagInterruptMsgProc)
                     return;
             }
+            LogInfo("===== end");
         }
 
         WAIT_LOCK(mutexMsgProc, lock);
         if (!fMoreWork) {
             condMsgProc.wait_until(lock, std::chrono::steady_clock::now() + std::chrono::milliseconds(100), [this]() EXCLUSIVE_LOCKS_REQUIRED(mutexMsgProc) { return fMsgProcWake; });
         }

This allows us to calculate nWindowEnd better.
Before, it would be calculated based on an old pindexLastCommonBlock,
which could result in blocks not being requested for download that could have been
requested, and peers being wrongly marked as stallers.

This reverts commit 49d569c,
which introduced extra logic for when a snapshot was loaded.
With the previous commit, this is no longer necessary
because the more general logic of advancing pindexLastCommonBlock
also covers this case.

getpeerinfo provides a list of blocks that are inflight, which can be used
instead.

This test (which would fail without the previous commit) checks
that after the stalling block was received, we don't incorrectly
mark another peer as a staller immediately.
@mzumsande mzumsande force-pushed the 202403_ibd_lastcommonblock branch from b7ff6a6 to 3bdff9a Compare July 21, 2025 15:26
@mzumsande
Contributor Author

b7ff6a6 to 3bdff9a: rebased due to conflict with #32868

Member

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 3bdff9a

@@ -1411,6 +1411,12 @@ void PeerManagerImpl::FindNextBlocksToDownload(const Peer& peer, unsigned int co
     if (state->pindexLastCommonBlock == state->pindexBestKnownBlock)
         return;
 
+    // If our tip has advanced beyond pindexLastCommonBlock, move it ahead to the tip. We don't need to download any blocks in between, and skipping ahead here
+    // allows us to determine nWindowEnd better.
+    if (m_chainman.ActiveHeight() > state->pindexLastCommonBlock->nHeight && state->pindexBestKnownBlock->GetAncestor(m_chainman.ActiveHeight()) == m_chainman.ActiveTip()) {
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nano non-blocking nits:
Could extract tip = ActiveTip() and use it everywhere here to make it slightly more readable.

Also, this isn’t introduced by this PR, but the second check has always puzzled me a bit. It would be nice to have some util function like IsBlockInChain(const CBlockIndex* block_to_test, const CBlockIndex* chain_tip).
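
Such a helper might look like this (a sketch only; IsBlockInChain is the hypothetical name suggested above, not an existing utility in the codebase):

// Hypothetical helper (name suggested above): returns true if block_to_test
// lies on the chain ending in chain_tip, i.e. it is chain_tip itself or one
// of its ancestors.
bool IsBlockInChain(const CBlockIndex* block_to_test, const CBlockIndex* chain_tip)
{
    if (block_to_test == nullptr || chain_tip == nullptr) return false;
    if (block_to_test->nHeight > chain_tip->nHeight) return false;
    return chain_tip->GetAncestor(block_to_test->nHeight) == block_to_test;
}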

@DrahtBot DrahtBot requested a review from vasild July 23, 2025 20:07
@stringintech
Contributor

stringintech commented Jul 30, 2025

ACK 3bdff9a

I think it would be nice for the commit description to mention how the newly added logic also covers the assumeutxo case; if my understanding is correct, adding something like:

After snapshot loading, our tip becomes the snapshot block. Since peers are verified to have the snapshot in their chain, our tip is an ancestor of the peer's best block, hence the general advancement logic will move pindexLastCommonBlock from any pre-snapshot position to the snapshot height automatically.

@mzumsande
Contributor Author

Thanks! I'll address the nits if / when I have to push again.


Successfully merging this pull request may close these issues.

p2p: Inefficiency in block download / stalling algorithm