Connect to a new outbound peer if our tip is stale #11560

sdaftuar · 2017-10-25T19:44:17Z

This is an alternative approach to #11534. Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer.

Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to).

sdaftuar · 2017-10-25T19:44:46Z

I'm still working on a unit test for this, but wanted to get some eyes on it for consideration in 0.15.0.2.

gmaxwell · 2017-10-25T19:50:08Z

src/net_processing.cpp

+        // us a new block, with ties broken by choosing the more recent
+        // connection (higher node id)
+        NodeId worst_peer = -1;
+        int64_t oldest_block_announcement = time_in_seconds;


I'd prefer to see this be a int_64 max, so I don't need to reason about how monotone our clocks aren't.

gmaxwell · 2017-10-26T01:17:17Z

Remind me how the scheduler works... when blocks are slow is this going to result in all hosts on the network triggering their extra peer at basically the same time? Or do we not need to insert a bit of randomness in order to spread them out over the checking interval?

sdaftuar · 2017-10-26T15:58:54Z

Regarding the scheduler, I mentioned this on IRC to @gmaxwell:

gmaxwell: my understanding of the scheduler is that it'll start scheduling callbacks at some point after startup, so i wouldn't expect the network to be completely synced
gmaxwell: also there's random drift, since the scheduler schedules the next callback N milliseconds after the prior one finishes
but worth discussing whether the interval is short enough that there still might be too much concentration of everyone trying to find a new peer

I updated this PR with a unit test.

sdaftuar · 2017-10-26T20:11:05Z

Needed rebase after #11490.

TheBlueMatt

Didnt get very far yet, but given deadlines, figured I'd post early and often :p.

TheBlueMatt · 2017-10-26T22:30:12Z

src/net.h

+
+    // Return the number of outbound peers we have in excess of our target (eg,
+    // if we previously called SetTryNewOutboundPeer(true), and have since set
+    // to false, we may have extra peers that we wish to disconnect).


You may want to note the (IMO) critical part of the cpp comment:
"Also exclude peers that haven't finished initial connection handshake yet" ie "this may return a value less than the number of outbound connections - the number of normal outbound connection slots in cases where some outbound connections are not yet fully connected".

TheBlueMatt · 2017-10-26T22:55:36Z

src/net.cpp

@@ -1068,7 +1068,7 @@ void CConnman::AcceptConnection(const ListenSocket& hListenSocket) {
    SOCKET hSocket = accept(hListenSocket.socket, (struct sockaddr*)&sockaddr, &len);
    CAddress addr;
    int nInbound = 0;
-    int nMaxInbound = nMaxConnections - (nMaxOutbound + nMaxFeeler);
+    int nMaxInbound = nMaxConnections - (nMaxOutbound + nMaxFeeler + nMaxExtraOutbound);


I'm confused - it looks like we wont ever have both a feeler and an extra outbound open at one time, why change this?

it looks like we wont ever have both a feeler and an extra outbound at one time

We will in general have both a feeler and an extra outbound at the same time. There was an incorrect comment (from a first iteration of this patch) describing the extra outbound peer as "stealing" the feeler connection, but that is not actually true. I'll delete that comment.

sdaftuar · 2017-10-27T18:25:41Z

src/net.h

@@ -413,6 +426,10 @@ class CConnman
    std::thread threadOpenAddedConnections;
    std::thread threadOpenConnections;
    std::thread threadMessageHandler;
+
+    /** flag for stealing a feeler connection to find a new outbound peer */


whoops, this comment is wrong, will fix.

sdaftuar · 2017-10-27T19:03:55Z

Pushed up a few small fixups.

TheBlueMatt

Didnt review test changes yet.

TheBlueMatt · 2017-10-27T18:17:57Z

src/net.cpp

+
+        int desired_outbound = nMaxOutbound + (extra_outbound ? nMaxExtraOutbound : 0);
+
+        if (nOutbound >= desired_outbound) {


Care to add a comment here (or in the comment block above) noting that we will not make a new connection if we have our "extra slot" used by either a feeler or an "extra outbound", both are not allowed.

Not sure if this comment was made before we chatted offline, but it is possible to have both a feeler and an "extra" outbound peer simultaneously (just not in the same loop iteration, but that's nonsensical anyway since we only initiate a single connection in a single loop iteration, regardless).

src/net_processing.cpp

TheBlueMatt · 2017-10-27T18:55:29Z

src/net_processing.cpp

+    return g_last_tip_update < GetTime() - consensusParams.nPowTargetSpacing * 3 && mapBlocksInFlight.empty();
+}
+
+// Requires cs_main


Please no more locking documentation - use the self-documenting AssertLockHeld or GUARDED_BY.

Sounds good

TheBlueMatt · 2017-10-27T19:08:16Z

src/net_processing.cpp

+                // Only disconnect a peer that has been connected to us for
+                // some reasonable fraction of our check-frequency, to give
+                // it time for new information to have arrived.
+                if (time_in_seconds - pnode->nTimeConnected > STALE_CHECK_INTERVAL/2) {


Shouldnt this almost always be true (because we only call this function once every STALE_CHECK_INTERVAL)? Also, I think we can turn this down way more than 5 minutes, if we think we're behind, and we connect to someone, it shouldn't take them 5 minutes to send us a header, and if it does, we probably weren't behind and our existing peers are fine, so we should just disconnect them.

Maybe almost always, but definitely not always -- our outbound peers can initiate disconnects, too, and it might take us a while to find one.

I've pushed up changes to split the peer eviction onto a shorter timer (2.5 minutes), where we only require 30 seconds of connectivity to disconnect.

sdaftuar · 2017-10-27T20:43:14Z

@TheBlueMatt raised concerns about holding the 9th connection for too long. One concern would be if we are reducing the number of available inbound slots on the network by too much. Another potentially more serious concern is that if we're holding the 9th connection long enough for a new block to be found and relayed, then over time we'll be selecting for peers with the fastest connectivity.

In the extreme, you could imagine that this algorithm would reduce to nodes grinding away, trying to connect to the peer that was closest to the miners (or the FIBRE network or similar), which is probably not what we want.

In this case, we only intend for the 9th connection to tell us whether we're really behind (eg because our existing peers are all broken) or if the network is just slow. We can determine that very quickly after connection, so I think I'm going to change this so that we disconnect extra peers more quickly (ie on a faster timer than every 10 minutes, and requiring a shorter connection time), and reset the TryExtraOutboundPeer flag in CConnman after disconnecting an extra peer, essentially waiting until we decide at the next stale-tip-check that we're still stale, before trying another 9th peer.

sdaftuar · 2017-10-27T21:36:20Z

Addressed @TheBlueMatt's nits.

theuni · 2017-10-28T00:12:33Z

src/net.cpp

+// Also exclude peers that haven't finished initial connection handshake yet
+// (so that we don't decide we're over our desired connection limit, and then
+// evict some peer that has finished the handshake)
+int CConnman::GetExtraOutboundCount()


I think this might be greatly simplified by extending CSemaphore to have an adjustable threshold.

Any thoughts on @TheBlueMatt's suggestion (mentioned on IRC yesterday) to just use the feeler connection for the extra peer, rather than initiate a potential 10th connection?

https://botbot.me/freenode/bitcoin-core-dev/2017-10-27/?msg=92828477&page=2

I'm leaning towards doing that, so nMaxExtraOutbound would go away. I guess this function wouldn't change though.

sipa · 2017-10-28T18:22:14Z

Needs rebase.

sdaftuar · 2017-10-28T19:26:14Z

Squashed and rebased.

theuni · 2017-10-28T19:50:03Z

Without seeing the code, I think that would probably be simpler and more obvious. I wonder if we could actually combine the behavior with feelers. Launch a feeler every 20 min or so as usual, and rather than disconnecting immediately, replace a worse connection if necessary, and upgrade from feeler to full outbound.

…

On Oct 28, 2017 11:11 AM, "Suhas Daftuar" ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In src/net.cpp <#11560 (comment)>: > + return m_try_another_outbound_peer; +} + +void CConnman::SetTryNewOutboundPeer(bool flag) +{ + LOCK(m_cs_outbound_peer); + m_try_another_outbound_peer = flag; +} + +// Return the number of peers we have over our outbound connection limit +// Exclude peers that are marked for disconnect, or are going to be +// disconnected soon (eg one-shots and feelers) +// Also exclude peers that haven't finished initial connection handshake yet +// (so that we don't decide we're over our desired connection limit, and then +// evict some peer that has finished the handshake) +int CConnman::GetExtraOutboundCount() Any thoughts on @TheBlueMatt <https://github.com/thebluematt>'s suggestion (mentioned on IRC yesterday) to just use the feeler connection for the extra peer, rather than initiate a potential 10th connection? https://botbot.me/freenode/bitcoin-core-dev/2017-10-27/? msg=92828477&page=2 I'm leaning towards doing that, so nMaxExtraOutbound would go away. I guess this function wouldn't change though. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#11560 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAZdEzhjrlXqpR_iPo_-CIcNtvbYKR8cks5sw0QdgaJpZM4QGjZX> .

sdaftuar · 2017-10-29T13:41:32Z

Fixed a bug and changed the extra peer connection to be exclusive with a feeler, rather than in addition to a feeler.

gmaxwell · 2017-10-29T17:40:00Z

src/net_processing.cpp

+        } else {
+            connman->SetTryNewOutboundPeer(false);
+        }
+        m_stale_tip_check_time += STALE_CHECK_INTERVAL;


If this loop fails to run when it should-- e.g. host it suspended, then later when it does run it will run additional times. I believe this should be = time_in_seconds + STALE_CHECK_INTERVAL.

Doubly so because nothing appears to initialize it to a time to begin with...

Thanks for catching that; fixed.

src/net.cpp

TheBlueMatt

utACK ff5f5cd (minus test changes and I still prefer a shorter check interval).

src/net.cpp

TheBlueMatt · 2017-10-29T22:47:28Z

src/net_processing.cpp

+                // it time for new information to have arrived.
+                // Also don't disconnect any peer we're trying to download a
+                // block from.
+                constexpr int64_t min_connect_time = EXTRA_PEER_CHECK_INTERVAL / 5;


Why check if we need to disconnect every 2.5 minutes and only require it being online 30 seconds? Why not just make both 30 seconds (or maybe 45 and 30) so that we dont always end up holding our extra peer for 2.5 minutes?

I'm just not really sure how frequent is too-frequent for scheduled callbacks; there's a tradeoff between wasting time on calls (especially in this framework, where we always have the callback fire to check for eviction, even if we've never called SetExtraOutboundPeer()) and gaining precision with how long we allow the extra peer to be connected, and it's not clear to me how to figure out the optimal point....

The callback in the connman->GetExtraOutboundCount() == 0 case should be almost free, no?

Alright, changed to 45 seconds.

jimpo · 2017-10-30T16:02:45Z

src/net_processing.h

+    void CheckForStaleTipAndEvictPeers(const Consensus::Params &consensusParams);
+    void EvictExtraOutboundPeers(int64_t time_in_seconds);
+
+    int64_t m_stale_tip_check_time;


Should this be private?

Sounds good, done.

sdaftuar · 2017-10-30T17:04:40Z

Updated to address a nit from @jimpo, and reduce the check-for-eviction interval down to 45 seconds (as @TheBlueMatt suggested).

Squashed https://github.com/sdaftuar/bitcoin/commits/11560.0 -> 8a27fdf

src/net_processing.cpp

TheBlueMatt · 2017-10-30T23:11:16Z

re-utACK 8a27fdf

sdaftuar · 2017-11-02T16:42:34Z

Squashed https://github.com/sdaftuar/bitcoin/commits/11560.4 -> 6262915

$ git diff 626291508c433488439b662f2e88882048fb59fb 7d9415396b049dd1477febb71930a70870c5a541
diff --git a/src/net_processing.h b/src/net_processing.h
index 0a49972eed..c1f1274f80 100644
--- a/src/net_processing.h
+++ b/src/net_processing.h
@@ -37,7 +37,7 @@ static constexpr int64_t MINIMUM_CONNECT_TIME = 30;
 
 class PeerLogicValidation : public CValidationInterface, public NetEventsInterface {
 private:
-    CConnman* const connman;
+    CConnman* connman;
 
 public:
     explicit PeerLogicValidation(CConnman* connman, CScheduler &scheduler);

gmaxwell · 2017-11-02T17:28:02Z

re-ACK.

theuni · 2017-11-02T18:52:24Z

utACK 6262915

6262915 Add unit test for stale tip checking (Suhas Daftuar) 83df257 Add CConnmanTest to mutate g_connman in tests (João Barbosa) ac7b37c Connect to an extra outbound peer if our tip is stale (Suhas Daftuar) db32a65 Track tip update time and last new block announcement from each peer (Suhas Daftuar) 2d4327d net: Allow connecting to extra outbound peers (Suhas Daftuar) Pull request description: This is an alternative approach to #11534. Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer. Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to). Tree-SHA512: 8f19e910e0bb36867f81783e020af225f356451899adfc7ade1895d6d3bd5afe51c83759610dfd10c62090c4fe404efa0283b2f63fde0bd7da898a1aaa7fb281

Github-Pull: bitcoin#11560 Rebased-From: 2d4327d

Github-Pull: bitcoin#11560 Rebased-From: db32a65

If our tip hasn't updated in a while, that may be because our peers are not relaying blocks to us that we would consider valid. Allow connection to an additional outbound peer in that circumstance. Also, periodically check to see if we are exceeding our target number of outbound peers, and disconnect the one which has least recently announced a new block to us (choosing the newest such peer in the case of tie). Github-Pull: bitcoin#11560 Rebased-From: ac7b37c

Github-Pull: bitcoin#11560 Rebased-From: 83df257

Github-Pull: bitcoin#11560 Rebased-From: 6262915

ghost · 2018-02-24T01:20:11Z

I had a slow sync issue which may have been related to and/or fixed by this... See #12514

I eventually did get the recompile of the GitHub code but the version # did not change.

I hope my issue/resolution helps in fixing this issue or confirming it is fixed :)

6262915 Add unit test for stale tip checking (Suhas Daftuar) 83df257 Add CConnmanTest to mutate g_connman in tests (João Barbosa) ac7b37c Connect to an extra outbound peer if our tip is stale (Suhas Daftuar) db32a65 Track tip update time and last new block announcement from each peer (Suhas Daftuar) 2d4327d net: Allow connecting to extra outbound peers (Suhas Daftuar) Pull request description: This is an alternative approach to bitcoin#11534. Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer. Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to). Tree-SHA512: 8f19e910e0bb36867f81783e020af225f356451899adfc7ade1895d6d3bd5afe51c83759610dfd10c62090c4fe404efa0283b2f63fde0bd7da898a1aaa7fb281

gmaxwell reviewed Oct 25, 2017

View reviewed changes

fanquake added the P2P label Oct 25, 2017

fanquake added this to the 0.15.0.2 milestone Oct 25, 2017

maflcko added the Needs backport label Oct 25, 2017

sdaftuar mentioned this pull request Oct 26, 2017

Evict outbound peers if tip is stale #11534

Closed

sdaftuar force-pushed the 2017-10-stale-tip-new-peer branch from a1a1b8d to c44d6bd Compare October 26, 2017 20:10

TheBlueMatt reviewed Oct 26, 2017

View reviewed changes

sdaftuar commented Oct 27, 2017

View reviewed changes

TheBlueMatt reviewed Oct 27, 2017

View reviewed changes

theuni reviewed Oct 28, 2017

View reviewed changes

sdaftuar force-pushed the 2017-10-stale-tip-new-peer branch from f0216b4 to fb74617 Compare October 28, 2017 19:25

gmaxwell reviewed Oct 29, 2017

View reviewed changes

TheBlueMatt reviewed Oct 29, 2017

View reviewed changes

src/net.cpp Show resolved Hide resolved

TheBlueMatt reviewed Oct 29, 2017

View reviewed changes

jimpo reviewed Oct 30, 2017

View reviewed changes

sdaftuar force-pushed the 2017-10-stale-tip-new-peer branch from 9a6ebea to 8a27fdf Compare October 30, 2017 17:03

achow101 reviewed Oct 30, 2017

View reviewed changes

src/net_processing.cpp Show resolved Hide resolved

promag and others added 2 commits November 2, 2017 12:39

Add CConnmanTest to mutate g_connman in tests

83df257

Add unit test for stale tip checking

6262915

sdaftuar force-pushed the 2017-10-stale-tip-new-peer branch from f0d5fa1 to 6262915 Compare November 2, 2017 16:40

morcos mentioned this pull request Nov 2, 2017

0.15: Backports #11592

Merged

laanwj merged commit 6262915 into bitcoin:master Nov 2, 2017

maflcko pushed a commit to maflcko/bitcoin-core that referenced this pull request Nov 2, 2017

net: Allow connecting to extra outbound peers

49bf090

Github-Pull: bitcoin#11560 Rebased-From: 2d4327d

maflcko pushed a commit to maflcko/bitcoin-core that referenced this pull request Nov 2, 2017

Track tip update time and last new block announcement from each peer

459f2db

Github-Pull: bitcoin#11560 Rebased-From: db32a65

maflcko pushed a commit to maflcko/bitcoin-core that referenced this pull request Nov 2, 2017

Add CConnmanTest to mutate g_connman in tests

2ed0647

Github-Pull: bitcoin#11560 Rebased-From: 83df257

maflcko pushed a commit to maflcko/bitcoin-core that referenced this pull request Nov 2, 2017

Add unit test for stale tip checking

fc308a6

Github-Pull: bitcoin#11560 Rebased-From: 6262915

maflcko removed the Needs backport label Nov 2, 2017

ghost mentioned this pull request Feb 23, 2018

Slow Sync & 2 GUI bugs #12514

Closed

sdaftuar mentioned this pull request Sep 12, 2018

Test coverage of our networking code #14210

Open

sdaftuar mentioned this pull request Jun 11, 2019

p2p: Add 2 outbound block-relay-only connections #15759

Merged

sdaftuar mentioned this pull request Sep 12, 2019

Syncing headers with feeler-peers #16859

Open

UdjinM6 mentioned this pull request Sep 26, 2019

Backport same PRs as done in bitcoin#11592 dashpay/dash#3113

Merged

bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022


		int desired_outbound = nMaxOutbound + (extra_outbound ? nMaxExtraOutbound : 0);

		if (nOutbound >= desired_outbound) {

Connect to a new outbound peer if our tip is stale #11560

Connect to a new outbound peer if our tip is stale #11560

Uh oh!

Conversation

sdaftuar commented Oct 25, 2017

Uh oh!

sdaftuar commented Oct 25, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gmaxwell commented Oct 26, 2017

Uh oh!

sdaftuar commented Oct 26, 2017

Uh oh!

sdaftuar commented Oct 26, 2017

Uh oh!

TheBlueMatt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sdaftuar commented Oct 27, 2017

Uh oh!

TheBlueMatt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sdaftuar commented Oct 27, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sdaftuar commented Oct 27, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sipa commented Oct 28, 2017

Uh oh!

sdaftuar commented Oct 28, 2017

Uh oh!

theuni commented Oct 28, 2017 via email

Uh oh!

sdaftuar commented Oct 29, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

TheBlueMatt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

sdaftuar commented Oct 27, 2017 •

edited

Loading

sdaftuar commented Oct 30, 2017 •

edited

Loading