Skip to content

Conversation

sdaftuar
Copy link
Member

This is an alternative approach to #11534. Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer.

Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to).

@sdaftuar
Copy link
Member Author

I'm still working on a unit test for this, but wanted to get some eyes on it for consideration in 0.15.0.2.

// us a new block, with ties broken by choosing the more recent
// connection (higher node id)
NodeId worst_peer = -1;
int64_t oldest_block_announcement = time_in_seconds;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to see this be a int_64 max, so I don't need to reason about how monotone our clocks aren't.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@fanquake fanquake added the P2P label Oct 25, 2017
@fanquake fanquake added this to the 0.15.0.2 milestone Oct 25, 2017
@gmaxwell
Copy link
Contributor

Remind me how the scheduler works... when blocks are slow is this going to result in all hosts on the network triggering their extra peer at basically the same time? Or do we not need to insert a bit of randomness in order to spread them out over the checking interval?

@sdaftuar
Copy link
Member Author

Regarding the scheduler, I mentioned this on IRC to @gmaxwell:

gmaxwell: my understanding of the scheduler is that it'll start scheduling callbacks at some point after startup, so i wouldn't expect the network to be completely synced
gmaxwell: also there's random drift, since the scheduler schedules the next callback N milliseconds after the prior one finishes
but worth discussing whether the interval is short enough that there still might be too much concentration of everyone trying to find a new peer

I updated this PR with a unit test.

@sdaftuar sdaftuar force-pushed the 2017-10-stale-tip-new-peer branch from a1a1b8d to c44d6bd Compare October 26, 2017 20:10
@sdaftuar
Copy link
Member Author

Needed rebase after #11490.

Copy link
Contributor

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didnt get very far yet, but given deadlines, figured I'd post early and often :p.

src/net.h Outdated

// Return the number of outbound peers we have in excess of our target (eg,
// if we previously called SetTryNewOutboundPeer(true), and have since set
// to false, we may have extra peers that we wish to disconnect).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to note the (IMO) critical part of the cpp comment:
"Also exclude peers that haven't finished initial connection handshake yet" ie "this may return a value less than the number of outbound connections - the number of normal outbound connection slots in cases where some outbound connections are not yet fully connected".

src/net.cpp Outdated
@@ -1068,7 +1068,7 @@ void CConnman::AcceptConnection(const ListenSocket& hListenSocket) {
SOCKET hSocket = accept(hListenSocket.socket, (struct sockaddr*)&sockaddr, &len);
CAddress addr;
int nInbound = 0;
int nMaxInbound = nMaxConnections - (nMaxOutbound + nMaxFeeler);
int nMaxInbound = nMaxConnections - (nMaxOutbound + nMaxFeeler + nMaxExtraOutbound);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused - it looks like we wont ever have both a feeler and an extra outbound open at one time, why change this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like we wont ever have both a feeler and an extra outbound at one time

We will in general have both a feeler and an extra outbound at the same time. There was an incorrect comment (from a first iteration of this patch) describing the extra outbound peer as "stealing" the feeler connection, but that is not actually true. I'll delete that comment.

src/net.h Outdated
@@ -413,6 +426,10 @@ class CConnman
std::thread threadOpenAddedConnections;
std::thread threadOpenConnections;
std::thread threadMessageHandler;

/** flag for stealing a feeler connection to find a new outbound peer */
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, this comment is wrong, will fix.

@sdaftuar
Copy link
Member Author

Pushed up a few small fixups.

Copy link
Contributor

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didnt review test changes yet.

src/net.cpp Outdated

int desired_outbound = nMaxOutbound + (extra_outbound ? nMaxExtraOutbound : 0);

if (nOutbound >= desired_outbound) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Care to add a comment here (or in the comment block above) noting that we will not make a new connection if we have our "extra slot" used by either a feeler or an "extra outbound", both are not allowed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this comment was made before we chatted offline, but it is possible to have both a feeler and an "extra" outbound peer simultaneously (just not in the same loop iteration, but that's nonsensical anyway since we only initiate a single connection in a single loop iteration, regardless).

return g_last_tip_update < GetTime() - consensusParams.nPowTargetSpacing * 3 && mapBlocksInFlight.empty();
}

// Requires cs_main
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please no more locking documentation - use the self-documenting AssertLockHeld or GUARDED_BY.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

// Only disconnect a peer that has been connected to us for
// some reasonable fraction of our check-frequency, to give
// it time for new information to have arrived.
if (time_in_seconds - pnode->nTimeConnected > STALE_CHECK_INTERVAL/2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldnt this almost always be true (because we only call this function once every STALE_CHECK_INTERVAL)? Also, I think we can turn this down way more than 5 minutes, if we think we're behind, and we connect to someone, it shouldn't take them 5 minutes to send us a header, and if it does, we probably weren't behind and our existing peers are fine, so we should just disconnect them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe almost always, but definitely not always -- our outbound peers can initiate disconnects, too, and it might take us a while to find one.

I've pushed up changes to split the peer eviction onto a shorter timer (2.5 minutes), where we only require 30 seconds of connectivity to disconnect.

@sdaftuar
Copy link
Member Author

sdaftuar commented Oct 27, 2017

@TheBlueMatt raised concerns about holding the 9th connection for too long. One concern would be if we are reducing the number of available inbound slots on the network by too much. Another potentially more serious concern is that if we're holding the 9th connection long enough for a new block to be found and relayed, then over time we'll be selecting for peers with the fastest connectivity.

In the extreme, you could imagine that this algorithm would reduce to nodes grinding away, trying to connect to the peer that was closest to the miners (or the FIBRE network or similar), which is probably not what we want.

In this case, we only intend for the 9th connection to tell us whether we're really behind (eg because our existing peers are all broken) or if the network is just slow. We can determine that very quickly after connection, so I think I'm going to change this so that we disconnect extra peers more quickly (ie on a faster timer than every 10 minutes, and requiring a shorter connection time), and reset the TryExtraOutboundPeer flag in CConnman after disconnecting an extra peer, essentially waiting until we decide at the next stale-tip-check that we're still stale, before trying another 9th peer.

@sdaftuar
Copy link
Member Author

Addressed @TheBlueMatt's nits.

// Also exclude peers that haven't finished initial connection handshake yet
// (so that we don't decide we're over our desired connection limit, and then
// evict some peer that has finished the handshake)
int CConnman::GetExtraOutboundCount()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be greatly simplified by extending CSemaphore to have an adjustable threshold.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on @TheBlueMatt's suggestion (mentioned on IRC yesterday) to just use the feeler connection for the extra peer, rather than initiate a potential 10th connection?

https://botbot.me/freenode/bitcoin-core-dev/2017-10-27/?msg=92828477&page=2

I'm leaning towards doing that, so nMaxExtraOutbound would go away. I guess this function wouldn't change though.

@sipa
Copy link
Member

sipa commented Oct 28, 2017

Needs rebase.

@sdaftuar sdaftuar force-pushed the 2017-10-stale-tip-new-peer branch from f0216b4 to fb74617 Compare October 28, 2017 19:25
@sdaftuar
Copy link
Member Author

Squashed and rebased.

@theuni
Copy link
Member

theuni commented Oct 28, 2017 via email

@sdaftuar
Copy link
Member Author

Fixed a bug and changed the extra peer connection to be exclusive with a feeler, rather than in addition to a feeler.

} else {
connman->SetTryNewOutboundPeer(false);
}
m_stale_tip_check_time += STALE_CHECK_INTERVAL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this loop fails to run when it should-- e.g. host it suspended, then later when it does run it will run additional times. I believe this should be = time_in_seconds + STALE_CHECK_INTERVAL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doubly so because nothing appears to initialize it to a time to begin with...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that; fixed.

Copy link
Contributor

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK ff5f5cd (minus test changes and I still prefer a shorter check interval).

// it time for new information to have arrived.
// Also don't disconnect any peer we're trying to download a
// block from.
constexpr int64_t min_connect_time = EXTRA_PEER_CHECK_INTERVAL / 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why check if we need to disconnect every 2.5 minutes and only require it being online 30 seconds? Why not just make both 30 seconds (or maybe 45 and 30) so that we dont always end up holding our extra peer for 2.5 minutes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just not really sure how frequent is too-frequent for scheduled callbacks; there's a tradeoff between wasting time on calls (especially in this framework, where we always have the callback fire to check for eviction, even if we've never called SetExtraOutboundPeer()) and gaining precision with how long we allow the extra peer to be connected, and it's not clear to me how to figure out the optimal point....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The callback in the connman->GetExtraOutboundCount() == 0 case should be almost free, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, changed to 45 seconds.

void CheckForStaleTipAndEvictPeers(const Consensus::Params &consensusParams);
void EvictExtraOutboundPeers(int64_t time_in_seconds);

int64_t m_stale_tip_check_time;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be private?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, done.

@sdaftuar sdaftuar force-pushed the 2017-10-stale-tip-new-peer branch from 9a6ebea to 8a27fdf Compare October 30, 2017 17:03
@sdaftuar
Copy link
Member Author

sdaftuar commented Oct 30, 2017

Updated to address a nit from @jimpo, and reduce the check-for-eviction interval down to 45 seconds (as @TheBlueMatt suggested).

Squashed https://github.com/sdaftuar/bitcoin/commits/11560.0 -> 8a27fdf

@TheBlueMatt
Copy link
Contributor

re-utACK 8a27fdf

@sdaftuar sdaftuar force-pushed the 2017-10-stale-tip-new-peer branch from f0d5fa1 to 6262915 Compare November 2, 2017 16:40
@sdaftuar
Copy link
Member Author

sdaftuar commented Nov 2, 2017

Squashed https://github.com/sdaftuar/bitcoin/commits/11560.4 -> 6262915

$ git diff 626291508c433488439b662f2e88882048fb59fb 7d9415396b049dd1477febb71930a70870c5a541
diff --git a/src/net_processing.h b/src/net_processing.h
index 0a49972eed..c1f1274f80 100644
--- a/src/net_processing.h
+++ b/src/net_processing.h
@@ -37,7 +37,7 @@ static constexpr int64_t MINIMUM_CONNECT_TIME = 30;
 
 class PeerLogicValidation : public CValidationInterface, public NetEventsInterface {
 private:
-    CConnman* const connman;
+    CConnman* connman;
 
 public:
     explicit PeerLogicValidation(CConnman* connman, CScheduler &scheduler);

@morcos morcos mentioned this pull request Nov 2, 2017
@gmaxwell
Copy link
Contributor

gmaxwell commented Nov 2, 2017

re-ACK.

@theuni
Copy link
Member

theuni commented Nov 2, 2017

utACK 6262915

@laanwj laanwj merged commit 6262915 into bitcoin:master Nov 2, 2017
laanwj added a commit that referenced this pull request Nov 2, 2017
6262915 Add unit test for stale tip checking (Suhas Daftuar)
83df257 Add CConnmanTest to mutate g_connman in tests (João Barbosa)
ac7b37c Connect to an extra outbound peer if our tip is stale (Suhas Daftuar)
db32a65 Track tip update time and last new block announcement from each peer (Suhas Daftuar)
2d4327d net: Allow connecting to extra outbound peers (Suhas Daftuar)

Pull request description:

  This is an alternative approach to #11534.  Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer.

  Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to).

Tree-SHA512: 8f19e910e0bb36867f81783e020af225f356451899adfc7ade1895d6d3bd5afe51c83759610dfd10c62090c4fe404efa0283b2f63fde0bd7da898a1aaa7fb281
maflcko pushed a commit to maflcko/bitcoin-core that referenced this pull request Nov 2, 2017
maflcko pushed a commit to maflcko/bitcoin-core that referenced this pull request Nov 2, 2017
maflcko pushed a commit to maflcko/bitcoin-core that referenced this pull request Nov 2, 2017
If our tip hasn't updated in a while, that may be because our peers are
not relaying blocks to us that we would consider valid. Allow connection
to an additional outbound peer in that circumstance.

Also, periodically check to see if we are exceeding our target number of
outbound peers, and disconnect the one which has least recently
announced a new block to us (choosing the newest such peer in the case
of tie).

Github-Pull: bitcoin#11560
Rebased-From: ac7b37c
maflcko pushed a commit to maflcko/bitcoin-core that referenced this pull request Nov 2, 2017
maflcko pushed a commit to maflcko/bitcoin-core that referenced this pull request Nov 2, 2017
@ghost ghost mentioned this pull request Feb 23, 2018
@ghost
Copy link

ghost commented Feb 24, 2018

I had a slow sync issue which may have been related to and/or fixed by this... See #12514

I eventually did get the recompile of the GitHub code but the version # did not change.

I hope my issue/resolution helps in fixing this issue or confirming it is fixed :)

codablock pushed a commit to codablock/dash that referenced this pull request Sep 26, 2019
6262915 Add unit test for stale tip checking (Suhas Daftuar)
83df257 Add CConnmanTest to mutate g_connman in tests (João Barbosa)
ac7b37c Connect to an extra outbound peer if our tip is stale (Suhas Daftuar)
db32a65 Track tip update time and last new block announcement from each peer (Suhas Daftuar)
2d4327d net: Allow connecting to extra outbound peers (Suhas Daftuar)

Pull request description:

  This is an alternative approach to bitcoin#11534.  Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer.

  Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to).

Tree-SHA512: 8f19e910e0bb36867f81783e020af225f356451899adfc7ade1895d6d3bd5afe51c83759610dfd10c62090c4fe404efa0283b2f63fde0bd7da898a1aaa7fb281
codablock pushed a commit to codablock/dash that referenced this pull request Sep 29, 2019
6262915 Add unit test for stale tip checking (Suhas Daftuar)
83df257 Add CConnmanTest to mutate g_connman in tests (João Barbosa)
ac7b37c Connect to an extra outbound peer if our tip is stale (Suhas Daftuar)
db32a65 Track tip update time and last new block announcement from each peer (Suhas Daftuar)
2d4327d net: Allow connecting to extra outbound peers (Suhas Daftuar)

Pull request description:

  This is an alternative approach to bitcoin#11534.  Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer.

  Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to).

Tree-SHA512: 8f19e910e0bb36867f81783e020af225f356451899adfc7ade1895d6d3bd5afe51c83759610dfd10c62090c4fe404efa0283b2f63fde0bd7da898a1aaa7fb281
barrystyle pushed a commit to PACGlobalOfficial/PAC that referenced this pull request Jan 22, 2020
6262915 Add unit test for stale tip checking (Suhas Daftuar)
83df257 Add CConnmanTest to mutate g_connman in tests (João Barbosa)
ac7b37c Connect to an extra outbound peer if our tip is stale (Suhas Daftuar)
db32a65 Track tip update time and last new block announcement from each peer (Suhas Daftuar)
2d4327d net: Allow connecting to extra outbound peers (Suhas Daftuar)

Pull request description:

  This is an alternative approach to bitcoin#11534.  Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer.

  Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to).

Tree-SHA512: 8f19e910e0bb36867f81783e020af225f356451899adfc7ade1895d6d3bd5afe51c83759610dfd10c62090c4fe404efa0283b2f63fde0bd7da898a1aaa7fb281
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.