-
Notifications
You must be signed in to change notification settings - Fork 37.7k
Connect to a new outbound peer if our tip is stale #11560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm still working on a unit test for this, but wanted to get some eyes on it for consideration in 0.15.0.2. |
src/net_processing.cpp
Outdated
// us a new block, with ties broken by choosing the more recent | ||
// connection (higher node id) | ||
NodeId worst_peer = -1; | ||
int64_t oldest_block_announcement = time_in_seconds; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to see this be a int_64 max, so I don't need to reason about how monotone our clocks aren't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Remind me how the scheduler works... when blocks are slow is this going to result in all hosts on the network triggering their extra peer at basically the same time? Or do we not need to insert a bit of randomness in order to spread them out over the checking interval? |
Regarding the scheduler, I mentioned this on IRC to @gmaxwell:
I updated this PR with a unit test. |
a1a1b8d
to
c44d6bd
Compare
Needed rebase after #11490. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didnt get very far yet, but given deadlines, figured I'd post early and often :p.
src/net.h
Outdated
|
||
// Return the number of outbound peers we have in excess of our target (eg, | ||
// if we previously called SetTryNewOutboundPeer(true), and have since set | ||
// to false, we may have extra peers that we wish to disconnect). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to note the (IMO) critical part of the cpp comment:
"Also exclude peers that haven't finished initial connection handshake yet" ie "this may return a value less than the number of outbound connections - the number of normal outbound connection slots in cases where some outbound connections are not yet fully connected".
src/net.cpp
Outdated
@@ -1068,7 +1068,7 @@ void CConnman::AcceptConnection(const ListenSocket& hListenSocket) { | |||
SOCKET hSocket = accept(hListenSocket.socket, (struct sockaddr*)&sockaddr, &len); | |||
CAddress addr; | |||
int nInbound = 0; | |||
int nMaxInbound = nMaxConnections - (nMaxOutbound + nMaxFeeler); | |||
int nMaxInbound = nMaxConnections - (nMaxOutbound + nMaxFeeler + nMaxExtraOutbound); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused - it looks like we wont ever have both a feeler and an extra outbound open at one time, why change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like we wont ever have both a feeler and an extra outbound at one time
We will in general have both a feeler and an extra outbound at the same time. There was an incorrect comment (from a first iteration of this patch) describing the extra outbound peer as "stealing" the feeler connection, but that is not actually true. I'll delete that comment.
src/net.h
Outdated
@@ -413,6 +426,10 @@ class CConnman | |||
std::thread threadOpenAddedConnections; | |||
std::thread threadOpenConnections; | |||
std::thread threadMessageHandler; | |||
|
|||
/** flag for stealing a feeler connection to find a new outbound peer */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoops, this comment is wrong, will fix.
Pushed up a few small fixups. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didnt review test changes yet.
src/net.cpp
Outdated
|
||
int desired_outbound = nMaxOutbound + (extra_outbound ? nMaxExtraOutbound : 0); | ||
|
||
if (nOutbound >= desired_outbound) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Care to add a comment here (or in the comment block above) noting that we will not make a new connection if we have our "extra slot" used by either a feeler or an "extra outbound", both are not allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this comment was made before we chatted offline, but it is possible to have both a feeler and an "extra" outbound peer simultaneously (just not in the same loop iteration, but that's nonsensical anyway since we only initiate a single connection in a single loop iteration, regardless).
src/net_processing.cpp
Outdated
return g_last_tip_update < GetTime() - consensusParams.nPowTargetSpacing * 3 && mapBlocksInFlight.empty(); | ||
} | ||
|
||
// Requires cs_main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please no more locking documentation - use the self-documenting AssertLockHeld or GUARDED_BY.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good
src/net_processing.cpp
Outdated
// Only disconnect a peer that has been connected to us for | ||
// some reasonable fraction of our check-frequency, to give | ||
// it time for new information to have arrived. | ||
if (time_in_seconds - pnode->nTimeConnected > STALE_CHECK_INTERVAL/2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldnt this almost always be true (because we only call this function once every STALE_CHECK_INTERVAL)? Also, I think we can turn this down way more than 5 minutes, if we think we're behind, and we connect to someone, it shouldn't take them 5 minutes to send us a header, and if it does, we probably weren't behind and our existing peers are fine, so we should just disconnect them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe almost always, but definitely not always -- our outbound peers can initiate disconnects, too, and it might take us a while to find one.
I've pushed up changes to split the peer eviction onto a shorter timer (2.5 minutes), where we only require 30 seconds of connectivity to disconnect.
@TheBlueMatt raised concerns about holding the 9th connection for too long. One concern would be if we are reducing the number of available inbound slots on the network by too much. Another potentially more serious concern is that if we're holding the 9th connection long enough for a new block to be found and relayed, then over time we'll be selecting for peers with the fastest connectivity. In the extreme, you could imagine that this algorithm would reduce to nodes grinding away, trying to connect to the peer that was closest to the miners (or the FIBRE network or similar), which is probably not what we want. In this case, we only intend for the 9th connection to tell us whether we're really behind (eg because our existing peers are all broken) or if the network is just slow. We can determine that very quickly after connection, so I think I'm going to change this so that we disconnect extra peers more quickly (ie on a faster timer than every 10 minutes, and requiring a shorter connection time), and reset the TryExtraOutboundPeer flag in CConnman after disconnecting an extra peer, essentially waiting until we decide at the next stale-tip-check that we're still stale, before trying another 9th peer. |
Addressed @TheBlueMatt's nits. |
// Also exclude peers that haven't finished initial connection handshake yet | ||
// (so that we don't decide we're over our desired connection limit, and then | ||
// evict some peer that has finished the handshake) | ||
int CConnman::GetExtraOutboundCount() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be greatly simplified by extending CSemaphore to have an adjustable threshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts on @TheBlueMatt's suggestion (mentioned on IRC yesterday) to just use the feeler connection for the extra peer, rather than initiate a potential 10th connection?
https://botbot.me/freenode/bitcoin-core-dev/2017-10-27/?msg=92828477&page=2
I'm leaning towards doing that, so nMaxExtraOutbound would go away. I guess this function wouldn't change though.
Needs rebase. |
f0216b4
to
fb74617
Compare
Squashed and rebased. |
Without seeing the code, I think that would probably be simpler and more
obvious.
I wonder if we could actually combine the behavior with feelers. Launch a
feeler every 20 min or so as usual, and rather than disconnecting
immediately, replace a worse connection if necessary, and upgrade from
feeler to full outbound.
…On Oct 28, 2017 11:11 AM, "Suhas Daftuar" ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In src/net.cpp
<#11560 (comment)>:
> + return m_try_another_outbound_peer;
+}
+
+void CConnman::SetTryNewOutboundPeer(bool flag)
+{
+ LOCK(m_cs_outbound_peer);
+ m_try_another_outbound_peer = flag;
+}
+
+// Return the number of peers we have over our outbound connection limit
+// Exclude peers that are marked for disconnect, or are going to be
+// disconnected soon (eg one-shots and feelers)
+// Also exclude peers that haven't finished initial connection handshake yet
+// (so that we don't decide we're over our desired connection limit, and then
+// evict some peer that has finished the handshake)
+int CConnman::GetExtraOutboundCount()
Any thoughts on @TheBlueMatt <https://github.com/thebluematt>'s
suggestion (mentioned on IRC yesterday) to just use the feeler connection
for the extra peer, rather than initiate a potential 10th connection?
https://botbot.me/freenode/bitcoin-core-dev/2017-10-27/?
msg=92828477&page=2
I'm leaning towards doing that, so nMaxExtraOutbound would go away. I
guess this function wouldn't change though.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#11560 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAZdEzhjrlXqpR_iPo_-CIcNtvbYKR8cks5sw0QdgaJpZM4QGjZX>
.
|
Fixed a bug and changed the extra peer connection to be exclusive with a feeler, rather than in addition to a feeler. |
src/net_processing.cpp
Outdated
} else { | ||
connman->SetTryNewOutboundPeer(false); | ||
} | ||
m_stale_tip_check_time += STALE_CHECK_INTERVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this loop fails to run when it should-- e.g. host it suspended, then later when it does run it will run additional times. I believe this should be = time_in_seconds + STALE_CHECK_INTERVAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doubly so because nothing appears to initialize it to a time to begin with...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that; fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK ff5f5cd (minus test changes and I still prefer a shorter check interval).
src/net_processing.cpp
Outdated
// it time for new information to have arrived. | ||
// Also don't disconnect any peer we're trying to download a | ||
// block from. | ||
constexpr int64_t min_connect_time = EXTRA_PEER_CHECK_INTERVAL / 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why check if we need to disconnect every 2.5 minutes and only require it being online 30 seconds? Why not just make both 30 seconds (or maybe 45 and 30) so that we dont always end up holding our extra peer for 2.5 minutes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just not really sure how frequent is too-frequent for scheduled callbacks; there's a tradeoff between wasting time on calls (especially in this framework, where we always have the callback fire to check for eviction, even if we've never called SetExtraOutboundPeer()) and gaining precision with how long we allow the extra peer to be connected, and it's not clear to me how to figure out the optimal point....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The callback in the connman->GetExtraOutboundCount() == 0 case should be almost free, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, changed to 45 seconds.
src/net_processing.h
Outdated
void CheckForStaleTipAndEvictPeers(const Consensus::Params &consensusParams); | ||
void EvictExtraOutboundPeers(int64_t time_in_seconds); | ||
|
||
int64_t m_stale_tip_check_time; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, done.
9a6ebea
to
8a27fdf
Compare
Updated to address a nit from @jimpo, and reduce the check-for-eviction interval down to 45 seconds (as @TheBlueMatt suggested). Squashed https://github.com/sdaftuar/bitcoin/commits/11560.0 -> 8a27fdf |
re-utACK 8a27fdf |
f0d5fa1
to
6262915
Compare
Squashed https://github.com/sdaftuar/bitcoin/commits/11560.4 -> 6262915
|
re-ACK. |
utACK 6262915 |
6262915 Add unit test for stale tip checking (Suhas Daftuar) 83df257 Add CConnmanTest to mutate g_connman in tests (João Barbosa) ac7b37c Connect to an extra outbound peer if our tip is stale (Suhas Daftuar) db32a65 Track tip update time and last new block announcement from each peer (Suhas Daftuar) 2d4327d net: Allow connecting to extra outbound peers (Suhas Daftuar) Pull request description: This is an alternative approach to #11534. Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer. Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to). Tree-SHA512: 8f19e910e0bb36867f81783e020af225f356451899adfc7ade1895d6d3bd5afe51c83759610dfd10c62090c4fe404efa0283b2f63fde0bd7da898a1aaa7fb281
Github-Pull: bitcoin#11560 Rebased-From: 2d4327d
Github-Pull: bitcoin#11560 Rebased-From: db32a65
If our tip hasn't updated in a while, that may be because our peers are not relaying blocks to us that we would consider valid. Allow connection to an additional outbound peer in that circumstance. Also, periodically check to see if we are exceeding our target number of outbound peers, and disconnect the one which has least recently announced a new block to us (choosing the newest such peer in the case of tie). Github-Pull: bitcoin#11560 Rebased-From: ac7b37c
Github-Pull: bitcoin#11560 Rebased-From: 83df257
Github-Pull: bitcoin#11560 Rebased-From: 6262915
I had a slow sync issue which may have been related to and/or fixed by this... See #12514 I eventually did get the recompile of the GitHub code but the version # did not change. I hope my issue/resolution helps in fixing this issue or confirming it is fixed :) |
6262915 Add unit test for stale tip checking (Suhas Daftuar) 83df257 Add CConnmanTest to mutate g_connman in tests (João Barbosa) ac7b37c Connect to an extra outbound peer if our tip is stale (Suhas Daftuar) db32a65 Track tip update time and last new block announcement from each peer (Suhas Daftuar) 2d4327d net: Allow connecting to extra outbound peers (Suhas Daftuar) Pull request description: This is an alternative approach to bitcoin#11534. Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer. Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to). Tree-SHA512: 8f19e910e0bb36867f81783e020af225f356451899adfc7ade1895d6d3bd5afe51c83759610dfd10c62090c4fe404efa0283b2f63fde0bd7da898a1aaa7fb281
6262915 Add unit test for stale tip checking (Suhas Daftuar) 83df257 Add CConnmanTest to mutate g_connman in tests (João Barbosa) ac7b37c Connect to an extra outbound peer if our tip is stale (Suhas Daftuar) db32a65 Track tip update time and last new block announcement from each peer (Suhas Daftuar) 2d4327d net: Allow connecting to extra outbound peers (Suhas Daftuar) Pull request description: This is an alternative approach to bitcoin#11534. Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer. Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to). Tree-SHA512: 8f19e910e0bb36867f81783e020af225f356451899adfc7ade1895d6d3bd5afe51c83759610dfd10c62090c4fe404efa0283b2f63fde0bd7da898a1aaa7fb281
6262915 Add unit test for stale tip checking (Suhas Daftuar) 83df257 Add CConnmanTest to mutate g_connman in tests (João Barbosa) ac7b37c Connect to an extra outbound peer if our tip is stale (Suhas Daftuar) db32a65 Track tip update time and last new block announcement from each peer (Suhas Daftuar) 2d4327d net: Allow connecting to extra outbound peers (Suhas Daftuar) Pull request description: This is an alternative approach to bitcoin#11534. Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer. Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to). Tree-SHA512: 8f19e910e0bb36867f81783e020af225f356451899adfc7ade1895d6d3bd5afe51c83759610dfd10c62090c4fe404efa0283b2f63fde0bd7da898a1aaa7fb281
This is an alternative approach to #11534. Rather than disconnect an outbound peer when our tip looks stale, instead try to connect to an additional outbound peer.
Periodically, check to see if we have more outbound peers than we target (ie if any extra peers are in use), and if so, disconnect the one that least recently announced a new block (breaking ties by choosing the newest peer that we connected to).