p2p: do not make automatic outbound connections to addnode peers #28895
Conversation
to allocate our limited outbound slots correctly, and to ensure addnode connections benefit from their intended protections.

Our addnode logic usually connects the addnode peers before the automatic outbound logic does, but not always, as a connection race can occur. If an addnode peer disconnects us and it was the only one from its network, there can be a race between reconnecting to it with the addnode thread, and it being picked as an automatic network-specific outbound peer. Or our internet connection or router, or the addnode peer, could be temporarily offline, and then return online while the automatic outbound thread runs. Or we could add a new manual peer using the addnode RPC at that time.

The race can be more apparent when our node doesn't know many peers, or with networks like cjdns that currently have few bitcoin peers. When an addnode peer is connected as an automatic outbound peer and is the only connection we have to a network, it can be protected by our new outbound eviction logic and persist in the "wrong role".

Examples on mainnet using logging added in the same pull request:

```
2023-08-12T14:51:05.681743Z [opencon] [net.cpp:1949] [ThreadOpenConnections] [net:debug] Not making automatic network-specific outbound-full-relay connection to i2p peer selected for manual (addnode) connection: [geh...odq.b32.i2p]:0
2023-08-13T03:59:28.050853Z [opencon] [net.cpp:1949] [ThreadOpenConnections] [net:debug] Not making automatic block-relay-only connection to onion peer selected for manual (addnode) connection: kpg...aid.onion:8333
2023-08-13T16:21:26.979052Z [opencon] [net.cpp:1949] [ThreadOpenConnections] [net:debug] Not making automatic network-specific outbound-full-relay connection to cjdns peer selected for manual (addnode) connection: [fcc...8ce]:8333
2023-08-14T20:43:53.401271Z [opencon] [net.cpp:1949] [ThreadOpenConnections] [net:debug] Not making automatic network-specific outbound-full-relay connection to cjdns peer selected for manual (addnode) connection: [fc7...59e]:8333
2023-08-15T00:10:01.894147Z [opencon] [net.cpp:1949] [ThreadOpenConnections] [net:debug] Not making automatic feeler connection to i2p peer selected for manual (addnode) connection: geh...odq.b32.i2p:8333
```

Finally, there does not seem to be a reason to make block-relay or short-lived feeler connections to addnode peers, as the addnode logic will ensure we connect to them if they are up, within the addnode connection limit.

Fix these issues by checking if the address is an addnode peer in our automatic outbound connection logic.
Reported in https://github.com/bitcoin/bitcoin/pull/27213/files#r1291926369 and extracted from an earlier version of #28248. The first commit ought to be feasible to backport, if desired. The second test commit would require #28155.
Thank you for isolating this change and making it clearer what you were going for. From what I understand, the "regression" is not that we might make automatic connections to addnode peers, but rather that if that happens, these connections might stick around as automatic connections for longer (due to the new eviction logic), while not being labeled as manual. So in the end a user might end up with fewer outbound connections than expected for longer than is currently possible, with some of their manual connections incorrectly labeled as automatic. If the above is correct, then I am not sure this is worthy of a backport at this stage in the release cycle, because: 1) this should be rare, and 2) the manual connections are still being made and persisted (just labeled incorrectly).
Empirically, the regression is that we're actively adding addnode peers as outbound ones, and keeping them there. For a couple of years I have run a node with all three of our privacy networks and addnode entries for each, along with clearnet, and I didn't see my addnode peers being connected to as non-manual outbounds. This is new behavior for my node, it's frequent, it removes the intended privileges that addnode entries benefit from, and it impacts precisely the nodes running multiple networks, including ones with fewer peers, that used addnode entries to ensure connection to them.
Sure, but logically the regression is that the duration of these connections increased, not that they are being made in the first place. I'm wondering why you only now noticed this, since the duration of these mislabeled connections prior to the new eviction logic (I assume) depended on regular outbound eviction behavior, which could also result in longer-living connections.
This stems from a change in this release where it was observed and reported. We did not previously automatically add outbound connections to addnode entries in sparse networks to the degree and regularity that we do now.
No longer rare, and not only about mislabeled connections. The new change is also misallocating rare outbound connection slots, and granting incorrect peer privileges, as addnode peers benefit from non-eviction, are not disconnected or discouraged for bad behavior, etc.
Observed and reported in August on the pull that made the change (there was no reaction). While it looks like it could occur somewhat rarely before, with the change it is now frequent, and more noticeable as the connections are protected.
Tested ACK 5e7cc41. I tested this by running on cjdns and ipv4, and then manually disconnecting an addnode cjdns peer and observing the log, both on v26.0rc2 and this branch.
I think I disagree with that. They might now be protected from eviction due to network-specific extra peers, but that eviction is also new: before #27213, we only had the stale-tip eviction, and I think that it would only extremely rarely evict such a peer because of the way it works (see this post for why I think so), so once an automatic connection was made, it was probably never evicted unless there were other causes, like we went offline or they disconnected us. I think that the mechanism mentioned in the OP (race after getting disconnected) is probably what makes this most noticeable. A completely unrelated problem is that cjdns has so few peers (<10?) on mainnet, which is a shame considering how much engineering effort went into it, but I'm not sure what to do about that.
I also saw the network-specific issue with i2p peers, and when adding a new network, though cjdns is indeed a good way to test it. As to what we can do to further cjdns adoption: when one-click I2P support was added to Raspiblitz and Umbrel in December 2022, after this workshop and blog post, the number of recent peers returned by -addrinfo for me quickly went from a couple hundred to more than a thousand, and is now a couple thousand. Cjdns, being a friend-of-a-friend topology, requires the additional step of finding a friend to connect to the network with, but we have a couple of stable ones in the hardcoded DNS seeds, and maybe I can encourage and work with the node-in-a-box packages to get support for it added (and do more related workshops).
ACK 5e7cc41
```cpp
return (m_added_node_params.size() < 24 // bound the query to a reasonable limit
        && std::any_of(m_added_node_params.cbegin(), m_added_node_params.cend(),
                       [&](const auto& p) { return p.m_added_node == addr_str || p.m_added_node == addr_port_str; }));
```
(hit "submit" too early)
nit, non-blocker: I do not think that this `< 24` cap is needed. What is the max number of added nodes expected?

I think it is not worth doing, but in theory, if this is to be optimized, we can observe that `addr_port_str` begins with `addr_str` (for example, `aaa:8333` and `aaa`). If we are checking whether the string `aaaa` is equal to either one, the way this is done now would traverse the entire `aaaa` twice, whereas it can be done with a single traversal. But please ignore this :)
I think the number of added nodes is up to the user, without a limit, though 8 at a time is the addnode connection limit. We can change `m_added_node_params` to an unordered set later on, as you suggested in #28248, to make the query constant-time on average and drop the size check. Kept it simple here.
utACK 5e7cc41
I think it's important that nodes added by `addnode` always get the elevated protection level.
> A completely unrelated problem is that cjdns has so few peers (<10?) on mainnet, which is a shame considering how much engineering effort went into it, but I'm not sure what to do about that.
I'm still trying to wrap my head around CJDNS and how it can be set up in a Docker-based environment. If successful, I'll try creating an Umbrel app for it, which should hopefully help with growing the network.
utACK 5e7cc41
Added the fix to #28872 for backporting.
Github-Pull: bitcoin#28895
Rebased-From: cc62716
2f86d30 doc: update release notes for v26.0rc3 (fanquake)
3b6c7f2 doc: update manual pages for v26.0rc3 (fanquake)
3db4d1c build: bump version to v26.0rc3 (fanquake)
6045f38 build: Fix regression in "ARMv8 CRC32 intrinsics" test (Hennadii Stepanov)
5eaa179 ci: Avoid toolset ambiguity that MSVC can't handle (Hennadii Stepanov)
55af112 p2p: do not make automatic outbound connections to addnode peers (Jon Atack)
5e0bcc1 ci: Switch from `apt` to `apt-get` (Hennadii Stepanov)
437a531 ci: Update apt cache (Hennadii Stepanov)
1488648 pool: change memusage_test to use int64_t, add allocation check (Martin Leitner-Ankerl)
bcc183c pool: make sure PoolAllocator uses the correct alignment (Martin Leitner-Ankerl)
7dda499 doc: regenerate example bitcoin.conf (fanquake)
5845331 doc: rewrite explanation for -par= (fanquake)

Pull request description:

Currently backports:
* #28858
* #28895 (partial)
* #28913
* #28905
* #28919
* #28925

Also includes changes for rc3, and reintegrating the release-notes.

ACKs for top commit:
hebasto: re-ACK 2f86d30, only #28919 backported since my [recent](#28872 (review)) review.
TheCharlatan: ACK 2f86d30

Tree-SHA512: 43c91b344d37f582081ac184ac59cf76c741317b2b69a24fcd4287eefa8333e20c545e150798f4057d6f4ac8e70ed9cba1c8dd9777b11c1cf8992cce09108727
, bitcoin#27213, bitcoin#28189, bitcoin#28155, bitcoin#28895, partial bitcoin#26396 (networking backports: part 9)

09504bd merge bitcoin#28895: do not make automatic outbound connections to addnode peers (Kittywhiskers Van Gogh)
6cf206c merge bitcoin#28155: improves addnode / m_added_nodes logic (Kittywhiskers Van Gogh)
11d654a merge bitcoin#28189: diversify network outbounds release note (Kittywhiskers Van Gogh)
5dc52b3 merge bitcoin#27213: Diversify automatic outbound connections with respect to networks (Kittywhiskers Van Gogh)
291305b merge bitcoin#26497: Make ConsumeNetAddr always produce valid onion addresses (Kittywhiskers Van Gogh)
6a37934 partial bitcoin#26396: Avoid SetTxRelay for feeler connections (Kittywhiskers Van Gogh)
221a78e merge bitcoin#25156: Introduce PeerManagerImpl::RejectIncomingTxs (Kittywhiskers Van Gogh)
cc694c2 merge bitcoin#22778: Reduce resource usage for inbound block-relay-only connections (Kittywhiskers Van Gogh)
6e6de54 net: use `Peer::m_can_relay_tx` when we mean `m_tx_relay != nullptr` (Kittywhiskers Van Gogh)
77526d2 net: rename `Peer::m_block_relay_only` to `Peer::m_can_tx_relay` (Kittywhiskers Van Gogh)

Pull request description:

## Additional Information

* Dependency for #6333
* When backporting bitcoin#22778, the `m_tx_inventory_known_filter.insert()` call in `queueAndMaybePushInv()` had to be moved out, as `EXCLUSIVE_LOCKS_REQUIRED` does not seem to work with a lambda's parameter list (though it does work with the capture list, [source](https://github.com/dashpay/dash/blob/4b5e39290c8f1c78305e597312408c84e2b06b64/src/net_processing.cpp#L5781)); see the error below.

  <details>
  <summary>Compiler error:</summary>

  ```
  net_processing.cpp:5895:21: note: found near match 'tx_relay->m_tx_inventory_mutex'
  net_processing.cpp:5955:21: error: calling function 'operator()' requires holding mutex 'queueAndMaybePushInv.m_tx_inventory_mutex' exclusively [-Werror,-Wthread-safety-precise]
                      queueAndMaybePushInv(tx_relay, CInv(nInvType, hash));
                      ^
  net_processing.cpp:5955:21: note: found near match 'tx_relay->m_tx_inventory_mutex'
  net_processing.cpp:5977:17: error: calling function 'operator()' requires holding mutex 'queueAndMaybePushInv.m_tx_inventory_mutex' exclusively [-Werror,-Wthread-safety-precise]
                  queueAndMaybePushInv(inv_relay, inv);
                  ^
  ```
  </details>

* Attempting to remove the `EXCLUSIVE_LOCKS_REQUIRED` or the `AssertLockHeld` is not an option due to stricter thread sanitization checks being applied since dash#6319 (and in general, removing annotations is inadvisable regardless).
* We cannot simply lock `peer->GetInvRelay()->m_tx_inventory_mutex` because a) the caller already has a copy of the relay pointer (which means that `m_tx_relay_mutex` was already held) that b) they used to lock `m_tx_inventory_mutex` (which we were already asserting), so c) we cannot simply re-lock `m_tx_inventory_mutex` as it's already held, just through a less circuitous path.
* The reason locking is mentioned instead of asserting is that the compiler treats `peer->GetInvRelay()->m_tx_inventory_mutex` and `tx_relay->m_tx_inventory_mutex` as separate locks (despite being different ways of accessing the same thing) and would complain similarly to the error above when attempting to assert the former while holding the latter.
* As `m_tx_relay` is always initialized for Dash, to mimic the behaviour _expected_ upstream in bitcoin#22778, `SetTxRelay()` will _allow_ `GetTxRelay()` to return `m_can_tx_relay` (while Dash code that demands unconditional access can use `GetInvRelay()` instead), with the lack of `SetTxRelay()` resulting in `GetTxRelay()` feigning the absence of `m_can_tx_relay`. This allows us to retain the same style of checks used upstream instead of using proxies like `!CNode::IsBlockOnlyConn()` to determine if we _should_ use `m_tx_relay`.
* Speaking of proxies, being a block-relay-only connection is now no longer the only reason not to relay transactions. In preparation for bitcoin#26396, proxy usage of `!CNode::IsBlockOnlyConn()` and `!Peer::m_block_relay_only` was replaced with the more explicit `Peer::m_can_tx_relay` before being used to gate the result of `GetTxRelay()`, to help it mimic upstream semantics.

## Breaking Changes

None expected

## Checklist

- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas **(note: N/A)**
- [x] I have added or updated relevant unit/integration/functional/e2e tests
- [x] I have made corresponding changes to the documentation
- [x] I have assigned this pull request to a milestone _(for repository code-owners and collaborators only)_

ACKs for top commit:
UdjinM6: utACK 09504bd

Tree-SHA512: f4f36f12f749b697dd4ad5521ed15f862c93ed4492a047759554aa80a3ce00dbd1bdc0242f7a4468f41f25925d5b79c8ab774d8489317437b1983f0a1277eecb