jonatack
Member

to allocate our limited outbound slots correctly, and to ensure addnode
connections benefit from their intended protections.

Our addnode logic usually connects the addnode peers before the automatic
outbound logic does, but not always, as a connection race can occur. If an
addnode peer disconnects us and it was the only one from its network, there
can be a race between reconnecting to it via the addnode thread and it being
picked as an automatic network-specific outbound peer. Or our internet
connection or router, or the addnode peer, could be temporarily offline and
then return online while the automatic outbound thread runs. Or we could add
a new manual peer using the addnode RPC at that time.

The race can be more apparent when our node doesn't know many peers, or with
networks like cjdns that currently have few bitcoin peers.

When an addnode peer is connected as an automatic outbound peer and is the only
connection we have to a network, it can be protected by our new outbound
eviction logic and persist in the "wrong role".

Examples on mainnet using logging added in the same pull request:

2023-08-12T14:51:05.681743Z [opencon] [net.cpp:1949] [ThreadOpenConnections]
[net:debug] Not making automatic network-specific outbound-full-relay connection
to i2p peer selected for manual (addnode) connection: [geh...odq.b32.i2p]:0

2023-08-13T03:59:28.050853Z [opencon] [net.cpp:1949] [ThreadOpenConnections]
[net:debug] Not making automatic block-relay-only connection to onion peer
selected for manual (addnode) connection: kpg...aid.onion:8333

2023-08-13T16:21:26.979052Z [opencon] [net.cpp:1949] [ThreadOpenConnections]
[net:debug] Not making automatic network-specific outbound-full-relay connection
to cjdns peer selected for manual (addnode) connection: [fcc...8ce]:8333

2023-08-14T20:43:53.401271Z [opencon] [net.cpp:1949] [ThreadOpenConnections]
[net:debug] Not making automatic network-specific outbound-full-relay connection
to cjdns peer selected for manual (addnode) connection: [fc7...59e]:8333

2023-08-15T00:10:01.894147Z [opencon] [net.cpp:1949] [ThreadOpenConnections]
[net:debug] Not making automatic feeler connection to i2p peer selected for
manual (addnode) connection: geh...odq.b32.i2p:8333

Finally, there does not seem to be a reason to make block-relay-only or
short-lived feeler connections to addnode peers, as the addnode logic will
ensure we connect to them, if they are up, within the addnode connection limit.

Fix these issues by checking if the address is an addnode peer in our automatic
outbound connection logic.
@DrahtBot
Contributor

DrahtBot commented Nov 16, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK mzumsande, vasild, guggero, brunoerg

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

@DrahtBot DrahtBot added the P2P label Nov 16, 2023
@jonatack
Member Author

Reported in https://github.com/bitcoin/bitcoin/pull/27213/files#r1291926369 and extracted from an earlier version of #28248.

The first commit ought to be feasible to backport, if desired. The second test commit would require #28155.

@dergoegge
Member

Thank you for isolating this change and making it clearer what you were going for.

From what I understand the "regression" is not that we might make automatic connections to addnode peers but rather that if that happens these connections might stick around as automatic connections for longer (due to the new eviction logic), while not being labeled as manual.

So in the end a user might end up with fewer outbound connections than expected for longer than is currently possible, with some of their manual connections incorrectly labeled as automatic.

If the above is correct, then I am not sure if this is worthy of a backport at this stage in the release cycle because: 1) this should be rare, and 2) the manual connections are still being made and persisted (just labeled incorrectly).

@jonatack
Member Author

Empirically, the regression is that we're actively adding addnode peers as outbound ones and keeping them there. For a couple of years I have run a node on all three of our privacy networks, with addnode entries for each, along with clearnet, and I didn't see my addnode peers being connected to as non-manual outbounds. This is new behavior for my node, it's frequent, it removes the intended privileges that addnode entries benefit from, and it impacts precisely the nodes running multiple networks, including ones with fewer peers, that used addnode entries to ensure connection to them.

@dergoegge
Member

Empirically, the regression is that we're actively adding addnode peers as outbound ones, and keeping them there.

Sure, but logically the regression is that the duration of these connections increased, not that they are being made in the first place.

I'm wondering why you only noticed this now, since the duration of these mislabeled connections prior to the new eviction logic (I assume) depended on regular outbound eviction behavior, which could also result in longer-living connections.

@jonatack
Member Author

jonatack commented Nov 16, 2023

From what I understand the "regression" is not that we might make automatic connections to addnode peers but rather that if that happens these connections might stick around as automatic connections for longer (due to the new eviction logic), while not being labeled as manual.

This stems from a change in this release where it was observed and reported. We did not previously automatically add outbound connections to addnode entries in sparse networks to the degree and regularity that we do now.

So in the end a user might end up with less outbound connections than expected for longer than is currently possible, with some of their manual connections incorrectly labeled as automatic.

Sure but logically, the regression is that the duration of these connections increased not that they are being made in the first place.

  1. This should be rare and 2) the manual connections are still being made and persisted (just labeled incorrectly).

No longer rare, and not only about mislabeled connections. The new change is also misallocating rare outbound connection slots and attributing incorrect peer privileges, as addnode peers benefit from non-eviction, are not disconnected or discouraged for bad behavior, etc.

I'm wondering why you only now noticed this, since the duration of these mislabeled connections prior to the new eviction logic (i assume) depended on regular outbound eviction behavior, which could also result in longer living connections.

Observed and reported in August on the pull that made the change (there was no reaction). While it looks like it could occur somewhat rarely before, with the change it is now frequent, and more noticeable as the connections are protected.

@mzumsande
Contributor

mzumsande commented Nov 16, 2023

Tested ACK 5e7cc41

I tested this by running on cjdns and ipv4, and then manually disconnecting an addnode cjdns peer and observing the log - both on v26.0rc2 and this branch.

Sure but logically, the regression is that the duration of these connections increased not that they are being made in the first place.

I think I disagree with that. They might now be protected from eviction due to network-specific extra peers, but that eviction is also new: before #27213, we only had the stale-tip eviction, and I think it would only extremely rarely evict such a peer because of the way it works (see this post for why I think so), so once an automatic connection was made, it was probably never evicted unless there were other causes, like we went offline or they disconnected us.

I think the mechanism mentioned in the OP (a race after getting disconnected) is probably what makes this most noticeable: ThreadOpenAddedConnections() has a long sleep (60s) that it doesn't interrupt, and even though EXTRA_NETWORK_PEER_INTERVAL = 5min is larger, it would kick in immediately after the only peer from a network disconnected us (and then the next time ~5 minutes later), so ThreadOpenConnections() would usually be quicker than ThreadOpenAddedConnections().

A completely unrelated problem is that cjdns has so few bitcoin peers on mainnet (<10?), which is a shame considering how much engineering effort went into it, but I'm not sure what to do about that.

@jonatack
Member Author

jonatack commented Nov 16, 2023

cjdns has so few peers <10? on mainnet, which is a shame considering how much engineering effort went into it, but I'm not sure what to do about that.

I also saw the network-specific issue with i2p peers, and when adding a new network, though cjdns is indeed a good way to test it. As to what we can do to further cjdns adoption: when one-click I2P support was added to Raspiblitz and Umbrel in December 2022, after this workshop and blog post, the number of recent peers returned by -addrinfo for me went from a couple hundred to quickly more than a thousand, and now a couple thousand. Cjdns, being a friend-of-a-friend topology, requires the additional step of finding a friend to connect to the network with, but we have a couple of stable ones in the hardcoded DNS seeds, and maybe I can encourage and work with the node-in-a-box packages to get support for it in (and do more related workshops).

Contributor

@vasild vasild left a comment

ACK 5e7cc41

Comment on lines +3474 to +3476
return (m_added_node_params.size() < 24 // bound the query to a reasonable limit
&& std::any_of(m_added_node_params.cbegin(), m_added_node_params.cend(),
[&](const auto& p) { return p.m_added_node == addr_str || p.m_added_node == addr_port_str; }));
Contributor

(hit "submit" too early)

nit, non-blocker: I do not think that this < 24 cap is needed. What is the max number of added nodes expected?

I think it is not worth doing, but in theory, if this is to be optimized, then we can observe that addr_port_str begins with addr_str (for example, aaa:8333 and aaa), so if we are checking whether the string aaaa is equal to either one, the way this is done now would traverse the entire aaaa twice, whereas it can be done with a single traversal. But please ignore this :)

Member Author

@jonatack jonatack Nov 17, 2023

I think the number of added nodes is up to the user, without a limit, though 8 at a time is the addnode connection limit. We can change m_added_node_params to an unordered set later on, as you suggested in #28248, to make the query constant-time on average and drop the size check. I kept it simple here.

Contributor

@guggero guggero left a comment

utACK 5e7cc41

I think it's important that nodes added by addnode always get the elevated protection level.

A completely unrelated problem is that this problem exists because cjdns has so few peers <10? on mainnet, which is a shame considering how much engineering effort went into it, but I'm not sure what to do about that.

I'm still trying to wrap my head around CJDNS and how it can be set up in a Docker-based environment. If successful, I'll try creating an Umbrel app for it, which should hopefully help grow the network.

@brunoerg
Contributor

brunoerg commented Nov 17, 2023

utACK 5e7cc41

@DrahtBot DrahtBot requested review from brunoerg and removed request for brunoerg November 17, 2023 16:36
@fanquake fanquake merged commit 4374a87 into bitcoin:master Nov 22, 2023
@fanquake
Member

Added the fix to #28872 for backporting.

fanquake pushed a commit to fanquake/bitcoin that referenced this pull request Nov 22, 2023
Github-Pull: bitcoin#28895
Rebased-From: cc62716
fanquake added a commit that referenced this pull request Nov 22, 2023
2f86d30 doc: update release notes for v26.0rc3 (fanquake)
3b6c7f2 doc: update manual pages for v26.0rc3 (fanquake)
3db4d1c build: bump version to v26.0rc3 (fanquake)
6045f38 build: Fix regression in "ARMv8 CRC32 intrinsics" test (Hennadii Stepanov)
5eaa179 ci: Avoid toolset ambiguity that MSVC can't handle (Hennadii Stepanov)
55af112 p2p: do not make automatic outbound connections to addnode peers (Jon Atack)
5e0bcc1 ci: Switch from `apt` to `apt-get` (Hennadii Stepanov)
437a531 ci: Update apt cache (Hennadii Stepanov)
1488648 pool: change memusage_test to use int64_t, add allocation check (Martin Leitner-Ankerl)
bcc183c pool: make sure PoolAllocator uses the correct alignment (Martin Leitner-Ankerl)
7dda499 doc: regenerate example bitcoin.conf (fanquake)
5845331 doc: rewrite explanation for -par= (fanquake)

Pull request description:

  Currently backports:
  * #28858
  * #28895 (partial)
  * #28913
  * #28905
  * #28919
  * #28925

  Also includes changes for rc3, and reintegrating the release-notes.

ACKs for top commit:
  hebasto:
    re-ACK 2f86d30, only #28919 backported since my [recent](#28872 (review)) review.
  TheCharlatan:
    ACK 2f86d30

Tree-SHA512: 43c91b344d37f582081ac184ac59cf76c741317b2b69a24fcd4287eefa8333e20c545e150798f4057d6f4ac8e70ed9cba1c8dd9777b11c1cf8992cce09108727
@jonatack jonatack deleted the 2023-11-do-not-make-automatic-outbound-connection-to-addnode-peers branch November 22, 2023 18:55
kwvg added a commit to kwvg/dash that referenced this pull request Oct 26, 2024
kwvg added a commit to kwvg/dash that referenced this pull request Oct 26, 2024
kwvg added a commit to kwvg/dash that referenced this pull request Oct 26, 2024
kwvg added a commit to kwvg/dash that referenced this pull request Oct 26, 2024
PastaPastaPasta added a commit to dashpay/dash that referenced this pull request Oct 27, 2024
, bitcoin#27213, bitcoin#28189, bitcoin#28155, bitcoin#28895, partial bitcoin#26396 (networking backports: part 9)

09504bd merge bitcoin#28895: do not make automatic outbound connections to addnode peers (Kittywhiskers Van Gogh)
6cf206c merge bitcoin#28155: improves addnode / m_added_nodes logic (Kittywhiskers Van Gogh)
11d654a merge bitcoin#28189: diversify network outbounds release note (Kittywhiskers Van Gogh)
5dc52b3 merge bitcoin#27213: Diversify automatic outbound connections with respect to networks (Kittywhiskers Van Gogh)
291305b merge bitcoin#26497: Make ConsumeNetAddr always produce valid onion addresses (Kittywhiskers Van Gogh)
6a37934 partial bitcoin#26396: Avoid SetTxRelay for feeler connections (Kittywhiskers Van Gogh)
221a78e merge bitcoin#25156: Introduce PeerManagerImpl::RejectIncomingTxs (Kittywhiskers Van Gogh)
cc694c2 merge bitcoin#22778: Reduce resource usage for inbound block-relay-only connections (Kittywhiskers Van Gogh)
6e6de54 net: use `Peer::m_can_relay_tx` when we mean `m_tx_relay != nullptr` (Kittywhiskers Van Gogh)
77526d2 net: rename `Peer::m_block_relay_only` to `Peer::m_can_tx_relay` (Kittywhiskers Van Gogh)

Pull request description:

  ## Additional Information

  * Dependency for #6333
  * When backporting [bitcoin#22778](bitcoin#22778), the `m_tx_inventory_known_filter.insert()` call in `queueAndMaybePushInv()` had to be moved out, as `EXCLUSIVE_LOCKS_REQUIRED` does not seem to work with lambda parameter lists (though it does work with capture lists, [source](https://github.com/dashpay/dash/blob/4b5e39290c8f1c78305e597312408c84e2b06b64/src/net_processing.cpp#L5781)); see the error below

    <details>

    <summary>Compiler error:</summary>

    ```
    net_processing.cpp:5895:21: note: found near match 'tx_relay->m_tx_inventory_mutex'
    net_processing.cpp:5955:21: error: calling function 'operator()' requires holding mutex 'queueAndMaybePushInv.m_tx_inventory_mutex' exclusively [-Werror,-Wthread-safety-precise]
                        queueAndMaybePushInv(tx_relay, CInv(nInvType, hash));
                        ^
    net_processing.cpp:5955:21: note: found near match 'tx_relay->m_tx_inventory_mutex'
    net_processing.cpp:5977:17: error: calling function 'operator()' requires holding mutex 'queueAndMaybePushInv.m_tx_inventory_mutex' exclusively [-Werror,-Wthread-safety-precise]
                    queueAndMaybePushInv(inv_relay, inv);
                    ^
    ```

    </details>

    * Attempting to remove the `EXCLUSIVE_LOCKS_REQUIRED` or the `AssertLockHeld` is not an option due to stricter thread sanitization checks being applied since [dash#6319](#6319) (and in general, removing annotations is inadvisable regardless)

    * We cannot simply lock `peer->GetInvRelay()->m_tx_inventory_mutex` as a) the caller already has a copy of the relay pointer (which means that `m_tx_relay_mutex` was already held) that b) they used to lock `m_tx_inventory_mutex` (which we were already asserting), so c) we cannot simply re-lock `m_tx_inventory_mutex` as it's already held, just through a less circuitous path.

      * The reason locking is mentioned instead of asserting is that the compiler treats `peer->GetInvRelay()->m_tx_inventory_mutex` and `tx_relay->m_tx_inventory_mutex` as separate locks (despite being different ways of accessing the same thing) and would complain similarly to the error above if attempting to assert the former while holding the latter.

  * As `m_tx_relay` is always initialized for Dash, to mimic the behaviour _expected_ upstream, in [bitcoin#22778](bitcoin#22778), `SetTxRelay()` will _allow_ `GetTxRelay()` to return `m_can_tx_relay` (while Dash code that demands unconditional access can use `GetInvRelay()` instead) with the lack of `SetTxRelay()` resulting in `GetTxRelay()` feigning the absence of `m_can_tx_relay`.

    This allows us to retain the same style of checks used upstream instead of using proxies like `!CNode::IsBlockOnlyConn()` to determine whether we _should_ use `m_tx_relay`.

  * Speaking of proxies, being a block-only connection is no longer the only reason not to relay transactions

    In preparation for [bitcoin#26396](bitcoin#26396), proxy usage of `!CNode::IsBlockOnlyConn()` and `!Peer::m_block_relay_only` was replaced with the more explicit `Peer::m_can_tx_relay` before being used to gate the result of `GetTxRelay()`, to help it mimic upstream semantics.
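  The `SetTxRelay()` / `GetTxRelay()` gating described above can be sketched as follows (a simplified illustrative model of the stated semantics, not the actual Dash code): `m_tx_relay` is always allocated, but `GetTxRelay()` only exposes it once `SetTxRelay()` has been called, feigning its absence otherwise, while `GetInvRelay()` bypasses the gate for code that needs the object unconditionally.

  ```cpp
  #include <cassert>
  #include <memory>

  struct TxRelay {};

  struct Peer {
      // Always initialized for Dash, unlike upstream where it may be absent.
      std::unique_ptr<TxRelay> m_tx_relay{std::make_unique<TxRelay>()};
      bool m_can_tx_relay{false};

      void SetTxRelay() { m_can_tx_relay = true; }
      // Mimics upstream semantics: appears absent until SetTxRelay() is called.
      TxRelay* GetTxRelay() { return m_can_tx_relay ? m_tx_relay.get() : nullptr; }
      // Unconditional access for Dash code that demands it.
      TxRelay* GetInvRelay() { return m_tx_relay.get(); }
  };

  int main()
  {
      Peer peer;
      assert(peer.GetTxRelay() == nullptr);  // feigned absence before SetTxRelay()
      assert(peer.GetInvRelay() != nullptr); // unconditional access still works
      peer.SetTxRelay();
      assert(peer.GetTxRelay() != nullptr);
  }
  ```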

  ## Breaking Changes

  None expected

  ## Checklist

  - [x] I have performed a self-review of my own code
  - [x] I have commented my code, particularly in hard-to-understand areas **(note: N/A)**
  - [x] I have added or updated relevant unit/integration/functional/e2e tests
  - [x] I have made corresponding changes to the documentation
  - [x] I have assigned this pull request to a milestone _(for repository code-owners and collaborators only)_

ACKs for top commit:
  UdjinM6:
    utACK 09504bd

Tree-SHA512: f4f36f12f749b697dd4ad5521ed15f862c93ed4492a047759554aa80a3ce00dbd1bdc0242f7a4468f41f25925d5b79c8ab774d8489317437b1983f0a1277eecb
@bitcoin bitcoin locked and limited conversation to collaborators Nov 21, 2024