Skip to content

Conversation

vasild
Copy link
Contributor

@vasild vasild commented Mar 7, 2025

Before this change the code used to count references to CNode objects manually via CNode::nRefCount. Unneeded CNodes were scheduled for deletion by putting them in CConnman::m_nodes_disconnected and were deleted after their reference count reached zero. Deleting consists of calling PeerManager::FinalizeNode() and destroying the CNode object.

Replace this scheme with std::shared_ptr. This simplifies the code and removes:
CNode::nRefCount
CNode::GetRefCount()
CNode::AddRef()
CNode::Release()
CConnman::m_nodes_disconnected
CConnman::NodesSnapshot

Now creating a snapshot of CConnman::m_nodes is done by simply copying it (under the mutex).

Call PeerManager::FinalizeNode() from the destructor of CNode, which is called when the reference count reaches 0.

This removes brittle code and replaces it by a code from the standard library which is more robust. All the below are possible with the manual reference counting and not possible with shared_ptr:

  • Use an object without adding a reference to it
  • Add too many references to an object, mismatch the add/remove by having more "add"
  • Forget to reduce the reference count, mismatch the add/remove by having less "remove"
  • Go with negative reference count by mistakenly having too many "remove"
  • Destroy an object while there are references to it

There is zero learning curve for the average C++ programmer who knows how shared_ptr works. Not so much with the manual reference counting because one has to learn about AddRef(), Release(), check and maintain where are they called and their relationship with m_nodes_disconnected.


This has some history in #28222 and #10738. See below #32015 (comment) and #32015 (comment).


Possible followup: change ProcessMessages() and SendMessages() to take CNode&: #32015 (comment)

@DrahtBot
Copy link
Contributor

DrahtBot commented Mar 7, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32015.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK pinheadmz
Concept ACK hodlinator, theuni

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #32983 (rpc: refactor: use string_view in Arg/MaybeArg by stickies-v)
  • #32822 (fuzz: Make process_message(s) more deterministic by maflcko)
  • #32747 (Introduce SockMan ("lite"): low-level socket handling for HTTP by pinheadmz)
  • #32394 (net: make m_nodes_mutex non-recursive by vasild)
  • #32326 (net: improve the interface around FindNode() and avoid a recursive mutex lock by vasild)
  • #32278 (doc: better document NetEventsInterface and the deletion of "CNode"s by vasild)
  • #32065 (i2p: make a time gap between creating transient sessions and using them by vasild)
  • #30988 (Split CConnman by vasild)
  • #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)
  • #28584 (Fuzz: extend CConnman tests by vasild)
  • #28463 (p2p: Increase inbound capacity for block-relay only connections by mzumsande)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot DrahtBot added the P2P label Mar 7, 2025
@vasild
Copy link
Contributor Author

vasild commented Mar 7, 2025

cc @willcl-ark, @dergoegge, @theuni

@vasild
Copy link
Contributor Author

vasild commented Mar 7, 2025

cc @hodlinator, you might be interested in this as well

@fanquake
Copy link
Member

fanquake commented Mar 7, 2025

https://github.com/bitcoin/bitcoin/actions/runs/13718139296/job/38367326902?pr=32015#step:7:1603:

Run p2p_headers_presync with args ['/Users/runner/work/bitcoin/bitcoin/ci/scratch/build-aarch64-apple-darwin23.6.0/src/test/fuzz/fuzz', PosixPath('/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync')]net_processing.cpp:295 SetTxRelay: Assertion `!m_tx_relay' failed.
Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync/c12f87158cda99b4fb2e1fe7312af392a1f4634a"

net_processing.cpp:295 SetTxRelay: Assertion `!m_tx_relay' failed.
Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync/c12f87158cda99b4fb2e1fe7312af392a1f4634a"

src/net.cpp Outdated
// after m_nodes_mutex has been released. Destructing a node involves
// calling m_msgproc->FinalizeNode() which acquires cs_main. Lock order
// should be cs_main, m_nodes_mutex.
disconnected_nodes.push_back(node);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that ~CNode is responsible for calling FinalizeNode, the guarantees around which thread does the actual cleanup are lost. It's basically just whichever thread ends up holding the last instance of a CNode. This seems bug prone.

E.g. there are no guarantees (afaict) that disconnected_nodes will actually be the last owner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that it does not matter which thread will destroy the CNode objects.

there are no guarantees (afaict) that disconnected_nodes will actually be the last owner.

That is right, but there is no need for such a guarantee. disconnected_nodes is to guarantee that the objects will not be destroyed while holding m_nodes_mutex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In thread #32015 (comment):

Maybe one could have an additional vector<shared_ptr<CNode>> member in CConnman beyond m_nodes that always keeps the shared_ptr::use_count() above 0. Then use removal from that vector (when use_count() == 1) for controlled deletion with enforced lock order, more like how the code was before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that is necessary. I mean - if we would assume that the code is going to be modified in a careless way to break the logic, in other words to destroy CNode objects while holding m_nodes_mutex, then we might as well assume that the code could be modified in a way that breaks the manual reference counting. IMO the manual reference counting is more fragile and error prone.

Further, if the code is changed to destroy CNode objects while holding m_nodes_mutex, then 83 functional tests fail. I think it is safe to assume that such a thing will not go unnoticed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In thread #32015 (comment):

Started proving out my suggestion and landed on something closer to master.

  • Resurrect m_nodes_disconnected (now holding shared_ptr) along with clearly defined lifetimes.
  • Resurrect DeleteNode(), but make it take an r-value shared_ptr and assert that use_count() == 1.
  • Avoid adding ~CNode() and CNode::m_destruct_cb along with sending it in everywhere in the constructor. This means we don't need to touch 5 of the test files.

master...hodlinator:bitcoin:pr/32015_suggestion

Passes unit tests and functional tests.

@DrahtBot
Copy link
Contributor

DrahtBot commented Mar 7, 2025

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/38367328595

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@vasild vasild force-pushed the cnode_shared_ptr branch from 2747f13 to e35a382 Compare March 7, 2025 13:22
@vasild
Copy link
Contributor Author

vasild commented Mar 7, 2025

2747f135be...e35a382388: fix the CI failure, thanks!

Copy link
Contributor

@hodlinator hodlinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK 36d932c

src/net.cpp Outdated
// after m_nodes_mutex has been released. Destructing a node involves
// calling m_msgproc->FinalizeNode() which acquires cs_main. Lock order
// should be cs_main, m_nodes_mutex.
disconnected_nodes.push_back(node);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In thread #32015 (comment):

Maybe one could have an additional vector<shared_ptr<CNode>> member in CConnman beyond m_nodes that always keeps the shared_ptr::use_count() above 0. Then use removal from that vector (when use_count() == 1) for controlled deletion with enforced lock order, more like how the code was before?

@vasild
Copy link
Contributor Author

vasild commented Mar 10, 2025

36d932cebb...af622d00ba: address #32015 (comment)

@purpleKarrot
Copy link
Contributor

Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics? Every C++ programmer should rewatch https://youtu.be/W2tWOdzgXHA from time to time.

@hodlinator
Copy link
Contributor

Every C++ programmer should rewatch https://youtu.be/W2tWOdzgXHA from time to time.

I remembered that as "the seminal std::rotate()-talk", but it also has "No Raw Pointers" from 48:45. Probably stoked my hatred of smart pointers which led to #31713. To be fair, in the talk he also says that we used to do intrusive ref-counting... so this is at least one step away from that.

Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics?

Guess one could have an InternalCNode which is the actual meat, and make the public CNode instances hold shared_ptr<InternalCNode>. Not sure how well that plays out.
Maybe you @purpleKarrot could take a stab at something along the lines of what you pointed to? Keep in mind that we try to minimize the number of lines touched in a change unless necessary.

@purpleKarrot
Copy link
Contributor

After reviewing the code line by line, I realized that the only places where AddRef was ever called, was directly after construction (plus one place in the fuzz tests, not sure why that is needed). It looks like the CNode instances are not actually shared (good news). Could it be passed around by std::unique_ptr instead? This would greatly simplify future refactoring.

@dergoegge
Copy link
Member

realized that the only places where AddRef was ever called, was directly after construction

It is also called in the RAII helper NodesSnapshot which is used in different threads (i.e. "net" and "msghand"), so unique_ptr won't work unfortunately.

plus one place in the fuzz tests, not sure why that is needed

It's probably just there to have it "covered".

@purpleKarrot
Copy link
Contributor

unique_ptr won't work unfortunately.

Got it. m_nodes needs to be able to be snapshotted. But that seems to be the only place where copies are made. FindNode should return non-owning observers rather than extend the ownership.

@vasild vasild force-pushed the cnode_shared_ptr branch from 8b93e08 to bd6c57f Compare May 14, 2025 11:26
@vasild
Copy link
Contributor Author

vasild commented May 14, 2025

8b93e0894f...bd6c57f387: address suggestions

@vasild vasild force-pushed the cnode_shared_ptr branch from bd6c57f to 20657a7 Compare May 14, 2025 11:39
@vasild
Copy link
Contributor Author

vasild commented May 14, 2025

bd6c57f387...20657a7c8e: rebase

Edit: CI failure looks unrelated

@vasild
Copy link
Contributor Author

vasild commented May 21, 2025

20657a7c8e...cbd7bd7422: rebase due to conflicts

vasild and others added 7 commits May 21, 2025 15:10
This is a non-functional change - the manual reference counting,
disconnection, finalization and deletion are unchanged.

`DeleteNode()` used to call `FinalizeNode()` + `delete` the raw pointer.
Expand `DeleteNode()` at the call sites - call `FinalizeNode()` at the
call sites and instead of raw `delete` either get the container vector
go out of scope or explicitly call `clear()` on it.
…aries

The other copies have a purpose for locking, these were just an awkward
convenience for deleting while iterating.

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Subsequent commits will allow this to be shared with the message
handling thread.

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Rather than relying on refcounting to abstract away our teardown
procedure, make the rules explicit:
- FinalizeNode() is only called when ProcessMessages() and
  SendMessages() is not running. This is enforced by finalizing only at
  the top of the message handler loop.
- FinalizeNode() may only be called on nodes which have been
  disconnected, by virtue of finding them in m_nodes_disconnected. This
  enforces the order: disconnect -> finalize -> delete. This order is
  the current behavior but is not strictly necessary, and the
  restriction will be removed in a later commit.
- Nodes are only deleted after they've been disconnected and finalized.
  This is enforced by deleting them from m_nodes_disconnected and only
  after calling FinalizeNode().
- Callers of ForNode() and ForEachNode() are currently protected from
  finalization or deletion due to their locking of m_nodes_mutex.

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Snapshots of `m_nodes` are not needed anymore because finalization and
deletion of nodes only happens in `ThreadMessageHandler()` and only a
after nodes have been disconnected and moved from `m_nodes` to
`m_nodes_disconnected`.

The manual reference counting becomes unused, remove it too.

Co-authored-by: Cory Fields <cory-nospam-@coryfields.com>
The shared_ptrs now ensure that the CNodes will not be deleted for the
duration of the callbacks.

Disconnection/Finalization may happen while the shared_ptr is being held,
though it won't be deleted for the duration of the callback.

The majority of uses are in ProcessMessages and SendMessages, during which
disconnection may occur but FinalizeNode will not be called.

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
This keeps us from waiting for the condvar timeout (currently 100ms) before
finalizing a node that has been disconnected on the socket handler thread.

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
@vasild vasild force-pushed the cnode_shared_ptr branch from cbd7bd7 to 67bc008 Compare May 21, 2025 13:15
@vasild
Copy link
Contributor Author

vasild commented May 21, 2025

cbd7bd7422...67bc008f62: fix CI

Copy link
Member

@pinheadmz pinheadmz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 67bc008

Only rebase and trivial adjustments since last ack

Show Signature
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

ACK 67bc008f62f9eac1616558770708c1e8547b09f0
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEE5hdzzW4BBA4vG9eM5+KYS2KJyToFAmgt3x8ACgkQ5+KYS2KJ
yTqEKA//W/MMHBH002NCZEngJedxatJrtDCBj/RkfA9ijfOQ+WmNjnd2QNLJa3ds
z6NlZ0zlNNh60XZHGEgdWtEk8TvPECe/E9nEFjn6pze9REcIvizn/ZzomcNWgQxy
KJAptOo63YUXaK8mSQD3rF805KLjxbhEoG/YNOXhm5gBtaTFciHrc/bJk4l5IBIM
fnkC/p2+SnLZksY0ezZIZWCaBnQKEKPsda4ziwzNRQ3ER5Ze/COicn8MAXXSBgV1
nsRnSc6Rl65EbT9sIkH8A4zCcu2NWXqcP9gKnY9NjO48nauPHg8S/RiTySIxsiKZ
G7smIZiKTnfI6GAbolFH9NGhUZCI/AJ/dMGZXsKGl8h8zNv83Nxz9jFFsSRNuHZq
edHaCE9t2v3GnGDEtKeXVXjlKsr0azh3P8r2uBv3iR8FKmZ9tT3mMIyGAiHaz9gi
uihplbgiqbC67eE5NkiCykuG62rnwLqGME8eJaloRQC1PLi9XMAPbdTqOV/7797i
UoxMemp2DpotIiKtcNAUz+5egav29v38NnmPUC7hZGJNKslcQx37McDEKTVLxvZu
hnKM1HE/BmPMiErEUjlCmx7PAhyoToaxYf1LKVhKNp/pMNaf9HeDhO5mkAM2/+cb
E3+4enfaBZbu2OJeA1Y76t4T3jwlBl5i4++J6xO7E/6HA84nZ/4=
=AZtd
-----END PGP SIGNATURE-----

pinheadmz's public key is on openpgp.org

@hodlinator
Copy link
Contributor

Due to my anti-shared_ptr arguments in #32015 (comment) and the main point part of my unique_ptr alternative #32015 (comment) (with some support #32015 (comment)) not being incorporated I have a hard time A-C-K:ing this as-is. Not planning to open a competing PR at this point though.

Curious what Cory comes up with (#32015 (comment)), and time allowing I plan to attempt an eradication of shared_ptr from his next solution if it makes use of it by the end.

Happy that this PR spawned off other improvements like #32326 and #32394.

@theuni
Copy link
Member

theuni commented May 22, 2025

I'm also not in favor of this, but now it's because I think this is addressing the wrong issue altogether.

Generally I think this is a nice change, but it's scary, and if we're going to be changing it I'd rather do it once and for all. As part of the CConnman refactor, CNode was supposed to become internal and not passed into PeerManager at all. In that case, the lifetime and sharing issues would be completely moot. But that didn't get done.. my fault.

So, I'm working on doing that now. It's a large refactor and it will take a while, but I think it's better to fix the root issue here rather than continuing to work around it.

@vasild
Copy link
Contributor Author

vasild commented May 24, 2025

CNode was supposed to become internal and not passed into PeerManager at all

Do you need any help with that?

@theuni
Copy link
Member

theuni commented Jun 16, 2025

@vasild Sure!

This may be a good one to tag-team. I've been distracted for the last two weeks, but coming back to this now.

I did some initial hacking on it and got pretty far along, but it's a substantial amount of work, both on the coding side as well as review. I'm starting to wonder if it might be a good collaborative/working group project.

I'll ping you offline soon to share my thoughts/progress and see what you think.

@DrahtBot
Copy link
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants