-
Notifications
You must be signed in to change notification settings - Fork 37.7k
net: replace manual reference counting of CNode with shared_ptr #32015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32015. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
cc @willcl-ark, @dergoegge, @theuni |
cc @hodlinator, you might be interested in this as well |
https://github.com/bitcoin/bitcoin/actions/runs/13718139296/job/38367326902?pr=32015#step:7:1603: Run p2p_headers_presync with args ['/Users/runner/work/bitcoin/bitcoin/ci/scratch/build-aarch64-apple-darwin23.6.0/src/test/fuzz/fuzz', PosixPath('/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync')]net_processing.cpp:295 SetTxRelay: Assertion `!m_tx_relay' failed.
Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync/c12f87158cda99b4fb2e1fe7312af392a1f4634a"
net_processing.cpp:295 SetTxRelay: Assertion `!m_tx_relay' failed.
Error processing input "/Users/runner/work/bitcoin/bitcoin/ci/scratch/qa-assets/fuzz_corpora/p2p_headers_presync/c12f87158cda99b4fb2e1fe7312af392a1f4634a" |
src/net.cpp
Outdated
// after m_nodes_mutex has been released. Destructing a node involves | ||
// calling m_msgproc->FinalizeNode() which acquires cs_main. Lock order | ||
// should be cs_main, m_nodes_mutex. | ||
disconnected_nodes.push_back(node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that ~CNode
is responsible for calling FinalizeNode
, the guarantees around which thread does the actual cleanup are lost. It's basically just whichever thread ends up holding the last instance of a CNode
. This seems bug prone.
E.g. there are no guarantees (afaict) that disconnected_nodes
will actually be the last owner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that it does not matter which thread will destroy the CNode
objects.
there are no guarantees (afaict) that
disconnected_nodes
will actually be the last owner.
That is right, but there is no need for such a guarantee. disconnected_nodes
is to guarantee that the objects will not be destroyed while holding m_nodes_mutex
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In thread #32015 (comment):
Maybe one could have an additional vector<shared_ptr<CNode>>
member in CConnman
beyond m_nodes
that always keeps the shared_ptr::use_count()
above 0. Then use removal from that vector (when use_count() == 1
) for controlled deletion with enforced lock order, more like how the code was before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that is necessary. I mean - if we would assume that the code is going to be modified in a careless way to break the logic, in other words to destroy CNode
objects while holding m_nodes_mutex
, then we might as well assume that the code could be modified in a way that breaks the manual reference counting. IMO the manual reference counting is more fragile and error prone.
Further, if the code is changed to destroy CNode
objects while holding m_nodes_mutex
, then 83 functional tests fail. I think it is safe to assume that such a thing will not go unnoticed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In thread #32015 (comment):
Started proving out my suggestion and landed on something closer to master.
- Resurrect
m_nodes_disconnected
(now holdingshared_ptr
) along with clearly defined lifetimes. - Resurrect
DeleteNode()
, but make it take an r-valueshared_ptr
and assert thatuse_count() == 1
. - Avoid adding
~CNode()
andCNode::m_destruct_cb
along with sending it in everywhere in the constructor. This means we don't need to touch 5 of the test files.
master...hodlinator:bitcoin:pr/32015_suggestion
Passes unit tests and functional tests.
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK 36d932c
src/net.cpp
Outdated
// after m_nodes_mutex has been released. Destructing a node involves | ||
// calling m_msgproc->FinalizeNode() which acquires cs_main. Lock order | ||
// should be cs_main, m_nodes_mutex. | ||
disconnected_nodes.push_back(node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In thread #32015 (comment):
Maybe one could have an additional vector<shared_ptr<CNode>>
member in CConnman
beyond m_nodes
that always keeps the shared_ptr::use_count()
above 0. Then use removal from that vector (when use_count() == 1
) for controlled deletion with enforced lock order, more like how the code was before?
36d932c
to
af622d0
Compare
|
Instead of passing around shared pointers to mutable data (aka global variables in disguise), would it be possible to increase the level of abstraction and pass around values with regular semantics? Every C++ programmer should rewatch https://youtu.be/W2tWOdzgXHA from time to time. |
I remembered that as "the seminal
Guess one could have an |
After reviewing the code line by line, I realized that the only places where |
It is also called in the RAII helper
It's probably just there to have it "covered". |
Got it. |
|
Edit: CI failure looks unrelated |
|
This is a non-functional change - the manual reference counting, disconnection, finalization and deletion are unchanged. `DeleteNode()` used to call `FinalizeNode()` + `delete` the raw pointer. Expand `DeleteNode()` at the call sites - call `FinalizeNode()` at the call sites and instead of raw `delete` either get the container vector go out of scope or explicitly call `clear()` on it.
…aries The other copies have a purpose for locking, these were just an awkward convenience for deleting while iterating. Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Subsequent commits will allow this to be shared with the message handling thread. Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Rather than relying on refcounting to abstract away our teardown procedure, make the rules explicit: - FinalizeNode() is only called when ProcessMessages() and SendMessages() is not running. This is enforced by finalizing only at the top of the message handler loop. - FinalizeNode() may only be called on nodes which have been disconnected, by virtue of finding them in m_nodes_disconnected. This enforces the order: disconnect -> finalize -> delete. This order is the current behavior but is not strictly necessary, and the restriction will be removed in a later commit. - Nodes are only deleted after they've been disconnected and finalized. This is enforced by deleting them from m_nodes_disconnected and only after calling FinalizeNode(). - Callers of ForNode() and ForEachNode() are currently protected from finalization or deletion due to their locking of m_nodes_mutex. Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Snapshots of `m_nodes` are not needed anymore because finalization and deletion of nodes only happens in `ThreadMessageHandler()` and only a after nodes have been disconnected and moved from `m_nodes` to `m_nodes_disconnected`. The manual reference counting becomes unused, remove it too. Co-authored-by: Cory Fields <cory-nospam-@coryfields.com>
The shared_ptrs now ensure that the CNodes will not be deleted for the duration of the callbacks. Disconnection/Finalization may happen while the shared_ptr is being held, though it won't be deleted for the duration of the callback. The majority of uses are in ProcessMessages and SendMessages, during which disconnection may occur but FinalizeNode will not be called. Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
This keeps us from waiting for the condvar timeout (currently 100ms) before finalizing a node that has been disconnected on the socket handler thread. Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 67bc008
Only rebase and trivial adjustments since last ack
Show Signature
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
ACK 67bc008f62f9eac1616558770708c1e8547b09f0
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCAAdFiEE5hdzzW4BBA4vG9eM5+KYS2KJyToFAmgt3x8ACgkQ5+KYS2KJ
yTqEKA//W/MMHBH002NCZEngJedxatJrtDCBj/RkfA9ijfOQ+WmNjnd2QNLJa3ds
z6NlZ0zlNNh60XZHGEgdWtEk8TvPECe/E9nEFjn6pze9REcIvizn/ZzomcNWgQxy
KJAptOo63YUXaK8mSQD3rF805KLjxbhEoG/YNOXhm5gBtaTFciHrc/bJk4l5IBIM
fnkC/p2+SnLZksY0ezZIZWCaBnQKEKPsda4ziwzNRQ3ER5Ze/COicn8MAXXSBgV1
nsRnSc6Rl65EbT9sIkH8A4zCcu2NWXqcP9gKnY9NjO48nauPHg8S/RiTySIxsiKZ
G7smIZiKTnfI6GAbolFH9NGhUZCI/AJ/dMGZXsKGl8h8zNv83Nxz9jFFsSRNuHZq
edHaCE9t2v3GnGDEtKeXVXjlKsr0azh3P8r2uBv3iR8FKmZ9tT3mMIyGAiHaz9gi
uihplbgiqbC67eE5NkiCykuG62rnwLqGME8eJaloRQC1PLi9XMAPbdTqOV/7797i
UoxMemp2DpotIiKtcNAUz+5egav29v38NnmPUC7hZGJNKslcQx37McDEKTVLxvZu
hnKM1HE/BmPMiErEUjlCmx7PAhyoToaxYf1LKVhKNp/pMNaf9HeDhO5mkAM2/+cb
E3+4enfaBZbu2OJeA1Y76t4T3jwlBl5i4++J6xO7E/6HA84nZ/4=
=AZtd
-----END PGP SIGNATURE-----
pinheadmz's public key is on openpgp.org
Due to my anti- Curious what Cory comes up with (#32015 (comment)), and time allowing I plan to attempt an eradication of Happy that this PR spawned off other improvements like #32326 and #32394. |
I'm also not in favor of this, but now it's because I think this is addressing the wrong issue altogether. Generally I think this is a nice change, but it's scary, and if we're going to be changing it I'd rather do it once and for all. As part of the So, I'm working on doing that now. It's a large refactor and it will take a while, but I think it's better to fix the root issue here rather than continuing to work around it. |
Do you need any help with that? |
@vasild Sure! This may be a good one to tag-team. I've been distracted for the last two weeks, but coming back to this now. I did some initial hacking on it and got pretty far along, but it's a substantial amount of work, both on the coding side as well as review. I'm starting to wonder if it might be a good collaborative/working group project. I'll ping you offline soon to share my thoughts/progress and see what you think. |
🐙 This pull request conflicts with the target branch and needs rebase. |
Before this change the code used to count references to
CNode
objects manually viaCNode::nRefCount
. UnneededCNode
s were scheduled for deletion by putting them inCConnman::m_nodes_disconnected
and were deleted after their reference count reached zero. Deleting consists of callingPeerManager::FinalizeNode()
and destroying theCNode
object.Replace this scheme with
std::shared_ptr
. This simplifies the code and removes:CNode::nRefCount
CNode::GetRefCount()
CNode::AddRef()
CNode::Release()
CConnman::m_nodes_disconnected
CConnman::NodesSnapshot
Now creating a snapshot of
CConnman::m_nodes
is done by simply copying it (under the mutex).Call
PeerManager::FinalizeNode()
from the destructor ofCNode
, which is called when the reference count reaches 0.This removes brittle code and replaces it by a code from the standard library which is more robust. All the below are possible with the manual reference counting and not possible with
shared_ptr
:There is zero learning curve for the average C++ programmer who knows how
shared_ptr
works. Not so much with the manual reference counting because one has to learn aboutAddRef()
,Release()
, check and maintain where are they called and their relationship withm_nodes_disconnected
.This has some history in #28222 and #10738. See below #32015 (comment) and #32015 (comment).
Possible followup: change
ProcessMessages()
andSendMessages()
to takeCNode&
: #32015 (comment)