Conversation

dergoegge
Member

I don't see the need to have the TxRequestTracker guarded by cs_main. Giving it its own lock would also be more in line with our developer docs.

From developer-notes.md:

Re-architecting the core code so there are better-defined interfaces between
the various components is a goal, with any necessary locking done by the
components (e.g. see the self-contained FillableSigningProvider class and its
cs_KeyStore lock for example).

This PR gives TxRequestTracker its own mutex, thereby removing the need to guard PeerManagerImpl::m_txrequest using cs_main.

@fanquake fanquake added the P2P label Sep 21, 2022
@fanquake fanquake requested review from ajtowns and vasild September 21, 2022 20:16
@DrahtBot
Contributor

DrahtBot commented Sep 21, 2022

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #26551 (net_processing: Track orphans by who provided them by ajtowns)
  • #26295 (Replace global g_cs_orphans lock with local by ajtowns)
  • #25880 (p2p: Make stalling timeout adaptive during IBD by mzumsande)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@jnewbery
Contributor

Concept ACK. Originally suggested in #19988 (comment)

Member

@fanquake fanquake left a comment

Concept ACK. Restarted the *SAN job, as that was an unrelated apt failure.

@dergoegge dergoegge force-pushed the 2022-09-txrequest-cs_main-split branch from 7f9a893 to f2f498a Compare September 22, 2022 10:24
Contributor

@ajtowns ajtowns left a comment

If there's to be a separate lock, I'm not convinced it makes sense to push the lock all the way into the Impl class -- having m_impl GUARDED_BY(m_impl_mutex) in TxRequestTracker, or having m_txrequest GUARDED_BY(m_txrequest_mutex) seem more sensible to me, and the latter allows you to take the lock outside of the various for loops, rather than repeatedly taking it and dropping it on each iteration.

I think the arguments from the original PR still apply though -- this doesn't meaningfully allow any more parallelism / reduce blocking as far as I can see.

@dergoegge
Member Author

Thanks @jnewbery for linking the previous discussion! I was looking through the old PRs yesterday and didn't come across that comment.

If there's to be a separate lock, I'm not convinced it makes sense to push the lock all the way into the Impl class -- having m_impl GUARDED_BY(m_impl_mutex) in TxRequestTracker, or having m_txrequest GUARDED_BY(m_txrequest_mutex) seem more sensible to me, and the latter allows you to take the lock outside of the various for loops, rather than repeatedly taking it and dropping it on each iteration.

@ajtowns I'm not sure if I am on the same page about this but I am willing to be convinced. I used AddrMan (and the dev notes) as a reference for choosing the current design (i.e. putting the lock inside the Impl class).

  • Would you be in favor of refactoring our existing modules to externalize their locks? (e.g. AddrManImpl::cs)
  • Should our developer docs be updated? (I think our lock conventions are a bit all over the place, similar to general coding style, so maybe it helps to clarify our preferences?)

I think the arguments from the original PR still apply though -- this doesn't meaningfully allow any more parallelism / reduce blocking as far as I can see.

I have been looking at reducing cs_main usage within net processing because I don't think that our networking code should be locking cs_main (our main validation lock) as much as it currently does. Reducing the scope of cs_main to validation seems like a good direction to be heading in (especially w.r.t. the kernel project). This PR seemed like an easy small step in that direction (similar to #26140) as it does remove two LOCK(cs_main) call sites.

@dergoegge dergoegge force-pushed the 2022-09-txrequest-cs_main-split branch from f2f498a to 2ac77c4 Compare September 22, 2022 15:32
@ajtowns
Contributor

ajtowns commented Sep 23, 2022

@ajtowns I'm not sure if I am on the same page about this but I am willing to be convinced. I used AddrMan (and the dev notes) as a reference for choosing the current design (i.e. putting the lock inside the Impl class).

* Would you be in favor of refactoring our existing modules to externalize their locks? (e.g. `AddrManImpl::cs`)

In general, I think it's better to leave things alone if we're not making things substantially more reliable/simple/efficient.

For locking, I think the best approach on all of those axes is "having a reference to an object means you can manipulate it; if you can't manipulate it, you don't have a reference to it in the first place"; but that's often hard to achieve.

Maybe compare with what @vasild's proposing in #25390 -- with that you'd just say Synced<TxRequestTracker> m_txrequest and write either m_txrequest->Foo(); to do a single operation that takes then releases the lock, or { auto proxy = *m_txrequest; proxy->Foo(); proxy->Bar(); } to do multiple operations while the lock's held. Meanwhile, if you're not doing threading (like in the unit/fuzz tests) you just allocate a TxRequestTracker directly, and don't worry about locks at all. (Unfortunately, I don't think the implementation there quite works right/clang isn't clever enough to properly understand it, and I haven't been able to come up with a better one. Err, except, maybe...)

I have been looking at reducing cs_main usage within net processing because I don't think that our networking code should be locking cs_main (our main validation lock) as much as it currently does. Reducing the scope of cs_main to validation seems like a good direction to be heading in (especially w.r.t. the kernel project). This PR seemed like an easy small step in that direction (similar to #26140) as it does remove two LOCK(cs_main) call sites.

Yeah... Kind of feel like it's probably better spending coding/review time on the hard steps though? For cs_main, having dedicated, non-global, locks for blockstorage, block indexes, and coin states, so that you don't need cs_main there, seems like the priority. For net, getting rid of CNodeState entirely and passing around Peer& objects seems worthwhile, and I think maybe adding an opaque PeerRef to CNode would allow avoiding the extra map and ensure the Peer object doesn't get deallocated before the CNode object does... We could also perhaps have more things under "only accessed by the message processing thread" by having that thread make read-only copies of the data available to other threads in advance, which would then reduce contention...

@hebasto
Member

hebasto commented Sep 23, 2022

Concept ACK.

@hebasto
Member

hebasto commented Sep 23, 2022

Maybe compare with what @vasild's proposing in #25390 -- with that you'd just say Synced<TxRequestTracker> m_txrequest and write either m_txrequest->Foo(); to do a single operation that takes then releases the lock, or { auto proxy = *m_txrequest; proxy->Foo(); proxy->Bar(); } to do multiple operations while the lock's held.

It makes me think that such an issue should be resolved at the TxRequestTracker class's API level. An object that maintains its own state in a multi-threaded environment is preferable (less error-prone, easier to reason about, etc.).

Contributor

@vasild vasild left a comment

ACK 2ac77c4

This PR adds an internal lock and acquires it inside the methods that touch the relevant variables. This would allow finer grained control - e.g. to lock the mutex only for some part of a method, to improve concurrency. In this PR, however, we don't do that - we lock the mutex for the entire duration of the methods, which is the same as having an external mutex outside, like @ajtowns mentioned above:

... or having m_txrequest GUARDED_BY(m_txrequest_mutex) ... allows you to take the lock outside of the various for loops, rather than repeatedly taking it and dropping it on each iteration.

Which way is better would depend on how this is used by multiple threads. Because it is not, both approaches look equally good (or equally bad) now.

Thanks, @ajtowns, for mentioning #25390. This PR can be achieved by just:

-    TxRequestTracker m_txrequest;
+    Synced<TxRequestTracker> m_txrequest;

and mechanically replacing . with -> (we don't need to hold the lock across multiple method calls). That is virtually a one-line change. See this comment: #25390 (comment); it looks like this PR is doing the "Lots of repetitions" case.

@maflcko
Member

maflcko commented Oct 4, 2022

Could also move it out of the cs_main scope in FinalizeNode?

@dergoegge dergoegge force-pushed the 2022-09-txrequest-cs_main-split branch from 2ac77c4 to 4ec6b41 Compare October 18, 2022 16:47
@dergoegge
Member Author

Rebased, added a method to remove multiple transactions from the tracker at once (to avoid locking the internal lock over and over again), and renamed m_txrequest_mutex -> m_mutex.

Contributor

@vasild vasild left a comment

ACK 1d72d42 See below.

}
}

m_txrequest.ForgetTxs(Span{pblock->vtx});
Contributor

nit:

Suggested change
m_txrequest.ForgetTxs(Span{pblock->vtx});
m_txrequest.ForgetTxs(pblock->vtx);

@vasild
Contributor

vasild commented Oct 21, 2022

Could also move it out of the cs_main scope in FinalizeNode?

I am not sure about it. In PeerManagerImpl::FinalizeNode():

LOCK(cs_main);
...
m_txrequest.DisconnectedPeer(nodeid);
...
assert(m_txrequest.Size() == 0);

hmm, wait! is that a bug in this PR? If somebody modifies (adds to) m_txrequest after DisconnectedPeer() and before the assert() then the assert() will be triggered. And the point of this PR is to be able to access (add to) m_txrequest without cs_main from anywhere 💣 🔥

Maybe return the size from DisconnectedPeer() after the deletion and later assert that the returned size was 0 in FinalizeNode()?

@dergoegge
Member Author

hmm, wait! is that a bug in this PR? If somebody modifies (adds to) m_txrequest after DisconnectedPeer() and before the assert() then the assert() will be triggered. And the point of this PR is to be able to access (add to) m_txrequest without cs_main from anywhere

I don't think it is a bug currently, but it almost is! The assert(m_txrequest.Size() == 0); is gated by if (m_node_state.empty()), so no other peer can modify (add txs to) m_txrequest.

I will still change this as that seems a little brittle wrt future changes.

@vasild
Contributor

vasild commented Oct 22, 2022

How? It seems that the assumption in FinalizeNode() is that m_node_state is modified together/atomically with m_txrequest.

Yes, the current code is fine, but then it is fine even without this PR. The aim is to make it future proof.

Returning the size from DisconnectedPeer() like I suggested above seems to have its own (theoretical) flaw: it could return 1, then m_node_state could be emptied by another thread, we would enter the if, and the assert would fail because 1 != 0.

Maybe just delete the assert? Or expose the mutex of m_txrequest, lock it before DisconnectedPeer() and unlock it after the assert ❓

@@ -1395,7 +1395,8 @@ void PeerManagerImpl::PushNodeVersion(CNode& pnode, const Peer& peer)

void PeerManagerImpl::AddTxAnnouncement(const CNode& node, const GenTxid& gtxid, std::chrono::microseconds current_time)
{
AssertLockHeld(::cs_main); // For m_txrequest
AssertLockHeld(::cs_main);
Member

@maflcko maflcko Oct 24, 2022

Not sure if this is correct. While m_txrequest has an internal mutex to guard against UB, the internal mutex does nothing to guard the processing logic. A mutex different from the internal one is still needed here to guard against several threads calling into this function at the same time.

Member Author

Hm, I think you're right. It also seems similar to what Vasil mentioned: when we call methods on m_txrequest but expect the internal state not to change between those calls, the internal mutex won't help us. Afaict the changes here don't break anything, because cs_main is still held in these places, but I am trying to separate m_txrequest from cs_main, so cs_main should not be needed for any of this after the PR. Will mark as draft until I figure out how to address this. (Seems like the only way to address this is to change the TxRequestTracker interface.)

Member

Yeah, an alternative would be to replace cs_main with g_msgproc_mutex?

Contributor

Or expose the mutex outside of TxRequestTracker and lock for longer duration in the net processing code. I.e. have TxRequestTracker m_txrequest GUARDED_BY(m_its_own_mutex); as suggested by @ajtowns in #26151 (review)

@DrahtBot
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

@DrahtBot
Contributor

There hasn't been much activity lately and the patch still needs rebase. What is the status here?

  • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
  • Is it no longer relevant? ➡️ Please close.
  • Did the author lose interest or time to work on this? ➡️ Please close it and mark it 'Up for grabs' with the label, so that it can be picked up in the future.

@dergoegge
Member Author

Closing this for now, can be marked up for grabs.

#26151 (comment) needs to be addressed for this to move forward. Imo, the interface of TxRequestTracker should change to internally enforce the MAX_PEER_TX_ANNOUNCEMENTS and MAX_PEER_TX_REQUEST_IN_FLIGHT limits.
