Conversation

dergoegge
Member

I don't see the need to have the TxRequestTracker guarded by cs_main. Giving it its own lock would also be more in line with our developer docs.

From developer-notes.md:

Re-architecting the core code so there are better-defined interfaces between
the various components is a goal, with any necessary locking done by the
components (e.g. see the self-contained FillableSigningProvider class and its
cs_KeyStore lock for example).

This PR gives TxRequestTracker its own mutex, thereby removing the need to guard PeerManagerImpl::m_txrequest using cs_main.

@fanquake fanquake added the P2P label Sep 21, 2022
@fanquake fanquake requested review from ajtowns and vasild September 21, 2022 20:16
@DrahtBot
Contributor

DrahtBot commented Sep 21, 2022

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #26551 (net_processing: Track orphans by who provided them by ajtowns)
  • #26295 (Replace global g_cs_orphans lock with local by ajtowns)
  • #25880 (p2p: Make stalling timeout adaptive during IBD by mzumsande)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@jnewbery
Contributor

Concept ACK. Originally suggested in #19988 (comment)

Member

@fanquake fanquake left a comment

Concept ACK. Restarted the *SAN job, as that was an unrelated apt failure.

@dergoegge dergoegge force-pushed the 2022-09-txrequest-cs_main-split branch from 7f9a893 to f2f498a Compare September 22, 2022 10:24
Contributor

@ajtowns ajtowns left a comment

If there's to be a separate lock, I'm not convinced it makes sense to push the lock all the way into the Impl class -- having m_impl GUARDED_BY(m_impl_mutex) in TxRequestTracker, or having m_txrequest GUARDED_BY(m_txrequest_mutex) seem more sensible to me, and the latter allows you to take the lock outside of the various for loops, rather than repeatedly taking it and dropping it on each iteration.

I think the arguments from the original PR still apply though -- this doesn't meaningfully allow any more parallelism / reduce blocking as far as I can see.

@dergoegge
Member Author

Thanks @jnewbery for linking the previous discussion! I was looking through the old PRs yesterday and didn't come across that comment.

If there's to be a separate lock, I'm not convinced it makes sense to push the lock all the way into the Impl class -- having m_impl GUARDED_BY(m_impl_mutex) in TxRequestTracker, or having m_txrequest GUARDED_BY(m_txrequest_mutex) seem more sensible to me, and the latter allows you to take the lock outside of the various for loops, rather than repeatedly taking it and dropping it on each iteration.

@ajtowns I'm not sure if I am on the same page about this but I am willing to be convinced. I used AddrMan (and the dev notes) as a reference for choosing the current design (i.e. putting the lock inside the Impl class).

  • Would you be in favor of refactoring our existing modules to externalize their locks? (e.g. AddrManImpl::cs)
  • Should our developer docs be updated? (I think our lock conventions are a bit all over the place, similar to general coding style, so maybe it helps to clarify our preferences?)

I think the arguments from the original PR still apply though -- this doesn't meaningfully allow any more parallelism / reduce blocking as far as I can see.

I have been looking at reducing cs_main usage within net processing because I don't think that our networking code should be locking cs_main (our main validation lock) as much as it currently does. Reducing the scope of cs_main to validation seems like a good direction to be heading in (especially w.r.t. the kernel project). This PR seemed like an easy small step in that direction (similar to #26140) as it does remove two LOCK(cs_main) call sites.

@dergoegge dergoegge force-pushed the 2022-09-txrequest-cs_main-split branch from f2f498a to 2ac77c4 Compare September 22, 2022 15:32
@ajtowns
Contributor

ajtowns commented Sep 23, 2022

@ajtowns I'm not sure if I am on the same page about this but I am willing to be convinced. I used AddrMan (and the dev notes) as a reference for choosing the current design (i.e. putting the lock inside the Impl class).

* Would you be in favor of refactoring our existing modules to externalize their locks? (e.g. `AddrManImpl::cs`)

In general, I think it's better to leave things alone if we're not making things substantially more reliable/simple/efficient.

For locking, I think the best approach on all of those axes is "having a reference to an object means you can manipulate it; if you can't manipulate it, you don't have a reference to it in the first place"; but that's often hard to achieve.

Maybe compare with what @vasild's proposing in #25390 -- with that you'd just say Synced<TxRequestTracker> m_txrequest and write either m_txrequest->Foo(); to do a single operation that takes then releases the lock, or { auto proxy = *m_txrequest; proxy->Foo(); proxy->Bar(); } to do multiple operations while the lock's held. Meanwhile, if you're not doing threading (like in the unit/fuzz tests) you just allocate a TxRequestTracker directly, and don't worry about locks at all. (Unfortunately, I don't think the implementation there quite works right/clang isn't clever enough to properly understand it, and I haven't been able to come up with a better one. Err, except, maybe...)

I have been looking at reducing cs_main usage within net processing because I don't think that our networking code should be locking cs_main (our main validation lock) as much as it currently does. Reducing the scope of cs_main to validation seems like a good direction to be heading in (especially w.r.t. the kernel project). This PR seemed like an easy small step in that direction (similar to #26140) as it does remove two LOCK(cs_main) call sites.

Yeah... Kind of feel like it's probably better spending coding/review time on the hard steps though? For cs_main, having dedicated, non-global, locks for blockstorage, block indexes, and coin states, so that you don't need cs_main there, seems like the priority. For net, getting rid of CNodeState entirely and passing around Peer& objects seems worthwhile, and I think maybe adding an opaque PeerRef to CNode would allow avoiding the extra map and ensure the Peer object doesn't get deallocated before the CNode object does... We could also perhaps have more things under "only accessed by the message processing thread" by having that thread make read-only copies of the data available to other threads in advance, which would then reduce contention...

@hebasto
Member

hebasto commented Sep 23, 2022

Concept ACK.

@hebasto
Member

hebasto commented Sep 23, 2022

Maybe compare with what @vasild's proposing in #25390 -- with that you'd just say Synced<TxRequestTracker> m_txrequest and write either m_txrequest->Foo(); to do a single operation that takes then releases the lock, or { auto proxy = *m_txrequest; proxy->Foo(); proxy->Bar(); } to do multiple operations while the lock's held.

It makes me think that such an issue should be resolved at the TxRequestTracker class's API level. An object that maintains its own state in a multi-threaded environment is preferable (less error-prone, easier to reason about, etc.).

Contributor

@vasild vasild left a comment

ACK 2ac77c4

This PR adds an internal lock and acquires it inside the methods that touch the relevant variables. This would allow finer grained control - e.g. to lock the mutex only for some part of a method, to improve concurrency. In this PR, however, we don't do that - we lock the mutex for the entire duration of the methods, which is the same as having an external mutex outside, like @ajtowns mentioned above:

... or having m_txrequest GUARDED_BY(m_txrequest_mutex) ... allows you to take the lock outside of the various for loops, rather than repeatedly taking it and dropping it on each iteration.

Which way is better would depend on how this is used by multiple threads. Because it is not, both approaches look equally good (or equally bad) now.

Thanks, @ajtowns, for mentioning #25390. This PR can be achieved by just:

-    TxRequestTracker m_txrequest;
+    Synced<TxRequestTracker> m_txrequest;

and mechanically replacing . with -> (we don't need to hold the lock across multiple method calls). That is virtually a one-line change. See this comment: #25390 (comment); it looks like this PR is doing the "Lots of repetitions" case.

@maflcko
Member

maflcko commented Oct 4, 2022

Could also move it out of the cs_main scope in FinalizeNode?

@dergoegge dergoegge force-pushed the 2022-09-txrequest-cs_main-split branch from 2ac77c4 to 4ec6b41 Compare October 18, 2022 16:47
@dergoegge
Member Author

Rebased, added a method to remove multiple transactions from the tracker at once (to avoid locking the internal lock over and over again), and renamed m_txrequest_mutex -> m_mutex.

Contributor

@vasild vasild left a comment

ACK 1d72d42 See below.

}
}

m_txrequest.ForgetTxs(Span{pblock->vtx});
Contributor

nit:

Suggested change
m_txrequest.ForgetTxs(Span{pblock->vtx});
m_txrequest.ForgetTxs(pblock->vtx);

@vasild
Contributor

vasild commented Oct 21, 2022

Could also move it out of the cs_main scope in FinalizeNode?

I am not sure about it. In PeerManagerImpl::FinalizeNode():

LOCK(cs_main);
...
m_txrequest.DisconnectedPeer(nodeid);
...
assert(m_txrequest.Size() == 0);

hmm, wait! is that a bug in this PR? If somebody modifies (adds to) m_txrequest after DisconnectedPeer() and before the assert() then the assert() will be triggered. And the point of this PR is to be able to access (add to) m_txrequest without cs_main from anywhere 💣 🔥

Maybe return the size from DisconnectedPeer() after the deletion and later assert that the returned size was 0 in FinalizeNode()?

@dergoegge
Member Author

hmm, wait! is that a bug in this PR? If somebody modifies (adds to) m_txrequest after DisconnectedPeer() and before the assert() then the assert() will be triggered. And the point of this PR is to be able to access (add to) m_txrequest without cs_main from anywhere

I don't think it is a bug currently, but it almost is! The assert(m_txrequest.Size() == 0); is gated by if (m_node_state.empty()), so no other peer can modify (add txs to) m_txrequest.

I will still change this as that seems a little brittle wrt future changes.

@vasild
Contributor

vasild commented Oct 22, 2022

How? It seems that the assumption in FinalizeNode() is that m_node_state is modified together/atomically with m_txrequest.

Yes, the current code is fine, but then it is fine even without this PR. The aim is to make it future proof.

Returning the size from DisconnectedPeer() like I suggested above seems to have its own (theoretical) flaw: it could return 1, then m_node_state could be emptied by another thread, we would enter the if, and the assert would fail because 1 != 0.

Maybe just delete the assert? Or expose the mutex of m_txrequest, lock it before DisconnectedPeer() and unlock it after the assert ❓

@@ -1395,7 +1395,8 @@ void PeerManagerImpl::PushNodeVersion(CNode& pnode, const Peer& peer)

void PeerManagerImpl::AddTxAnnouncement(const CNode& node, const GenTxid& gtxid, std::chrono::microseconds current_time)
{
AssertLockHeld(::cs_main); // For m_txrequest
AssertLockHeld(::cs_main);
Member

@maflcko maflcko Oct 24, 2022

Not sure if this is correct. While m_txrequest has an internal mutex to guard against UB, the internal mutex does nothing to guard the processing logic. A mutex different from the internal one is still needed here to guard against several threads calling into this function at the same time.

Member Author

Hm, I think you're right. It also seems similar to what Vasil mentioned: when we call methods on m_txrequest but expect the internal state not to change between those calls, the internal mutex won't help us. Afaict the changes here don't break anything, because cs_main is still held in these places, but I am trying to separate m_txrequest from cs_main, so cs_main should not be needed for any of this after the PR. Will mark as draft until I figure out how to address this. (Seems like the only way to address this is to change the TxRequestTracker interface.)

Member

Yeah, an alternative would be to replace cs_main with g_msgproc_mutex?

Contributor

Or expose the mutex outside of TxRequestTracker and lock for longer duration in the net processing code. I.e. have TxRequestTracker m_txrequest GUARDED_BY(m_its_own_mutex); as suggested by @ajtowns in #26151 (review)

@DrahtBot
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

@DrahtBot
Contributor

There hasn't been much activity lately and the patch still needs rebase. What is the status here?

  • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
  • Is it no longer relevant? ➡️ Please close.
  • Did the author lose interest or time to work on this? ➡️ Please close it and mark it 'Up for grabs' with the label, so that it can be picked up in the future.

@dergoegge
Member Author

Closing this for now, can be marked up for grabs.

#26151 (comment) needs to be addressed for this to move forward. Imo, the interface of TxRequestTracker should change to internally enforce the MAX_PEER_TX_ANNOUNCEMENTS and MAX_PEER_TX_REQUEST_IN_FLIGHT limits.
