
Conversation

@sdaftuar (Member) commented Jul 27, 2020

I noticed a couple of places recently where we loop over all inputs of a transaction in order to do some processing on the txids we find in those inputs. There may be thousands of inputs in a transaction, and the same txid may appear many times. In a couple of places in particular, we loop over those txids and add them to a rolling bloom filter; doing that multiple times for the same txid wastes entries in that filter.

This PR fixes that in two places relating to transaction relay: one on the server side, where we look for parent transactions of a tx that we are delivering to a peer to ensure that getdata requests for those parents will succeed; and the other on the client side, where when we process an orphan tx we want to loop over the parent txids and ensure that all are eventually requested from the peer who provided the orphan.

This addresses a couple of related comments left in #19109.
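
To see why the duplicates matter: CRollingBloomFilter (src/bloom.h) does not deduplicate inserts, so every insert of an already-present txid still consumes part of the filter's element budget. A minimal sketch of the problem, as a fragment rather than a full program (the constructor arguments here are illustrative, though they match the recently-announced-invs filter's sizing in this era of the code):

    // Each insert() counts against the filter's capacity, even for a key
    // that is already present; once capacity is exhausted, older entries
    // start being cycled out and can be forgotten prematurely.
    CRollingBloomFilter recently_announced{3500, 0.000001};
    for (const CTxIn& txin : tx.vin) {
        // A tx spending 1000 outputs of a single parent burns 1000 of the
        // 3500 slots on one distinct txid.
        recently_announced.insert(txin.prevout.hash);
    }

Deduplicating the txids before insertion means each distinct parent costs exactly one slot.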

@fanquake fanquake added the P2P label Jul 27, 2020
@fanquake fanquake requested review from ajtowns and sipa July 27, 2020 04:10
@JeremyRubin (Contributor) commented Jul 27, 2020 via email

@jonatack (Member) commented

Good ideas. There is the open question of whether sets are the right data structure here. The first commit (2898d65, "Rewrite parent txid loop of requested transactions") seems to break a number of functional tests for me locally. All the tests pass again after rebasing to keep only the second commit.

interface_rest.py                       | ✖ Failed  | 35 s
mempool_packages.py                     | ✖ Failed  | 3 s
p2p_segwit.py                           | ✖ Failed  | 23 s
rpc_rawtransaction.py                   | ✖ Failed  | 9 s
wallet_avoidreuse.py                    | ✖ Failed  | 6 s
wallet_avoidreuse.py --descriptors      | ✖ Failed  | 7 s
wallet_backup.py                        | ✖ Failed  | 9 s
wallet_balance.py                       | ✖ Failed  | 15 s
wallet_bumpfee.py                       | ✖ Failed  | 5 s
wallet_groups.py                        | ✖ Failed  | 12 s
wallet_listtransactions.py              | ✖ Failed  | 36 s

@DrahtBot (Contributor) commented Jul 27, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@ajtowns (Contributor) commented Jul 27, 2020

> It should be substantially faster and memory friendly for these big txns to fill the parents set as a vector of hashes, sort it, use std::unique to shift out the duplicates, and then erase garbage.

I think this is just:

--- a/src/net_processing.cpp
+++ b/src/net_processing.cpp
@@ -2990,10 +2990,12 @@ void ProcessMessage(
 
             // Deduplicate parent txids, so that we don't have to loop over
             // the same parent txid more than once down below.
-            std::set<uint256> unique_parents;
+            std::vector<uint256> unique_parents;
+            unique_parents.reserve(tx.vin.size());
             for (const CTxIn& txin : tx.vin) {
-                unique_parents.insert(txin.prevout.hash);
+                unique_parents.push_back(txin.prevout.hash);
             }
+            std::sort(unique_parents.begin(), unique_parents.end());
+            unique_parents.erase(std::unique(unique_parents.begin(), unique_parents.end()), unique_parents.end());
             for (const uint256& parent_txid : unique_parents) {
                 if (recentRejects->contains(parent_txid)) {
                     fRejectedParents = true;

I think I agree the vector/sort/unique approach is better -- temporarily allocating up to twice the space it takes to store the transaction seems fine, and allocating memory at once rather than for each input seems better; but not sure it should matter all that much.

Alternatively, could maybe do it as:

    for (const CTxIn& txin : tx.vin) {
        if (recentRejects->contains(txin.prevout.hash)) {
            fRejectedParents = true;
        }
    }
    if (!fRejectedParents && pfrom.m_tx_relay != nullptr) {
        LOCK(pfrom.m_tx_relay->cs_tx_inventory);
        for (const CTxIn& txin : tx.vin) {
            if (pfrom.m_tx_relay->filterInventoryKnown.contains(txin.prevout.hash)) continue;
            if (AlreadyHave(CInv(MSG_TX | nFetchFlags, txin.prevout.hash))) continue;
            pfrom.m_tx_relay->filterInventoryKnown.insert(txin.prevout.hash);
            RequestTx(txin.prevout.hash);
        }
    }

and rely on both recentRejects and filterInventoryKnown lookups being quick enough that duplicates don't matter, and use filterInventoryKnown to ensure the slow ops only get queried once per parent txid, rather than once per prevout.

(Travis reports a lock order violation, btw.)

@sdaftuar (Member, Author) commented

Thanks all for the quick review! @ajtowns I grabbed your code for using a vector instead of a set, and hopefully fixed the lock inversion problem.

@jonatack I'm surprised by the test failures; is it possible the lock inversion caused that?

@sipa (Member) commented Jul 28, 2020

Code review ACK 842f908

I don't understand where the locking failure is coming from. Is it possibly fixed now?

@sdaftuar (Member, Author) commented

> I don't understand where the locking failure is coming from. Is it possibly fixed now?

@sipa Yes it should be fixed now; when I opened the PR initially, I wasn't letting go of the mempool lock before acquiring the others in the first commit.

@jonatack (Member) left a comment

> @jonatack I'm surprised by the test failures; is it possible the lock inversion caused that?

Yes, I believe so; the local test failures (debug build, gcc Debian 10.1.0-6) stemming from the first commit were similar to the Travis one.

ACK 842f908: code review (the changes are straightforward; verified, among other things, mempool.info vs mempool.GetMemPoolParents()), debug build, tests, and ran bitcoind for a day; operation nominal.

@jnewbery (Contributor) left a comment

Looks good. A few comments inline.

> std::set will do O(n) allocations and use about 3x the memory compared to std::vector

@JeremyRubin how did you get 3x? My understanding is that overhead for a std::set node is ~32 bytes, which would imply double the memory usage of a vector in this case.

@sipa (Member) commented Jul 30, 2020

@jnewbery Overhead of a set over a vector is the size of 3 pointers and an int (for red black tree structure), plus 2 pointers (for malloc overhead), and that rounded up to the next multiple of 2 pointers' worth.

For 64-bit systems that means between 48 and 63 bytes overhead.
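
Worked out (following the accounting in the sentence above, for a 64-bit system):

    3 * 8 bytes  (parent/left/right pointers)
      + 8 bytes  (color int, padded to pointer size)
     + 16 bytes  (malloc chunk header)
     = 48 bytes  fixed overhead per node

Rounding the whole allocation up to the next 16-byte boundary then adds anywhere from 0 to 15 bytes more, giving the 48-63 range.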

@jnewbery (Contributor) commented

Thanks @sipa!

@JeremyRubin (Contributor) commented

In case that's not clear for anyone reading along (I frequently have to re-check this logic myself): the implementation we assume for std::set is a red-black tree, with a node layout of:

    template <typename T>
    struct node {
        int color;
        void* parent;
        void* left;
        void* right;
        T element;
    };

This sums to 8 bytes * 4 (the int pads out to 8 bytes) = 32 bytes of overhead per node, with no extra alignment needed for T (unless there's some external alignment requirement > 32). But in std::set, every node is individually allocated with malloc (as opposed to using an arena, which would reduce overhead). Malloc typically requires an extra 16 bytes of metadata per allocation (it can be higher or lower depending on the platform, and this doesn't count whatever other internal bookkeeping the heap may require), and then pads the allocation out to a 16-byte boundary. If you look at MallocUsage, it's ((s + 31) >> 4) << 4: the 31 is really +16 (for malloc metadata) plus +15, with the >> 4 << 4 then truncating to a 16-byte boundary, so the net effect is rounding the allocation up.

This means that for a 32-byte hash, you need an estimated 80 bytes per element in a std::set (so 2.5x, a bit better than 3x, but still excessive). The main issue IMO is less the memory and more the calls to malloc itself. E.g., even if a vector used 2x memory and std::set used 1x, it might still be preferable to use the vector to avoid N calls to malloc.
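
To make the arithmetic concrete, a minimal sketch (assumptions: 64-bit pointers and the rounding formula quoted above; the node layout mirrors the struct sketch, not any particular standard library):

    #include <cstddef>
    #include <cstdio>

    // +16 bytes of malloc metadata, then round up to a 16-byte boundary,
    // mirroring the ((s + 31) >> 4) << 4 formula discussed above.
    std::size_t MallocUsage(std::size_t s) { return ((s + 31) >> 4) << 4; }

    int main() {
        const std::size_t element = 32;          // a 32-byte txid
        const std::size_t rb_node = 3 * 8 + 8;   // parent/left/right + padded color
        std::printf("set node: %zu bytes\n", MallocUsage(element + rb_node)); // 80
        std::printf("vector:   %zu bytes\n", element);                        // 32
        return 0;
    }

Running it reproduces the 80-vs-32 (2.5x) figure above.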

@jnewbery (Contributor) commented

Thanks @JeremyRubin! Very interesting.

@luke-jr (Member) commented Jul 31, 2020

Has unordered_set been evaluated?

@JeremyRubin (Contributor) commented

unordered_set... ah, the promise of O(1) accesses. I don't think it makes sense here, since N is small. Here's my overall thinking:

This piece of code isn't exactly adversarial. Whatever algorithm we pick -- sorting or hash tabling -- is going to be good enough. This is mostly about picking something with a good average runtime and an acceptable worst-case runtime, and preferably low memory on average.

For unordered_set we use about 72 bytes per entry (32 bytes plus 1 pointer, padded out to 64 bytes, and then 8 more for the buckets), and still need to do O(n) allocations (unless we use a custom arena allocator, which would save that). In exchange we get an implementation that is O(N) instead of O(N log N). However, the largest N is about 100 kB / 40 B (largest standard tx divided by the size of an input), so we're looking at about 2,500 inputs, and log2(2500) is about 11. And if you think about sorting a vector of byte arrays, that should be pretty performant compared to all the pointer chasing, hashing, and allocating a hash table makes you do. In the common case of a small number of inputs, the log N matters even less.

So I think a hash set, despite its better big O, is going to be worse on average and not much better in the worst case, if any better at all.
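
For reference, a self-contained sketch of the two strategies being compared (illustrative only; plain 64-bit keys stand in for 32-byte txids so no Bitcoin Core headers or salted hashers are needed):

    #include <algorithm>
    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    // The approach taken in this PR: O(N log N) comparisons, one contiguous
    // allocation for the copy, cache-friendly scans.
    std::vector<uint64_t> DedupSort(std::vector<uint64_t> v) {
        std::sort(v.begin(), v.end());
        v.erase(std::unique(v.begin(), v.end()), v.end());
        return v;
    }

    // The hash-set alternative: O(N) expected, but one node allocation per
    // distinct key plus the bucket array, and order is not preserved.
    std::vector<uint64_t> DedupHash(const std::vector<uint64_t>& v) {
        std::unordered_set<uint64_t> seen(v.begin(), v.end());
        return {seen.begin(), seen.end()};
    }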

If you did want to make a simple, competitive algorithm for this based on hash sets, here's what I would do (a sketch follows below):

  1. allocate a bit vector with 8 bits per input, plus a std::vector with room for N elements
  2. get 5 x 12 bits of entropy for each entry, and check-and-set the corresponding bit in the bit vector for each slice, using the projection trick.
    a. If all 5 bits are already set, insert the item into a std::set or std::unordered_set named likely_dups.
    b. Otherwise, push the element onto the vector.
  3. iterate over the vector, removing each of its elements from the set (quit early if the set becomes empty)
  4. extend the end of the vector with whatever is left in the set

This algorithm is correct because any item that finds at least one unset bit cannot have been seen before, so it is unique. Thus by step 3 the vector is duplicate-free, and the std::unordered_set holds all the likely non-unique entries (but is also, conveniently, duplicate-free itself). We then remove from the set every element that appears in the vector, guaranteeing the set holds only what's not in the vector. Appending those to the vector leaves it containing exactly the unique elements of the original input.

This works really nicely in the expected case (no collisions) because we make only 2 malloc calls and emplace everything into our vector after checking the filter (which should have a low probability of collision), so it should be pretty speedy! And when there are many duplicates, the likely_dups set stays small precisely because duplicates collapse into single entries. The worst case is when every element has exactly one duplicate; then we end up doing extra work for each.

However, this is a lot of work to do and get right, so I think the sort/remove technique is fine.
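
A sketch of that bloom-assisted dedup, under the same stand-in of 64-bit keys for txids (the 12-bit slices approximate the "5 x 12 bits of entropy" step; since the keys are already hashes, slicing them directly is the assumed source of entropy):

    #include <cstddef>
    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    std::vector<uint64_t> DedupBloom(const std::vector<uint64_t>& items) {
        const std::size_t nbits = items.size() * 8 + 64;
        std::vector<bool> bits(nbits);             // allocation 1: the filter
        std::vector<uint64_t> out;
        out.reserve(items.size());                 // allocation 2: the output
        std::unordered_set<uint64_t> likely_dups;  // only touched on collisions
        for (uint64_t item : items) {
            bool all_set = true;
            for (int i = 0; i < 5; ++i) {
                // Take the i-th 12-bit slice and project it into [0, nbits)
                // via multiply-shift (the "projection trick").
                std::size_t slice = (item >> (12 * i)) & 0xFFF;
                std::size_t bit = (slice * nbits) >> 12;
                if (!bits[bit]) { all_set = false; bits[bit] = true; }
            }
            if (all_set) likely_dups.insert(item); // maybe a dup; resolve later
            else out.push_back(item);              // definitely a first sighting
        }
        // Step 3: drop genuine first-sightings from the likely-dup set ...
        for (uint64_t item : out) {
            if (likely_dups.empty()) break;
            likely_dups.erase(item);
        }
        // Step 4: ... and append the surviving false positives.
        out.insert(out.end(), likely_dups.begin(), likely_dups.end());
        return out;
    }

In the no-collision case likely_dups stays empty and the whole pass costs two allocations plus one filter probe per input.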

sdaftuar and others added 2 commits August 4, 2020 13:59

Rewrite parent txid loop of requested transactions

Previously, we would potentially add the same txid many times to the rolling
bloom filter of recently announced transactions to a peer, if many outputs of
the same txid appeared as inputs in a transaction. Eliminate this problem and
avoid redundant lookups by asking the mempool for the unique parents of a
requested transaction.

Deduplicate missing parents of orphan transactions

In the logic for requesting missing parents of orphan transactions, parent
transactions with multiple outputs being spent by the given orphan were being
processed multiple times. Fix this by deduplicating the set of missing parent
txids first.

Co-authored-by: Anthony Towns <aj@erisian.com.au>
@sdaftuar sdaftuar force-pushed the 2020-07-dedup-inputs branch from 842f908 to 4c0731f on August 4, 2020 17:59
@sdaftuar (Member, Author) commented Aug 4, 2020

Rebased and addressed @jnewbery's feedback.

@jonatack (Member) left a comment

ACK 4c0731f

@ajtowns (Contributor) left a comment

ACK 4c0731f

    for (const uint256& parent_txid : parent_ids_to_add) {
        // Relaying a transaction with a recent but unconfirmed parent.
        if (WITH_LOCK(pfrom.m_tx_relay->cs_tx_inventory, return !pfrom.m_tx_relay->filterInventoryKnown.contains(parent_txid))) {
            LOCK(cs_main);
@ajtowns (Contributor) commented:

It's probably fine in practice, and it's not a regression in this PR, but locking and unlocking cs_tx_inventory and (potentially) cs_main for each parent kind of bugs me each time I look at it, so I had a go at avoiding it. Came up with:

{
    LOCK(pfrom.m_tx_relay->cs_tx_inventory);
    auto is_known = [&](uint256& parent_txid) EXCLUSIVE_LOCKS_REQUIRED(pfrom.m_tx_relay->cs_tx_inventory) { return pfrom.m_tx_relay->filterInventoryKnown.contains(parent_txid); };
    parent_ids_to_add.erase(std::remove_if(parent_ids_to_add.begin(), parent_ids_to_add.end(), is_known), parent_ids_to_add.end());
}
if (!parent_ids_to_add.empty()) {
    LOCK(cs_main);
    for (const uint256& parent_txid : parent_ids_to_add) {
        // Relaying a transaction with a recent but unconfirmed parent.
        State(pfrom.GetId())->m_recently_announced_invs.insert(parent_txid);
    }
}

@jnewbery (Contributor) replied:

Calling State() to iterate over every peer for every input txid also seems wasteful, especially since we're already in a loop over all getdata items. I think better would be:

            if (!parent_ids_to_add.empty()) {
                LOCK(cs_main);
                CNodeState* node = State(pfrom.GetId());
                for (const uint256& parent_txid : parent_ids_to_add) {
                    // Relaying a transaction with a recent but unconfirmed parent.
                    node->m_recently_announced_invs.insert(parent_txid);
                }
            }

I also don't understand the benefit of adding the clang lock annotation to the lambda. It's only in scope here, so it's never going to be called from somewhere else without pfrom.m_tx_relay->cs_tx_inventory being held.

All of this locking/unlocking of multiple mutexes goes away once peer data is consolidated into a single Peer object.

@sdaftuar (Member, Author) replied:

I started to do this, but it doesn't exactly add to the readability of the code; can we punt this to some future refactor, especially if we think refactors already under way would change all this anyway?

In practice there shouldn't be more than 25 unconfirmed parents anyway, and I wouldn't expect much lock contention either.

A contributor replied:

Definitely seems fine to punt it to me

@ajtowns (Contributor) replied:

@jnewbery Accessing guarded data structures in a lambda means the lambda needs to be annotated with the required lock (or an AssertLockHeld or LockAssertion added, cf. #19668), or there will be compile-time thread safety errors. It doesn't matter how limited the lambda's scope is.

@laanwj (Member) commented Aug 10, 2020

Code review ACK 4c0731f

@laanwj laanwj merged commit 85fa648 into bitcoin:master Aug 10, 2020
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Aug 11, 2020
…ctions and missing parents of orphan transactions

4c0731f Deduplicate missing parents of orphan transactions (Suhas Daftuar)
8196176 Rewrite parent txid loop of requested transactions (Suhas Daftuar)

Pull request description:

  I noticed a couple of places recently where we loop over all inputs of a transaction in order to do some processing on the txids we find in those inputs.  There may be thousands of inputs in a transaction, and the same txid may appear many times.  In a couple of places in particular, we loop over those txids and add them to a rolling bloom filter; doing that multiple times for the same txid wastes entries in that filter.

  This PR fixes that in two places relating to transaction relay: one on the server side, where we look for parent transactions of a tx that we are delivering to a peer to ensure that getdata requests for those parents will succeed; and the other on the client side, where when we process an orphan tx we want to loop over the parent txids and ensure that all are eventually requested from the peer who provided the orphan.

  This addresses a couple of [related](bitcoin#19109 (comment)) [comments](bitcoin#19109 (comment)) left in bitcoin#19109.

ACKs for top commit:
  laanwj:
    Code review ACK 4c0731f
  jonatack:
    ACK 4c0731f
  ajtowns:
    ACK 4c0731f

Tree-SHA512: 8af9df7f56c6e54b5915519d7d5465e081473ceb1bcc89bbebf83e78722cf51ff58145e588cf57126bce17071a8053273f4bcef0ad8166bec83ba14352e40f5d
deadalnix pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Sep 7, 2021
Summary:
Previously, we would potentially add the same txid many times to the rolling
bloom filter of recently announced transactions to a peer, if many outputs of
the same txid appeared as inputs in a transaction. Eliminate this problem and
avoid redundant lookups by asking the mempool for the unique parents of a
requested transaction.

This is a backport of [[bitcoin/bitcoin#19596 | core#19596]] [1/2]
bitcoin/bitcoin@8196176

Test Plan: `ninja all check-all`

Reviewers: #bitcoin_abc, majcosta

Reviewed By: #bitcoin_abc, majcosta

Differential Revision: https://reviews.bitcoinabc.org/D10066
Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Sep 7, 2021
Summary:
In the logic for requesting missing parents of orphan transactions, parent
transactions with multiple outputs being spent by the given orphan were being
processed multiple times. Fix this by deduplicating the set of missing parent
txids first.

Co-authored-by: Anthony Towns <aj@erisian.com.au>

This is a backport of [[bitcoin/bitcoin#19596 | core#19596]] [2/2]
bitcoin/bitcoin@4c0731f

Depends on D10066

Test Plan: `ninja all check-all`

Reviewers: #bitcoin_abc, majcosta

Reviewed By: #bitcoin_abc, majcosta

Differential Revision: https://reviews.bitcoinabc.org/D10067
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022