
Conversation

@sdaftuar (Member) commented Jul 27, 2020

I noticed a couple of places recently where we loop over all inputs of a transaction in order to do some processing on the txids we find in those inputs. There may be thousands of inputs in a transaction, and the same txid may appear many times. In a couple of places in particular, we loop over those txids and add them to a rolling bloom filter; doing that multiple times for the same txid wastes entries in that filter.

This PR fixes that in two places relating to transaction relay: one on the server side, where we look for parent transactions of a tx that we are delivering to a peer to ensure that getdata requests for those parents will succeed; and the other on the client side, where when we process an orphan tx we want to loop over the parent txids and ensure that all are eventually requested from the peer who provided the orphan.

This addresses a couple of related comments left in #19109.
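
To see why the duplicates matter: CRollingBloomFilter (src/bloom.h) does not deduplicate inserts, so every insert of an already-present txid still consumes part of the filter's element budget. A minimal sketch of the problem, as a fragment rather than a full program (the constructor arguments here are illustrative, though they match the recently-announced-invs filter's sizing in this era of the code):

    // Each insert() counts against the filter's capacity, even for a key
    // that is already present; once capacity is exhausted, older entries
    // start being cycled out and can be forgotten prematurely.
    CRollingBloomFilter recently_announced{3500, 0.000001};
    for (const CTxIn& txin : tx.vin) {
        // A tx spending 1000 outputs of a single parent burns 1000 of the
        // 3500 slots on one distinct txid.
        recently_announced.insert(txin.prevout.hash);
    }

Deduplicating the txids before insertion means each distinct parent costs exactly one slot.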

@fanquake fanquake added the P2P label Jul 27, 2020
@fanquake fanquake requested review from ajtowns and sipa July 27, 2020 04:10
@JeremyRubin (Contributor) commented Jul 27, 2020 via email

@jonatack (Member) commented

Good ideas. There is the open question of whether sets are the right data structure here. The first commit (2898d65, "Rewrite parent txid loop of requested transactions") seems to break a number of functional tests for me locally. All the tests pass again after rebasing to keep only the second commit.

interface_rest.py                       | ✖ Failed  | 35 s
mempool_packages.py                     | ✖ Failed  | 3 s
p2p_segwit.py                           | ✖ Failed  | 23 s
rpc_rawtransaction.py                   | ✖ Failed  | 9 s
wallet_avoidreuse.py                    | ✖ Failed  | 6 s
wallet_avoidreuse.py --descriptors      | ✖ Failed  | 7 s
wallet_backup.py                        | ✖ Failed  | 9 s
wallet_balance.py                       | ✖ Failed  | 15 s
wallet_bumpfee.py                       | ✖ Failed  | 5 s
wallet_groups.py                        | ✖ Failed  | 12 s
wallet_listtransactions.py              | ✖ Failed  | 36 s

@DrahtBot (Contributor) commented Jul 27, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@ajtowns (Contributor) commented Jul 27, 2020

> It should be substantially faster and memory friendly for these big txns to fill the parents set as a vector of hashes, sort it, use std::unique to shift out the duplicates, and then erase garbage.

I think this is just:

--- a/src/net_processing.cpp
+++ b/src/net_processing.cpp
@@ -2990,10 +2990,12 @@ void ProcessMessage(
 
             // Deduplicate parent txids, so that we don't have to loop over
             // the same parent txid more than once down below.
-            std::set<uint256> unique_parents;
+            std::vector<uint256> unique_parents;
+            unique_parents.reserve(tx.vin.size());
             for (const CTxIn& txin : tx.vin) {
-                unique_parents.insert(txin.prevout.hash);
+                unique_parents.push_back(txin.prevout.hash);
             }
+            std::sort(unique_parents.begin(), unique_parents.end());
+            unique_parents.erase(std::unique(unique_parents.begin(), unique_parents.end()), unique_parents.end());
             for (const uint256& parent_txid : unique_parents) {
                 if (recentRejects->contains(parent_txid)) {
                     fRejectedParents = true;

I think I agree the vector/sort/unique approach is better -- temporarily allocating up to twice the space it takes to store the transaction seems fine, and allocating memory at once rather than for each input seems better; but not sure it should matter all that much.

Alternatively, could maybe do it as:

    for (const CTxIn& txin : tx.vin) {
        if (recentRejects->contains(txin.prevout.hash)) {
            fRejectedParents = true;
        }
    }
    if (!fRejectedParents && pfrom.m_tx_relay != nullptr) {
        LOCK(pfrom.m_tx_relay->cs_tx_inventory);
        for (const CTxIn& txin : tx.vin) {
            if (pfrom.m_tx_relay->filterInventoryKnown.contains(txin.prevout.hash)) continue;
            if (AlreadyHave(CInv(MSG_TX | nFetchFlags, txin.prevout.hash))) continue;
            pfrom.m_tx_relay->filterInventoryKnown.insert(txin.prevout.hash);
            RequestTx(txin.prevout.hash);
        }
    }

and rely on both recentRejects and filterInventoryKnown lookups being quick enough that duplicates don't matter, and use filterInventoryKnown to ensure the slow ops only get queried once per parent txid, rather than once per prevout.

(Travis reports a lock order violation, btw.)

@sdaftuar (Member, Author) commented

Thanks all for the quick review! @ajtowns I grabbed your code for using a vector instead of a set, and hopefully fixed the lock inversion problem.

@jonatack I'm surprised by the test failures; is it possible the lock inversion caused that?

@sipa (Member) commented Jul 28, 2020

Code review ACK 842f908

I don't understand where the locking failure is coming from. Is it possibly fixed now?

@sdaftuar (Member, Author) commented

> I don't understand where the locking failure is coming from. Is it possibly fixed now?

@sipa Yes it should be fixed now; when I opened the PR initially, I wasn't letting go of the mempool lock before acquiring the others in the first commit.

@jonatack (Member) left a comment

> @jonatack I'm surprised by the test failures; is it possible the lock inversion caused that?

Yes, I believe so; the local test failures (debug build, gcc Debian 10.1.0-6) stemming from the first commit were similar to the Travis one.

ACK 842f908: code review (the changes are straightforward; verified, among other things, mempool.info vs mempool.GetMemPoolParents()), debug build, tests, and ran bitcoind for a day; operation nominal.

@jnewbery (Contributor) left a comment

Looks good. A few comments inline.

> std::set will do O(n) allocations and use about 3x the memory compared to std::vector

@JeremyRubin how did you get 3x? My understanding is that overhead for a std::set node is ~32 bytes, which would imply double the memory usage of a vector in this case.

@sipa (Member) commented Jul 30, 2020

@jnewbery Overhead of a set over a vector is the size of 3 pointers and an int (for red black tree structure), plus 2 pointers (for malloc overhead), and that rounded up to the next multiple of 2 pointers' worth.

For 64-bit systems that means between 48 and 63 bytes overhead.
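
Worked out (following the accounting in the sentence above, for a 64-bit system):

    3 * 8 bytes  (parent/left/right pointers)
      + 8 bytes  (color int, padded to pointer size)
     + 16 bytes  (malloc chunk header)
     = 48 bytes  fixed overhead per node

Rounding the whole allocation up to the next 16-byte boundary then adds anywhere from 0 to 15 bytes more, giving the 48-63 range.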

@jnewbery (Contributor) commented

Thanks @sipa!

@JeremyRubin (Contributor) commented

In case that's not clear for anyone reading along (I frequently have to re-check this logic myself): the implementation we assume for std::set is a red-black tree, with a node layout of:

    template <typename T>
    struct node {
        int color;
        void* parent;
        void* left;
        void* right;
        T element;
    };

This sums to 8 bytes * 4 (the int pads out to 8 bytes) = 32 bytes of overhead per node, with no extra alignment needed for T (unless there's some external alignment requirement > 32). But in std::set, every node is individually allocated with malloc (as opposed to using an arena, which would reduce overhead). Malloc typically requires an extra 16 bytes of metadata per allocation (it can be higher or lower depending on the platform, and this doesn't count whatever other internal bookkeeping the heap may require), and then pads the allocation out to a 16-byte boundary. If you look at MallocUsage, it's ((s + 31) >> 4) << 4: the 31 is really +16 (for malloc metadata) plus +15, with the >> 4 << 4 then truncating to a 16-byte boundary, so the net effect is rounding the allocation up.

This means that for a 32-byte hash, you need an estimated 80 bytes per element in a std::set (so 2.5x, a bit better than 3x, but still excessive). The main issue IMO is less the memory and more the calls to malloc itself. E.g., even if a vector used 2x memory and std::set used 1x, it might still be preferable to use the vector to avoid N calls to malloc.
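
To make the arithmetic concrete, a minimal sketch (assumptions: 64-bit pointers and the rounding formula quoted above; the node layout mirrors the struct sketch, not any particular standard library):

    #include <cstddef>
    #include <cstdio>

    // +16 bytes of malloc metadata, then round up to a 16-byte boundary,
    // mirroring the ((s + 31) >> 4) << 4 formula discussed above.
    std::size_t MallocUsage(std::size_t s) { return ((s + 31) >> 4) << 4; }

    int main() {
        const std::size_t element = 32;          // a 32-byte txid
        const std::size_t rb_node = 3 * 8 + 8;   // parent/left/right + padded color
        std::printf("set node: %zu bytes\n", MallocUsage(element + rb_node)); // 80
        std::printf("vector:   %zu bytes\n", element);                        // 32
        return 0;
    }

Running it reproduces the 80-vs-32 (2.5x) figure above.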

@jnewbery (Contributor) commented

Thanks @JeremyRubin! Very interesting.

@luke-jr (Member) commented Jul 31, 2020

Has unordered_set been evaluated?

@JeremyRubin (Contributor) commented

unordered_set... ah, the promise of O(1) accesses. I don't think it makes sense here, since N is small. Here's my overall thinking:

This piece of code isn't exactly adversarial. Whatever algorithm we pick -- sorting or hash tabling -- is going to be good enough. This is mostly about picking something with a good average runtime and an acceptable worst-case runtime, and preferably low memory on average.

For unordered_set we use about 72 bytes per entry (32 bytes plus 1 pointer, padded out to 64 bytes, and then 8 more for the buckets), and still need to do O(n) allocations (unless we use a custom arena allocator, which would save that). In exchange we get an implementation that is O(N) instead of O(N log N). However, the largest N is about 100 kB / 40 B (largest standard tx divided by the size of an input), so we're looking at about 2,500 inputs, and log2(2500) is about 11. And if you think about sorting a vector of byte arrays, that should be pretty performant compared to all the pointer chasing, hashing, and allocating a hash table makes you do. In the common case of a small number of inputs, the log N matters even less.

So I think a hash set, despite its better big O, is going to be worse on average and not much better in the worst case, if any better at all.
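
For reference, a self-contained sketch of the two strategies being compared (illustrative only; plain 64-bit keys stand in for 32-byte txids so no Bitcoin Core headers or salted hashers are needed):

    #include <algorithm>
    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    // The approach taken in this PR: O(N log N) comparisons, one contiguous
    // allocation for the copy, cache-friendly scans.
    std::vector<uint64_t> DedupSort(std::vector<uint64_t> v) {
        std::sort(v.begin(), v.end());
        v.erase(std::unique(v.begin(), v.end()), v.end());
        return v;
    }

    // The hash-set alternative: O(N) expected, but one node allocation per
    // distinct key plus the bucket array, and order is not preserved.
    std::vector<uint64_t> DedupHash(const std::vector<uint64_t>& v) {
        std::unordered_set<uint64_t> seen(v.begin(), v.end());
        return {seen.begin(), seen.end()};
    }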

If you did want to make a simple, competitive algorithm for this based on hash sets, here's what I would do (a sketch follows below):

  1. allocate a bit vector with 8 bits per input, plus a std::vector with room for N elements
  2. get 5 x 12 bits of entropy for each entry, and check-and-set the corresponding bit in the bit vector for each slice, using the projection trick.
    a. If all 5 bits are already set, insert the item into a std::set or std::unordered_set named likely_dups.
    b. Otherwise, push the element onto the vector.
  3. iterate over the vector, removing each of its elements from the set (quit early if the set becomes empty)
  4. extend the end of the vector with whatever is left in the set

This algorithm is correct because any item that finds at least one unset bit cannot have been seen before, so it is unique. Thus by step 3 the vector is duplicate-free, and the std::unordered_set holds all the likely non-unique entries (but is also, conveniently, duplicate-free itself). We then remove from the set every element that appears in the vector, guaranteeing the set holds only what's not in the vector. Appending those to the vector leaves it containing exactly the unique elements of the original input.

This works really nicely in the expected case (no collisions) because we make only 2 malloc calls and emplace everything into our vector after checking the filter (which should have a low probability of collision), so it should be pretty speedy! And when there are many duplicates, the likely_dups set stays small precisely because duplicates collapse into single entries. The worst case is when every element has exactly one duplicate; then we end up doing extra work for each.

However, this is a lot of work to do and get right, so I think the sort/remove technique is fine.
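
A sketch of that bloom-assisted dedup, under the same stand-in of 64-bit keys for txids (the 12-bit slices approximate the "5 x 12 bits of entropy" step; since the keys are already hashes, slicing them directly is the assumed source of entropy):

    #include <cstddef>
    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    std::vector<uint64_t> DedupBloom(const std::vector<uint64_t>& items) {
        const std::size_t nbits = items.size() * 8 + 64;
        std::vector<bool> bits(nbits);             // allocation 1: the filter
        std::vector<uint64_t> out;
        out.reserve(items.size());                 // allocation 2: the output
        std::unordered_set<uint64_t> likely_dups;  // only touched on collisions
        for (uint64_t item : items) {
            bool all_set = true;
            for (int i = 0; i < 5; ++i) {
                // Take the i-th 12-bit slice and project it into [0, nbits)
                // via multiply-shift (the "projection trick").
                std::size_t slice = (item >> (12 * i)) & 0xFFF;
                std::size_t bit = (slice * nbits) >> 12;
                if (!bits[bit]) { all_set = false; bits[bit] = true; }
            }
            if (all_set) likely_dups.insert(item); // maybe a dup; resolve later
            else out.push_back(item);              // definitely a first sighting
        }
        // Step 3: drop genuine first-sightings from the likely-dup set ...
        for (uint64_t item : out) {
            if (likely_dups.empty()) break;
            likely_dups.erase(item);
        }
        // Step 4: ... and append the surviving false positives.
        out.insert(out.end(), likely_dups.begin(), likely_dups.end());
        return out;
    }

In the no-collision case likely_dups stays empty and the whole pass costs two allocations plus one filter probe per input.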

sdaftuar and others added 2 commits August 4, 2020 13:59

Rewrite parent txid loop of requested transactions

Previously, we would potentially add the same txid many times to the rolling
bloom filter of recently announced transactions to a peer, if many outputs of
the same txid appeared as inputs in a transaction. Eliminate this problem and
avoid redundant lookups by asking the mempool for the unique parents of a
requested transaction.

Deduplicate missing parents of orphan transactions

In the logic for requesting missing parents of orphan transactions, parent
transactions with multiple outputs being spent by the given orphan were being
processed multiple times. Fix this by deduplicating the set of missing parent
txids first.

Co-authored-by: Anthony Towns <aj@erisian.com.au>
@sdaftuar sdaftuar force-pushed the 2020-07-dedup-inputs branch from 842f908 to 4c0731f on August 4, 2020 17:59
@sdaftuar (Member, Author) commented Aug 4, 2020

Rebased and addressed @jnewbery's feedback.

@jonatack (Member) left a comment

ACK 4c0731f

@ajtowns (Contributor) left a comment

ACK 4c0731f

    for (const uint256& parent_txid : parent_ids_to_add) {
        // Relaying a transaction with a recent but unconfirmed parent.
        if (WITH_LOCK(pfrom.m_tx_relay->cs_tx_inventory, return !pfrom.m_tx_relay->filterInventoryKnown.contains(parent_txid))) {
            LOCK(cs_main);
@ajtowns (Contributor) commented:

It's probably fine in practice, and it's not a regression in this PR, but locking and unlocking cs_tx_inventory and (potentially) cs_main for each parent kind of bugs me each time I look at it, so I had a go at avoiding it. Came up with:

{
    LOCK(pfrom.m_tx_relay->cs_tx_inventory);
    auto is_known = [&](uint256& parent_txid) EXCLUSIVE_LOCKS_REQUIRED(pfrom.m_tx_relay->cs_tx_inventory) { return pfrom.m_tx_relay->filterInventoryKnown.contains(parent_txid); };
    parent_ids_to_add.erase(std::remove_if(parent_ids_to_add.begin(), parent_ids_to_add.end(), is_known), parent_ids_to_add.end());
}
if (!parent_ids_to_add.empty()) {
    LOCK(cs_main);
    for (const uint256& parent_txid : parent_ids_to_add) {
        // Relaying a transaction with a recent but unconfirmed parent.
        State(pfrom.GetId())->m_recently_announced_invs.insert(parent_txid);
    }
}

@jnewbery (Contributor) replied:

Calling State() to iterate over every peer for every input txid also seems wasteful, especially since we're already in a loop over all getdata items. I think better would be:

            if (!parent_ids_to_add.empty()) {
                LOCK(cs_main);
                CNodeState* node = State(pfrom.GetId());
                for (const uint256& parent_txid : parent_ids_to_add) {
                    // Relaying a transaction with a recent but unconfirmed parent.
                    node->m_recently_announced_invs.insert(parent_txid);
                }
            }

I also don't understand the benefit of adding the clang lock annotation to the lambda. It's only in scope here, so it's never going to be called from somewhere else without pfrom.m_tx_relay->cs_tx_inventory being held.

All of this locking/unlocking of multiple mutexes goes away once peer data is consolidated into a single Peer object.

@sdaftuar (Member, Author) replied:

I started to do this, but it doesn't exactly add to the readability of the code; can we punt this to some future refactor, especially if we think refactors already under way would change all this anyway?

In practice there shouldn't be more than 25 unconfirmed parents anyway, and I wouldn't expect much lock contention either.

A contributor replied:

Definitely seems fine to punt it to me

@ajtowns (Contributor) replied:

@jnewbery Accessing guarded data structures in a lambda means the lambda needs to be annotated with the required lock (or an AssertLockHeld or LockAssertion added, cf. #19668), or there will be compile-time thread safety errors. It doesn't matter how limited the lambda's scope is.

@laanwj (Member) commented Aug 10, 2020

Code review ACK 4c0731f

@laanwj laanwj merged commit 85fa648 into bitcoin:master Aug 10, 2020
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Aug 11, 2020
…ctions and missing parents of orphan transactions

4c0731f Deduplicate missing parents of orphan transactions (Suhas Daftuar)
8196176 Rewrite parent txid loop of requested transactions (Suhas Daftuar)

Pull request description:

  I noticed a couple of places recently where we loop over all inputs of a transaction in order to do some processing on the txids we find in those inputs.  There may be thousands of inputs in a transaction, and the same txid may appear many times.  In a couple of places in particular, we loop over those txids and add them to a rolling bloom filter; doing that multiple times for the same txid wastes entries in that filter.

  This PR fixes that in two places relating to transaction relay: one on the server side, where we look for parent transactions of a tx that we are delivering to a peer to ensure that getdata requests for those parents will succeed; and the other on the client side, where when we process an orphan tx we want to loop over the parent txids and ensure that all are eventually requested from the peer who provided the orphan.

  This addresses a couple of [related](bitcoin#19109 (comment)) [comments](bitcoin#19109 (comment)) left in bitcoin#19109.

ACKs for top commit:
  laanwj:
    Code review ACK 4c0731f
  jonatack:
    ACK 4c0731f
  ajtowns:
    ACK 4c0731f

Tree-SHA512: 8af9df7f56c6e54b5915519d7d5465e081473ceb1bcc89bbebf83e78722cf51ff58145e588cf57126bce17071a8053273f4bcef0ad8166bec83ba14352e40f5d
deadalnix pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Sep 7, 2021
Summary:
Previously, we would potentially add the same txid many times to the rolling
bloom filter of recently announced transactions to a peer, if many outputs of
the same txid appeared as inputs in a transaction. Eliminate this problem and
avoid redundant lookups by asking the mempool for the unique parents of a
requested transaction.

This is a backport of [[bitcoin/bitcoin#19596 | core#19596]] [1/2]
bitcoin/bitcoin@8196176

Test Plan: `ninja all check-all`

Reviewers: #bitcoin_abc, majcosta

Reviewed By: #bitcoin_abc, majcosta

Differential Revision: https://reviews.bitcoinabc.org/D10066
Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Sep 7, 2021
Summary:
In the logic for requesting missing parents of orphan transactions, parent
transactions with multiple outputs being spent by the given orphan were being
processed multiple times. Fix this by deduplicating the set of missing parent
txids first.

Co-authored-by: Anthony Towns <aj@erisian.com.au>

This is a backport of [[bitcoin/bitcoin#19596 | core#19596]] [2/2]
bitcoin/bitcoin@4c0731f

Depends on D10066

Test Plan: `ninja all check-all`

Reviewers: #bitcoin_abc, majcosta

Reviewed By: #bitcoin_abc, majcosta

Differential Revision: https://reviews.bitcoinabc.org/D10067
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022