-
Notifications
You must be signed in to change notification settings - Fork 37.7k
Protect localhost and block-relay-only peers from eviction #19670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This should fix #19500, but I need to figure out how to test this... |
75533f2
to
ce35ffe
Compare
ce35ffe
to
e5af0d1
Compare
Concept ACK (modulo proper testing of course :)) |
Concept ACK. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK on the onion peers protection. We might do the block-relay-only peer protection a bit better.
@@ -903,12 +918,31 @@ bool CConnman::AttemptToEvictConnection() | |||
// Protect 4 nodes that most recently sent us transactions. | |||
// An attacker cannot manipulate this metric without performing useful work. | |||
EraseLastKElements(vEvictionCandidates, CompareNodeTXTime, 4); | |||
// Protect up to 8 non-tx-relay peers that have sent us blocks. | |||
std::sort(vEvictionCandidates.begin(), vEvictionCandidates.end(), CompareNodeBlockRelayOnlyTime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nLastBlockTime
is updated at ProcessNewBlock
return if the process block is new to us and connecting on our best chain, among other yardsticks. An attacker-controlled inbound connection by always being the fastest to announce block to us can trump other honest block-relay-only inbound to prevent update of their nLastBlockTime
. I think assuming a malicious peer can be always the fastest is reasonable, just connect to Fibre and bypass some validation checks. By rotating fastest block announcement among few of its inbound connections, a malicious attacker seems to win over this protection.
Have you consider instead selecting randomly among all our block-relay-only peers 8 of them ? I think that would make this protection better as an attacker can't predict its block-relay-only inbound will be favored. Because if an attacker can forge the fast-block-announcement characteristic as protected by the following protection ("Protect 4 nodes that most recently sent us blocks"), it's not hard to also forge this new one as signaling as block-relay-only is cheap.
Correct me I I'm wrong about ProcessNewBlock
and nLastBlockTime
updates, this part is fairly complex with all the anti-DoS/trickery measures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right about the code and behavior -- an adversary that is trying to be well-behaved by being the first to deliver blocks to us, from 8 different block-relay peers, can indeed take over this protection -- but randomly choosing among block-relay-only peers is not helpful protection. If we choose randomly each time, an attacker can game that by merely spamming us with inbound connections until any given peer that had protection in a prior round has lots its protection in this round, and might get evicted.
Note also that it is free to pretend to be a blocks-only full-node peer; you merely send the right version message to us on connection. It is more work, and useful work, to be the first peer to deliver us a new block. So randomly protecting inbound peers who turn off tx-relay is not a useful thing to do, even if your random selection were on a longer timer to try to avoid the spamming problem I described above.
One way to think about the protections here is: if we can pick out a set of peers that may be useful to us, protect them. Because the alternative to protection is that we may lose peers that are demonstrably useful just because an attacker is spamming us. If we make a mistake and include peers that happen to also be adversaries because they're acting like good peers, that's no worse than if we didn't add the protection at all; but if we can meaningfully increase the cost to an adversary of bypassing the protection, then we've accomplished something.
So maybe a poor criteria would be one where an attacker does useful work for us once and then can misbehave but still take over most or all of our protected spots. I don't think that's what is happening here (arguably, earliest connection time has the risk of an attacker being able to take it over, eventually -- but the first sort is on most-recently-gave-us-a-block, so an attacker would have to be able to keep doing this to ensure that other peers don't gain an advantage).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So maybe a poor criteria would be one where an attacker does useful work for us once and then can misbehave but still take over most or all of our protected spots. I don't think that's what is happening here (arguably, earliest connection time has the risk of an attacker being able to take it over, eventually -- but the first sort is on most-recently-gave-us-a-block, so an attacker would have to be able to keep doing this to ensure that other peers don't gain an advantage).
I do worry that this might be the case, however, because block-relay-only peers may not be the first to deliver us blocks very often (it's possible that due to compact block relay, there are biases in favor of compact block reconstruction happening best on links that are also relaying transactions -- I am not sure). In which case, the attack you describe of an adversary connecting 8 times, delivering the next 8 blocks from each one, and then being silent, may well work for a while to cause only their links to be protected, because the honest peers have a tough time beating other peers.
We can try to get some anecdotal data on this by looking at our own existing nodes to see if apparent block-relay peers are being selected as HB compact block peers? @gmaxwell any thoughts here?
EDIT: I guess my overall thought is that even if this is somewhat gameable, adding this won't hurt and may help. If anyone comes up with something better we can always improve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, after writing this, I realized thinking further that my counter-proposal wasn't worthy it. As you hinted, analyzing eviction as a serie of interdependent events means sooner or latter, an attacker still has a high-chance of success.
If we make a mistake and include peers that happen to also be adversaries because they're acting like good peers, that's no worse than if we didn't add the protection at all; but if we can meaningfully increase the cost to an adversary of bypassing the protection, then we've accomplished something.
Thanks for the explanation. I agree that the current protection of fast-delivery-block peers is worth it and compel an attacker to position its nodes well with regard to block-topology and propagation, even if potentially gameable. Where I was more doubtful was on the value of adding a second protection relying on an-already used assumption. Even I was presuming that an attacker, by forgetting validation because it's a zero-cost accepting invalid blocks, will always trump block announcement, independently of being in concurrence with compact block peers.
I think there is the interesting property with block-relay-only peers that it's far far harder to guess their pairing, as you remember back in commit. Whereas our protected-by-netgroup peers (1st heuristic) might be full-relay ones, and thus an attacker might be able to guess their IP prefix by tx-relay topology probing.
Block-relay-only peers don't have this weakness so it could be more worthy to combine them with the netgroup critera rather than with the fast-block-delivery one ?
Overall, if we can't come up with better ideas, you right let's move further and try to address this kind of issues in future works on network anomalies detection/anti-eclipse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ariaid There are, unfortunately, a lot of "nodes" that just connect and don't do anything useful. Reserving slots for zombies that don't even once do something useful isn't very helpful. Unfortunately the software doesn't really have that much criteria for deciding usefulness and most things you could imagine could be easily faked. One advantage of favouring many peers that have recently sent blocks is that even if the leader is someone who's only making themselves first for a little while before misbehaving there are still many other slots. (Also, aside, the public fibre network is gone and the software is now unmaintained)
Keep in mind that blocks-only peers are currently disadvantaged because they're ineligible for the CompareNodeTXTime criteria-- yet they are important: because they're difficult for an attacker to map they are intended form an important part of the network's protection against an attacker carving partitions into the network via DOS attacks. It would be extremely unfortunate to have that protection get undermined by a simple connection flooding attack that gets most of the blocks-only peers evicted.
You can see the recent blocks protection for blocks-only peers being morally similar to reserving part of the longest uptime slots for localhost/onion connections. In this case actually 'reserving' doesn't make sense because the number protected for that reason is relatively few. Basically there are now two reasons for a group to exist: Spreading connections across orthogonal kinds of being good under the theory that its harder for an attacker to be good in many ways, and spreading connections across different kinds of (useful) peers.
One could even imagine a generalization where you have a collection of different kinds of peers {ipv4, ipv6, onion, local-network, blocks-only, accepts-inbound-according-to-the-feeler-test} and different kinds of goodness {latency, block relay, tx relay, netgroups, uptime} and then you form a cross-product of the two sets and keep the best in each bucket--- but currently there just aren't that many slots to fight over.
If it were changed to netgroup based, I'd still recommend nLastBlockTime!=0 and fRelevantServices as sorting criteria ahead of a random netgroup preference.
At one point I wanted to add a generic "helpful" flag that was based on some a set of criteria that judged if the peer was probably capable of serving us a novel block-- do they have the right services? if not-- not helpful. have they recently given us a block? helpful. have they recently given us a new transaction we accepted? helpful. Have they been recently announcing blocks we eventually accepted? helpful. Otherwise not helpful. ... stuff like that-- so essentially all working peers should be flagged as helpful.
And then have all the eviction criteria (except perhaps the longest uptime) use the helpful flag as the first sorting key, so that reserved slots won't be taken by stuff that is very obviously broken. Attackers can feign helpfulness-- sure, but some won't bother. And weirdly broken nodes (some of which are just incompetent attackers) are common and don't feign anything.
I just found it to be too difficult to meet people's expectations for testing for this kind of behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gmaxwell thanks for your explanation, I would push your model a bit further, beyond types of peers/goodness, you can classify goodness criterias between themselves, with regards to hardness assumptions on attackers. Some are good-behavioral monitoring (block-delivery, stable connections, ...), other costly-topological-identifier (asmap, net-address), other non-observable ones (netgroup sort, you could imagine peer nonce sort).
It needs further research but I fear good-behavioral ones are on the low scale of hardness, because once you have an attacker infrastructure deployed it sounds a marginal cost to fake good-behavioral ones (e.g learn the block first, give it to all your node puppets). These types of goodness are the ones we use the most, I don't deny their usefulness but I lean to think about them more like sanitization measures than inbound slot overtakes protections. By pruning out zombies nodes like you underscore, they optimize your resources, in the sense for a given bucket, increase potential utility.
With regards to switching to a pro-active logic, like the "helpful" described, it defavor low-grade full-nodes even when they are protocol compliant. I presume that's already an effect of our eviction heuristics but I would be cautious going further and such breaking some network-wise diversity of peers. Obviously pushing for more efficiency-based pairing isn't great for robustness.
Stricter slot allocation, especially based on netgroup might have the side-effect to facilitate some net-splits, like if you allocate X slots based on netgroup ABC, an attacker, once deployed in this netgroup can exhaust the quota towards every listening peers. I guess peers netaddresses aren't evenly distributed across netgroup so you might be able to drop a fair chunk of the network.
Note, we do addr-relay for blocks-only peers, not for block-relay-only. So it might be easier for an attacker to map them than expected. I don't think we want to severe addr-relay for blocks-only as otherwise it would be harder (impossible?) for them to learn about new addresses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other non-observable ones (netgroup sort, you could imagine peer nonce sort).
Not sure if it was clear, but the motivation for netgroup diverse peers was also along the lines of trying to restore topology diversity.
Many criteria-- latency, block relay, and tx relay all somewhat favor peers that are geographically close. So you don't want a situation where most nodes are just incestuously linked within the same geography, resulting in a small min-cut to other continents. It's easy to imaging the network becoming that way over time when you factor in the fact that further links crossing more networks are a little less reliable than shorter ones.
I fear good-behavioral ones are on the low scale of hardness, because once you have an attacker infrastructure deployed it sounds a marginal cost to fake good-behavioral ones
Sure, though if they stop being good, they'll quickly lose the advantage it gave them. ... assuming they haven't already successfully eclipsed you entirely. If they keep being good then hurrah, you won.
, once deployed in this netgroup can exhaust the quota towards every listening peers
Yes, though the preference isn't directly observable.
presume that's already an effect of our eviction heuristics but I would be cautious going further and such breaking some network-wise diversity of peers
I really disagree. It's a reason to be not overly specific in what constitutes helpful, but there are a pretty obvious set of behaviours (like relaying us blocks) which if a peer doesn't do, it isn't helping us stay non-partitioned from the honest network. A peer is free to not to helpful stuff, they wouldn't be disconnected directly for it-- but for criteria that is expressly intended to protect the nodes' connectivity, it can make sense to filter it.
I don't even think it would be completely absurd to reserve some connection capacity for peers which claim to be exactly the same protocol & subversion as you. Sure, a malicious party could spoof it, always matching it-- but there have been several incidents in the past where we were concerned about fork software accidentally engaging in a partitioning attack: Say Mallory release a fork of the software that on some date will change its consensus rules to require every block sends half its coinbase to her, Mallory isn't malicious, just stupid. She suckers a fair number of people into running this fork. When the time comes, your node is, by chance, totally surrounded by BitcoinMallory nodes. No one mines any Mallory blocks (or, if they do-- the Mallory node graph may be partitioned and not get them to any of your neighbours) and the Mallory peers block the 'invalid' valid blocks from reaching you. Without any blocks showing up you can't tell the Mallory nodes consensus rules differ. The stale tip detection is a last ditch hail marry to recover from this state, but it may not work well when you need it (e.g. due to honest nodes being full or attacked, or just taking a long time to find one that's up and not a Mallory node). Wouldn't it have been better if a couple of your connections were running the same software as you-- so that if any of them were connected to the graph of healthy nodes you would be too?
It's critical to keep in mind that broken software radically outnumbers attackers at almost all times. Esp because if countermeasures against attackers are generally effective then there won't be attackers. Also, if merely being a "broken" peer is enough to harm the network
Note, we do addr-relay for blocks-only peers,
It's a choice of the peer if they want to announce themselves through us. You make a great point that a peer might be wise to suppress its announcements toward and from its outbound block-only connections because those connections exist in part to help obscure the node connectivity graph. I thought they already did this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering how easy it is to be the first to announce a block and then launch a rotation right after, I'm not very positive abound this solution.
I acknowledge that finding these criteria is hard though.
I would suggest "take all no-tx nodes announced a block within a day, and save the most long-living connections among them".
It would still protect from completely fake nodes (no blocks at all), and impose a stronger attack cost I'd say (in terms of connection lifetime). Another benefit is slower-but-persistent honest blocks-only peers won't be evicted.
Not a strong opinion, I'm not a big fan of long-lifetime bias either, it just feels a bit stronger for now.
For now gonna ACK the existing code.
e5af0d1
to
ec4ee3f
Compare
@@ -887,7 +902,7 @@ bool CConnman::AttemptToEvictConnection() | |||
node->nLastBlockTime, node->nLastTXTime, | |||
HasAllDesirableServiceFlags(node->nServices), | |||
peer_relay_txes, peer_filter_not_null, node->addr, node->nKeyedNetGroup, | |||
node->m_prefer_evict}; | |||
node->m_prefer_evict, node->addr.IsLocal()}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discovered that doing IsLocal(node->addr)
is different from node->addr.IsLocal()
, and the latter appears to be correct, but if someone can confirm that this captures the right idea I'd appreciate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat confusingly:
node->addr.IsLocal()
returns true only ifnode->addr
is IPv[46] and is in the loopback address rangesIsLocal(node->addr)
checks whether or notnode->addr
resides inmapLocalHost
, which tracks our view of addresses we might be referred to externally
So in this case, I believe your use of node->addr.IsLocal()
is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. addr.IsLocal() is right. IsLocal() ought to be renamed to IsProbablyMyAddress. :)
Updated to fix two bugs I discovered, and I did some light manual testing and verified that this seems to now do what is intended. |
@sdaftuar I've boldly copied all the changes to |
ACK ec4ee3f |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review ACK ec4ee3f.
@@ -845,6 +852,14 @@ static bool CompareNodeTXTime(const NodeEvictionCandidate &a, const NodeEviction | |||
return a.nTimeConnected > b.nTimeConnected; | |||
} | |||
|
|||
// Pick out the potential block-relay only peers, and sort them by last block time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you have to modify, you might change to non-tx-relay peers/CompareNonTxRelayPeers
as comment above the sort is referring to.
vEvictionCandidates.erase(std::remove_if(vEvictionCandidates.end() - local_erase_size, vEvictionCandidates.end(), [](NodeEvictionCandidate const &n) { return n.m_is_local; }), vEvictionCandidates.end()); | ||
// Calculate how many we removed, and update our total number of peers that | ||
// we want to protect based on uptime accordingly. | ||
total_protect_size -= initial_size - vEvictionCandidates.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can write it as size_t total_protect_size = initial_size / 2 - (initial_size - vEvictionCandidates.size())
and thus save one line above. I found this equation expressing better that up to 1/2 remaining peers are protected in function of effectively protected localhost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the way I wrote it makes sense, so not inclined to change this (seems like mostly style and not actually any more or less confusing?).
Anything left here? From #19500 it seems this fixes the user's issue. |
ACK ec4ee3f |
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsNo conflicts as of last run. |
Onion peers are disadvantaged under our eviction criteria, so prevent eventual eviction of them in the presence of contention for inbound slots by reserving some slots for localhost peers (sorted by longest uptime). Block-relay-only connections exist as a protection against eclipse attacks, by creating a path for block propagation that may be unknown to adversaries. Protect against inbound peer connection slot attacks from disconnecting such peers by attempting to protect up to 8 peers that are not relaying transactions but appear to be full-nodes, sorted by recency of last delivered block. Thanks to gmaxwell for suggesting these strategies.
ec4ee3f
to
752e6ad
Compare
I think this strategy makes sense. |
…m eviction 752e6ad Protect localhost and block-relay-only peers from eviction (Suhas Daftuar) Pull request description: Onion peers are disadvantaged under our eviction criteria, so prevent eventual eviction of them in the presence of contention for inbound slots by reserving some slots for localhost peers (sorted by longest uptime). Block-relay-only connections exist as a protection against eclipse attacks, by creating a path for block propagation that may be unknown to adversaries. Protect against inbound peer connection slot attacks from disconnecting such peers by attempting to protect up to 8 peers that are not relaying transactions but have provided us with blocks. Thanks to gmaxwell for suggesting these strategies. ACKs for top commit: laanwj: Code review ACK 752e6ad Tree-SHA512: dbf089c77c1f747aa1dbbbc2e9c2799c628028b0918d0c336d8d0e5338acedd573b530eb3b689c7f603a17221e557268a9f5c3f585f204bfb12e5d2e76de39a3
…eviction protection test coverage 0cca08a Add unit test coverage for our onion peer eviction protection (Jon Atack) caa21f5 Protect onion+localhost peers in ProtectEvictionCandidatesByRatio() (Jon Atack) 8f1a53e Use EraseLastKElements() throughout SelectNodeToEvict() (Jon Atack) 8b1e156 Add m_inbound_onion to AttemptToEvictConnection() (Jon Atack) 72e30e8 Add unit tests for ProtectEvictionCandidatesByRatio() (Jon Atack) ca63b53 Use std::unordered_set instead of std::vector in IsEvicted() (Jon Atack) 41f84d5 Move peer eviction tests to a separate test file (Jon Atack) f126cbd Extract ProtectEvictionCandidatesByRatio from SelectNodeToEvict (Jon Atack) Pull request description: Now that #19991 and #20210 have been merged, we can determine inbound onion peers using `CNode::m_inbound_onion` and add it to the localhost peers protection in `AttemptToEvictConnection`, which was added in #19670 to address issue #19500. Update 28 February 2021: I've updated this to follow gmaxwell's suggestion in #20197 (comment). This branch now protects up to 1/4 onion peers (connected via our tor control service), if any, sorted by longest uptime. If any (or all) onion slots remain after that operation, they are then allocated to protect localhost peers, or a minimum of 2 localhost peers in the case that no onion slots remain and 2 or more onion peers were protected, sorted as before by longest uptime. This patch also adds test coverage for the longest uptime, localhost, and onion peer eviction protection logic to build on the welcome initial unit testing of #20477. Suggest reviewing the commits that move code with `colorMoved = dimmed-zebra` and `colorMovedWs = allow-indentation-change`. Closes #11537. ACKs for top commit: laanwj: Code review ACK 0cca08a vasild: ACK 0cca08a Tree-SHA512: 2f5a63f942acaae7882920fc61f0185dcd51da85e5b736df9d1fc72343726dd17da740e02f30fa5dc5eb3b2d8345707aed96031bec143d48a2497a610aa19abd
Summary: ``` Onion peers are disadvantaged under our eviction criteria, so prevent eventual eviction of them in the presence of contention for inbound slots by reserving some slots for localhost peers (sorted by longest uptime). Block-relay-only connections exist as a protection against eclipse attacks, by creating a path for block propagation that may be unknown to adversaries. Protect against inbound peer connection slot attacks from disconnecting such peers by attempting to protect up to 8 peers that are not relaying transactions but appear to be full-nodes, sorted by recency of last delivered block. ``` Backport of [[bitcoin/bitcoin#19670 | core#19670]]. Note that tests will be added when porting this PR: bitcoin/bitcoin#20477 Ref T1611. Test Plan: ninja all check-all Reviewers: #bitcoin_abc, PiRK Reviewed By: #bitcoin_abc, PiRK Maniphest Tasks: T1611 Differential Revision: https://reviews.bitcoinabc.org/D9676
Onion peers are disadvantaged under our eviction criteria, so prevent eventual
eviction of them in the presence of contention for inbound slots by reserving
some slots for localhost peers (sorted by longest uptime).
Block-relay-only connections exist as a protection against eclipse attacks, by
creating a path for block propagation that may be unknown to adversaries.
Protect against inbound peer connection slot attacks from disconnecting such
peers by attempting to protect up to 8 peers that are not relaying transactions
but have provided us with blocks.
Thanks to gmaxwell for suggesting these strategies.