-
Notifications
You must be signed in to change notification settings - Fork 37.7k
Immediately disconnect on invalid net message checksum #15206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
Concept ACK |
I don't the deferred processing matters-- the checkvalue is protecting against network corruption not malicious input or anything like that, but I also don't think that doing it eagerly is harmful either. One consequence of this is that it moves more processing into the networking thread, rather than leaving it in the message processing thread. As a result it may result in lower performance in cases where hashing was a bottleneck, but I'm doubtful the slowdown will matter much anywhere. |
@gmaxwell It actually doesn't change much, as the message hash is being computed on the fly as the message packets come in. The |
@sipa I was aware, but for many messages (e.g. small ones) the finalization is the hashing. |
"The checksum check happens after deserialising all messages in the current read buffer." <-- No, this has not been the case since #9045. We hash as we receive data, only the finalization (ie last compression round and second hash) happen after we've received the full message. Don't really have a strong feeling about this, it does move a tiny big of CPU cost over to the net thread from net_processing, which is nice, though. |
@TheBlueMatt: thanks for the clarification. The intention of this PR is mainly to move the transportation protocol out of the message processing... and slowly moves towards a I think |
2bf4bf7
to
b9cc94b
Compare
Needs rebase (for sanity checking against the merged 36aeb43) |
b9cc94b
to
e7ab7d7
Compare
Rebased (required for now merged #15246). |
Indeed. I think it is because of the immediate termination of the connection which somehow confuses |
Should we also worry about this breaking networking for nodes behind idiot firewalls that corrupt certain strings? We know they exist because they occasionally cause nodes to be unable to fetch certain blocks and other such sillyness. Right now if the stupidity that triggers them is in an ephemeral message (like a ping, inv, or an addr message) it'll just get dropped, logged, and move on. |
I always wondered. Would the checksum could just be skipped at all? If I understand what is being said here, the only reason for having the checksum is so people have a clearer message in logs if the message get corrupted by firewall? |
@NicolasDorier It could be but it would generally be a bad idea, because then nodes would operate with silent corruption in some cases. In particular, a common failure mode (relatively speaking) is for firewalls rewrite addresses and port numbers, corrupting version handshakes and addr messages. It was more of a concern before we started relaxing banning behaviour, but even without that, no one should have to waste countless hours troubleshooting a really dumb issue like a non-transparent firewall. Doubly so because protocol conformance requires encoding it anyways. (plus, there is simply no reason to do so, the time they take is essentially negligible compared to other activities, and if we wanted to reduce it the authentication in the encrypted transport is faster (and thoroughly not optional)...) |
e7ab7d7
to
1d9bc6a
Compare
Rebased |
Do you think it is possible to do the refactor without the behaviour change? I believe it is only possible to do the behaviour change without the refactor (see #15206 (comment)) but not the other way round. Happy to be proven wrong |
For reference, I had one checksum error on a google cloud test node for one connection. Haven't tested other networks or NAT. Full history for that connection: history_peer_id_1849951.txt (2.5 Mb). (The error happened in an inv message after about 15 minutes of being connected. Total connection time was less than a day) Note that checksum errors are only logged with |
Why wouldn't it be possible? The checksum can be checked in |
Concept ACK on the refactor. After reading the discussion here and in the PR review club I am still undecided if the disconnect is a good idea. We don't seem to have enough data to be confident that it will not be a problem for some users. Would using the banscore be a viable middle ground in this case? Yes, we would ban the peer but only after multiple offenses and we could tweak the number so that we are confident there is a bigger problem (like the actual attack discussed earlier) in which case a ban seems actually more warranted than a disconnect. Banscore is generally not considered much these days it seems but I can't see why it would not be a possible solution in this case. |
@fjahr my view on banscore is that it's not particularly useful. If a peer sends us one bad message it's generally safe to assume that it'll probably send us more bad messages. Why wait until it's sent us 10 or 100 before disconnecting or banning? The ban score is a simple counter (and not some protocol violations/time gauge), so even if a peer sends us one bad message every few days, it'll eventually reach the ban limit. More generally, it's useful to ask why we'd disconnect a peer. I think there are three good reasons:
In all of those cases, I think that if you're going to disconnect, it's better to disconnect immediately, rather than wait for some arbitrary number of protocol violations. Or put differently, if you can tolerate 9 bad messages from a peer, why should you suddenly disconnect on the 10th? From an implementation standpoint, ban scores exist only in the net processing layer. net.cpp has no concept of a peer's ban score. If we want to move checksum verification to the net layer, then we either can't use ban scores, or we need to do a big change in the way they're handled. |
Ran all the tests. All unit tests and functional tests passing. Verified the test fails on master. Still haven't been able to decide on connection dropping behavior.
This seems like a good idea as the good refactoring part of the change can get merged while we gather up more data to decide on connection dropping behavior. Also I was wondering, in the scenario where a firewall mingles with message data causing checksum to fail, do they do it for all data all the time? or they do it intermittently?
Is it possible that my own firewall is playing foul on me which then results in me disconnecting all the nodes? Can this also give signal that something is wrong with my end of the connection? |
Just for information. I see this errors at my node (about 400 connections).
|
Those are clearly malicious nodes and disconnection is the best we can do |
1d9bc6a
to
9164e1c
Compare
Rebased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-ACK 9164e1c rebase only, no change since my last review @ 1d9bc6a per git range-diff 24f7029 1d9bc6a 9164e1c
. If you retouch or for a follow-up, have at a look at suggestions like #15206 (comment), #15206 (comment), #15206 (comment), and #15206 (comment).
In acknowledgement of @MarcoFalke's challenge and @jnewbery's suggestion, this branch holds an example of moving the checksum verification to Thanks again to @jonasschnelli for being the impetus behind these PRs. These will pave the way for improvements like BIP324 and #18989. I hope these branches can clear up some of the worries expressed both here and in the PR Review Club. |
I like the approach in @troygiorshev's branch https://github.com/troygiorshev/bitcoin/tree/p2p-refactor-header, which moves the checks into net.cpp without changing behaviour. @jonasschnelli : can you take a look at that branch and see what you think? Does it interfere with your BIP324 work? |
I don't know how to proceed (if) with this PR. Splitting the PR into a pure refactoring is possible. @troygiorshev branch https://github.com/troygiorshev/bitcoin/tree/p2p-refactor-header looks not bad. If we are not going to do the disconnect (just a refactor), I suggest do close this PR and start with a fresh one for something like @troygiorshev is suggesting. |
🐙 This pull request conflicts with the target branch and needs rebase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I have no problem disconnecting on invalid checksum.
If we continue with this PR, can you run clang-format-diff?
diff --git a/src/net.cpp b/src/net.cpp
index 15e06ffbe..e41c64e44 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -677,12 +677,11 @@ int V1TransportDeserializer::readData(const char *pch, unsigned int nBytes)
if (Complete()) {
const uint256& hash = GetMessageHash();
- if (memcmp(hash.begin(), hdr.pchChecksum, CMessageHeader::CHECKSUM_SIZE) != 0)
- {
+ if (memcmp(hash.begin(), hdr.pchChecksum, CMessageHeader::CHECKSUM_SIZE) != 0) {
LogPrint(BCLog::NET, "CHECKSUM ERROR (%s, %u bytes), expected %s was %s, disconnecting\n",
- SanitizeString(hdr.pchCommand), hdr.nMessageSize,
- HexStr(hash.begin(), hash.begin()+CMessageHeader::CHECKSUM_SIZE),
- HexStr(hdr.pchChecksum, hdr.pchChecksum+CMessageHeader::CHECKSUM_SIZE));
+ SanitizeString(hdr.pchCommand), hdr.nMessageSize,
+ HexStr(hash.begin(), hash.begin() + CMessageHeader::CHECKSUM_SIZE),
+ HexStr(hdr.pchChecksum, hdr.pchChecksum + CMessageHeader::CHECKSUM_SIZE));
return -1;
}
}
…r, extend logging deb5271 Remove header checks out of net_processing (Troy Giorshev) 52d4ae4 Give V1TransportDeserializer CChainParams& member (Troy Giorshev) 5bceef6 Change CMessageHeader Constructor (Troy Giorshev) 1ca20c1 Add doxygen comment for ReceiveMsgBytes (Troy Giorshev) 890b1d7 Move checksum check from net_processing to net (Troy Giorshev) 2716647 Give V1TransportDeserializer an m_node_id member (Troy Giorshev) Pull request description: Inspired by #15206 and #15197, this PR moves all message header verification from the message processing layer and into the network/transport layer. In the previous PRs there is a change in behavior, where we would disconnect from peers upon a single failed checksum check. In various discussions there was concern over whether this was the right choice, and some expressed a desire to see how this would look if it was made to be a pure refactor. For more context, see https://bitcoincore.reviews/15206.html#l-81. This PR improves the separation between the p2p layers, helping improvements like [BIP324](#18242) and #18989. ACKs for top commit: ryanofsky: Code review ACK deb5271 just rebase due to conflict on adjacent line jnewbery: Code review ACK deb5271. Tree-SHA512: 1a3b7ae883b020cfee1bef968813e04df651ffdad9dd961a826bd80654f2c98676ce7f4721038a1b78d8790e4cebe8060419e3d8affc97ce2b9b4e4b72e6fa9f
This is obsoleted by #19107 @jonasschnelli - if you think that there's anything in here that isn't addressed by 19107, feel free to reopen or open another PR. |
…rk layer, extend logging deb5271 Remove header checks out of net_processing (Troy Giorshev) 52d4ae4 Give V1TransportDeserializer CChainParams& member (Troy Giorshev) 5bceef6 Change CMessageHeader Constructor (Troy Giorshev) 1ca20c1 Add doxygen comment for ReceiveMsgBytes (Troy Giorshev) 890b1d7 Move checksum check from net_processing to net (Troy Giorshev) 2716647 Give V1TransportDeserializer an m_node_id member (Troy Giorshev) Pull request description: Inspired by bitcoin#15206 and bitcoin#15197, this PR moves all message header verification from the message processing layer and into the network/transport layer. In the previous PRs there is a change in behavior, where we would disconnect from peers upon a single failed checksum check. In various discussions there was concern over whether this was the right choice, and some expressed a desire to see how this would look if it was made to be a pure refactor. For more context, see https://bitcoincore.reviews/15206.html#l-81. This PR improves the separation between the p2p layers, helping improvements like [BIP324](bitcoin#18242) and bitcoin#18989. ACKs for top commit: ryanofsky: Code review ACK deb5271 just rebase due to conflict on adjacent line jnewbery: Code review ACK deb5271. Tree-SHA512: 1a3b7ae883b020cfee1bef968813e04df651ffdad9dd961a826bd80654f2c98676ce7f4721038a1b78d8790e4cebe8060419e3d8affc97ce2b9b4e4b72e6fa9f
Currently, messages with invalid checksums will be partially tolerated (skipped). The checksum check happens after deserialising all messages in the current read buffer.
This PR moves the checksum check to the network/transport layer (where it probably belongs) and rejects messages with invalid checksums immediately and disconnects the peer.
This PR separates the transport from the processing layer and helps protocol upgrades like BIP151.