Skip to content

Conversation

jonasschnelli
Copy link
Contributor

Currently, messages with invalid checksums will be partially tolerated (skipped). The checksum check happens after deserialising all messages in the current read buffer.

This PR moves the checksum check to the network/transport layer (where it probably belongs) and rejects messages with invalid checksums immediately and disconnects the peer.

This PR separates the transport from the processing layer and helps protocol upgrades like BIP151.

@jonasschnelli jonasschnelli changed the title p2p: immediately disconnect on invalid net message checksum Immediately disconnect on invalid net message checksum Jan 18, 2019
@DrahtBot
Copy link
Contributor

DrahtBot commented Jan 18, 2019

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@laanwj
Copy link
Member

laanwj commented Jan 19, 2019

Concept ACK

@gmaxwell
Copy link
Contributor

I don't the deferred processing matters-- the checkvalue is protecting against network corruption not malicious input or anything like that, but I also don't think that doing it eagerly is harmful either.

One consequence of this is that it moves more processing into the networking thread, rather than leaving it in the message processing thread. As a result it may result in lower performance in cases where hashing was a bottleneck, but I'm doubtful the slowdown will matter much anywhere.

@sipa
Copy link
Member

sipa commented Jan 19, 2019

@gmaxwell It actually doesn't change much, as the message hash is being computed on the fly as the message packets come in. The GetMessageHash call only does the SHA256 finalization.

@gmaxwell
Copy link
Contributor

@sipa I was aware, but for many messages (e.g. small ones) the finalization is the hashing.

@TheBlueMatt
Copy link
Contributor

"The checksum check happens after deserialising all messages in the current read buffer." <-- No, this has not been the case since #9045. We hash as we receive data, only the finalization (ie last compression round and second hash) happen after we've received the full message. Don't really have a strong feeling about this, it does move a tiny big of CPU cost over to the net thread from net_processing, which is nice, though.

@jonasschnelli
Copy link
Contributor Author

@TheBlueMatt: thanks for the clarification.

The intention of this PR is mainly to move the transportation protocol out of the message processing... and slowly moves towards a CNetMessage that is transport logic agnostic. With something like BIP150, we will have a different checksum (Poly1305 in the case of BIP151) and a different header.

I think CNetMessage should at one point only carry the payload and the command (and eventually the recv. time). The serialisation as well as optional encryption/decryption plus the checksum should happen outside of CNetMessage.

@maflcko
Copy link
Member

maflcko commented Jan 31, 2019

Needs rebase (for sanity checking against the merged 36aeb43)

@jonasschnelli
Copy link
Contributor Author

Rebased (required for now merged #15246).
Adapted test.

@Sjors
Copy link
Member

Sjors commented Feb 1, 2019

p2p_invalid_messages.py test passes on macOS 10.14.3 for the newly rebased e7ab7d7. It failed on one of the Travis machines though.

@jonasschnelli
Copy link
Contributor Author

p2p_invalid_messages.py test passes on macOS 10.14.3 for the newly rebased e7ab7d7. It failed on one of the Travis machines though.

Indeed. I think it is because of the immediate termination of the connection which somehow confuses mininode's getdata. I'll investigate.

@maflcko maflcko closed this Feb 3, 2019
@maflcko maflcko reopened this Feb 3, 2019
@gmaxwell
Copy link
Contributor

gmaxwell commented Feb 3, 2019

Should we also worry about this breaking networking for nodes behind idiot firewalls that corrupt certain strings? We know they exist because they occasionally cause nodes to be unable to fetch certain blocks and other such sillyness. Right now if the stupidity that triggers them is in an ephemeral message (like a ping, inv, or an addr message) it'll just get dropped, logged, and move on.

@NicolasDorier
Copy link
Contributor

NicolasDorier commented Feb 3, 2019

I always wondered. Would the checksum could just be skipped at all?

If I understand what is being said here, the only reason for having the checksum is so people have a clearer message in logs if the message get corrupted by firewall?

@gmaxwell
Copy link
Contributor

gmaxwell commented Feb 3, 2019

@NicolasDorier It could be but it would generally be a bad idea, because then nodes would operate with silent corruption in some cases. In particular, a common failure mode (relatively speaking) is for firewalls rewrite addresses and port numbers, corrupting version handshakes and addr messages. It was more of a concern before we started relaxing banning behaviour, but even without that, no one should have to waste countless hours troubleshooting a really dumb issue like a non-transparent firewall. Doubly so because protocol conformance requires encoding it anyways.

(plus, there is simply no reason to do so, the time they take is essentially negligible compared to other activities, and if we wanted to reduce it the authentication in the encrypted transport is faster (and thoroughly not optional)...)

@jonasschnelli
Copy link
Contributor Author

Rebased

@maflcko
Copy link
Member

maflcko commented May 21, 2020

Do you think it is possible to do the refactor without the behaviour change? I believe it is only possible to do the behaviour change without the refactor (see #15206 (comment)) but not the other way round.

Happy to be proven wrong

@maflcko
Copy link
Member

maflcko commented May 21, 2020

For reference, I had one checksum error on a google cloud test node for one connection. Haven't tested other networks or NAT. Full history for that connection: history_peer_id_1849951.txt (2.5 Mb). (The error happened in an inv message after about 15 minutes of being connected. Total connection time was less than a day)

Note that checksum errors are only logged with -debug or -debug=net.

@jnewbery
Copy link
Contributor

Do you think it is possible to do the refactor without the behaviour change?

Why wouldn't it be possible? The checksum can be checked in ReceiveMessageBytes() and the message dropped (with a log) if the checksum is wrong.

@fjahr
Copy link
Contributor

fjahr commented May 21, 2020

Concept ACK on the refactor. After reading the discussion here and in the PR review club I am still undecided if the disconnect is a good idea. We don't seem to have enough data to be confident that it will not be a problem for some users.

Would using the banscore be a viable middle ground in this case? Yes, we would ban the peer but only after multiple offenses and we could tweak the number so that we are confident there is a bigger problem (like the actual attack discussed earlier) in which case a ban seems actually more warranted than a disconnect.

Banscore is generally not considered much these days it seems but I can't see why it would not be a possible solution in this case.

@jnewbery
Copy link
Contributor

Banscore is generally not considered much these days it seems but I can't see why it would not be a possible solution in this case.

@fjahr my view on banscore is that it's not particularly useful. If a peer sends us one bad message it's generally safe to assume that it'll probably send us more bad messages. Why wait until it's sent us 10 or 100 before disconnecting or banning? The ban score is a simple counter (and not some protocol violations/time gauge), so even if a peer sends us one bad message every few days, it'll eventually reach the ban limit.

More generally, it's useful to ask why we'd disconnect a peer. I think there are three good reasons:

  1. to protect ourselves from resource-wasting by broken peers. Note that this doesn't really help against adversarial peers if we're accepting incoming connections, since they can always reconnect, and even banning connections from that IP address isn't much protection, since acquiring new IP addresses to connect from is trivial for an adversary.
  2. to free up on of our limited connection slots for a better peer (eg one that is more likely to relay blocks or propagate transactions to us)
    3 (maybe) in some cases, disconnecting from peers is the only way to signal to them that they're sending us bad data. If all your peers are disconnecting from you, then you can infer that there's probably something wrong with your connection or implementation. If peers were just dropping your messages, you may never find out.

In all of those cases, I think that if you're going to disconnect, it's better to disconnect immediately, rather than wait for some arbitrary number of protocol violations. Or put differently, if you can tolerate 9 bad messages from a peer, why should you suddenly disconnect on the 10th?

From an implementation standpoint, ban scores exist only in the net processing layer. net.cpp has no concept of a peer's ban score. If we want to move checksum verification to the net layer, then we either can't use ban scores, or we need to do a big change in the way they're handled.

@rajarshimaitra
Copy link
Contributor

Ran all the tests. All unit tests and functional tests passing. Verified the test fails on master. Still haven't been able to decide on connection dropping behavior.

Do you think it is possible to do the refactor without the behavior change?

This seems like a good idea as the good refactoring part of the change can get merged while we gather up more data to decide on connection dropping behavior.

Also I was wondering, in the scenario where a firewall mingles with message data causing checksum to fail, do they do it for all data all the time? or they do it intermittently?

3 (maybe) in some cases, disconnecting from peers is the only way to signal to them that they're sending us bad data. If all your peers are disconnecting from you, then you can infer that there's probably something wrong with your connection or implementation. If peers were just dropping your messages, you may never find out.

Is it possible that my own firewall is playing foul on me which then results in me disconnecting all the nodes? Can this also give signal that something is wrong with my end of the connection?

@Emzy
Copy link
Contributor

Emzy commented May 22, 2020

Just for information. I see this errors at my node (about 400 connections).

2020-05-21T00:22:57Z CHECKSUM ERROR (SMBr, 0 bytes), expected 5df6e0e2 was 00000000 2020-05-21T00:23:44Z CHECKSUM ERROR (, 0 bytes), expected 5df6e0e2 was 00000000 2020-05-21T00:23:55Z CHECKSUM ERROR (, 1 bytes), expected f3035c79 was 01000000
About 15 on one day.
Logging net is enabled.

@maflcko
Copy link
Member

maflcko commented May 22, 2020

Those are clearly malicious nodes and disconnection is the best we can do

@jonasschnelli
Copy link
Contributor Author

Rebased.

Copy link
Member

@jonatack jonatack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-ACK 9164e1c rebase only, no change since my last review @ 1d9bc6a per git range-diff 24f7029 1d9bc6a 9164e1c. If you retouch or for a follow-up, have at a look at suggestions like #15206 (comment), #15206 (comment), #15206 (comment), and #15206 (comment).

@troygiorshev
Copy link
Contributor

In acknowledgement of @MarcoFalke's challenge and @jnewbery's suggestion, this branch holds an example of moving the checksum verification to GetMessage() (called by ReceiveMsgBytes()), without changing the current disconnect/don't disconnect behavior. In agreement with this other comment, a natural continuation could be this branch, which moves all header checks into net, again without changing the current behavior (along the lines of #15197).

Thanks again to @jonasschnelli for being the impetus behind these PRs. These will pave the way for improvements like BIP324 and #18989. I hope these branches can clear up some of the worries expressed both here and in the PR Review Club.

@jnewbery
Copy link
Contributor

I like the approach in @troygiorshev's branch https://github.com/troygiorshev/bitcoin/tree/p2p-refactor-header, which moves the checks into net.cpp without changing behaviour. @jonasschnelli : can you take a look at that branch and see what you think? Does it interfere with your BIP324 work?

@jonasschnelli
Copy link
Contributor Author

I don't know how to proceed (if) with this PR.
It seems like that the immediate disconnect on a single checksum is to harsh (by some contributors judgment). I'm personally unsure wether it's too strict or desirable.

Splitting the PR into a pure refactoring is possible. @troygiorshev branch https://github.com/troygiorshev/bitcoin/tree/p2p-refactor-header looks not bad.

If we are not going to do the disconnect (just a refactor), I suggest do close this PR and start with a fresh one for something like @troygiorshev is suggesting.

@DrahtBot
Copy link
Contributor

DrahtBot commented Jun 4, 2020

🐙 This pull request conflicts with the target branch and needs rebase.

Copy link
Contributor

@PastaPastaPasta PastaPastaPasta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I have no problem disconnecting on invalid checksum.

If we continue with this PR, can you run clang-format-diff?

diff --git a/src/net.cpp b/src/net.cpp
index 15e06ffbe..e41c64e44 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -677,12 +677,11 @@ int V1TransportDeserializer::readData(const char *pch, unsigned int nBytes)
 
     if (Complete()) {
         const uint256& hash = GetMessageHash();
-        if (memcmp(hash.begin(), hdr.pchChecksum, CMessageHeader::CHECKSUM_SIZE) != 0)
-        {
+        if (memcmp(hash.begin(), hdr.pchChecksum, CMessageHeader::CHECKSUM_SIZE) != 0) {
             LogPrint(BCLog::NET, "CHECKSUM ERROR (%s, %u bytes), expected %s was %s, disconnecting\n",
-               SanitizeString(hdr.pchCommand), hdr.nMessageSize,
-               HexStr(hash.begin(), hash.begin()+CMessageHeader::CHECKSUM_SIZE),
-               HexStr(hdr.pchChecksum, hdr.pchChecksum+CMessageHeader::CHECKSUM_SIZE));
+                SanitizeString(hdr.pchCommand), hdr.nMessageSize,
+                HexStr(hash.begin(), hash.begin() + CMessageHeader::CHECKSUM_SIZE),
+                HexStr(hdr.pchChecksum, hdr.pchChecksum + CMessageHeader::CHECKSUM_SIZE));
             return -1;
         }
     }

fanquake added a commit that referenced this pull request Sep 29, 2020
…r, extend logging

deb5271 Remove header checks out of net_processing (Troy Giorshev)
52d4ae4 Give V1TransportDeserializer CChainParams& member (Troy Giorshev)
5bceef6 Change CMessageHeader Constructor (Troy Giorshev)
1ca20c1 Add doxygen comment for ReceiveMsgBytes (Troy Giorshev)
890b1d7 Move checksum check from net_processing to net (Troy Giorshev)
2716647 Give V1TransportDeserializer an m_node_id member (Troy Giorshev)

Pull request description:

  Inspired by #15206 and #15197, this PR moves all message header verification from the message processing layer and into the network/transport layer.

  In the previous PRs there is a change in behavior, where we would disconnect from peers upon a single failed checksum check.  In various discussions there was concern over whether this was the right choice, and some expressed a desire to see how this would look if it was made to be a pure refactor.

  For more context, see https://bitcoincore.reviews/15206.html#l-81.

  This PR improves the separation between the p2p layers, helping improvements like [BIP324](#18242) and #18989.

ACKs for top commit:
  ryanofsky:
    Code review ACK deb5271 just rebase due to conflict on adjacent line
  jnewbery:
    Code review ACK deb5271.

Tree-SHA512: 1a3b7ae883b020cfee1bef968813e04df651ffdad9dd961a826bd80654f2c98676ce7f4721038a1b78d8790e4cebe8060419e3d8affc97ce2b9b4e4b72e6fa9f
@jnewbery
Copy link
Contributor

This is obsoleted by #19107

@jonasschnelli - if you think that there's anything in here that isn't addressed by 19107, feel free to reopen or open another PR.

@jnewbery jnewbery closed this Sep 29, 2020
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Sep 29, 2020
…rk layer, extend logging

deb5271 Remove header checks out of net_processing (Troy Giorshev)
52d4ae4 Give V1TransportDeserializer CChainParams& member (Troy Giorshev)
5bceef6 Change CMessageHeader Constructor (Troy Giorshev)
1ca20c1 Add doxygen comment for ReceiveMsgBytes (Troy Giorshev)
890b1d7 Move checksum check from net_processing to net (Troy Giorshev)
2716647 Give V1TransportDeserializer an m_node_id member (Troy Giorshev)

Pull request description:

  Inspired by bitcoin#15206 and bitcoin#15197, this PR moves all message header verification from the message processing layer and into the network/transport layer.

  In the previous PRs there is a change in behavior, where we would disconnect from peers upon a single failed checksum check.  In various discussions there was concern over whether this was the right choice, and some expressed a desire to see how this would look if it was made to be a pure refactor.

  For more context, see https://bitcoincore.reviews/15206.html#l-81.

  This PR improves the separation between the p2p layers, helping improvements like [BIP324](bitcoin#18242) and bitcoin#18989.

ACKs for top commit:
  ryanofsky:
    Code review ACK deb5271 just rebase due to conflict on adjacent line
  jnewbery:
    Code review ACK deb5271.

Tree-SHA512: 1a3b7ae883b020cfee1bef968813e04df651ffdad9dd961a826bd80654f2c98676ce7f4721038a1b78d8790e4cebe8060419e3d8affc97ce2b9b4e4b72e6fa9f
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.