Immediately disconnect on invalid net message checksum #15206

jonasschnelli · 2019-01-18T21:16:18Z

Currently, messages with invalid checksums will be partially tolerated (skipped). The checksum check happens after deserialising all messages in the current read buffer.

This PR moves the checksum check to the network/transport layer (where it probably belongs) and rejects messages with invalid checksums immediately and disconnects the peer.

This PR separates the transport from the processing layer and helps protocol upgrades like BIP151.

DrahtBot · 2019-01-18T22:37:03Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

p2p: Move all header verification into the network layer, extend logging #19107 (p2p: Refactor, move all header verification into the network layer, without changing behavior by troygiorshev)
refactor: replace CNode pointers by references within net_processing.{h,cpp} #19053 (refactor: replace CNode pointers by references within net_processing.{h,cpp} by theStack)
net: Use mockable time for ping/pong, add tests #18638 (net: Use mockable time for ping/pong, add tests by MarcoFalke)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

laanwj · 2019-01-19T16:29:19Z

Concept ACK

gmaxwell · 2019-01-19T17:42:22Z

I don't the deferred processing matters-- the checkvalue is protecting against network corruption not malicious input or anything like that, but I also don't think that doing it eagerly is harmful either.

One consequence of this is that it moves more processing into the networking thread, rather than leaving it in the message processing thread. As a result it may result in lower performance in cases where hashing was a bottleneck, but I'm doubtful the slowdown will matter much anywhere.

sipa · 2019-01-19T17:46:16Z

@gmaxwell It actually doesn't change much, as the message hash is being computed on the fly as the message packets come in. The GetMessageHash call only does the SHA256 finalization.

gmaxwell · 2019-01-19T18:07:10Z

@sipa I was aware, but for many messages (e.g. small ones) the finalization is the hashing.

TheBlueMatt · 2019-01-19T21:37:59Z

"The checksum check happens after deserialising all messages in the current read buffer." <-- No, this has not been the case since #9045. We hash as we receive data, only the finalization (ie last compression round and second hash) happen after we've received the full message. Don't really have a strong feeling about this, it does move a tiny big of CPU cost over to the net thread from net_processing, which is nice, though.

jonasschnelli · 2019-01-20T06:11:16Z

@TheBlueMatt: thanks for the clarification.

The intention of this PR is mainly to move the transportation protocol out of the message processing... and slowly moves towards a CNetMessage that is transport logic agnostic. With something like BIP150, we will have a different checksum (Poly1305 in the case of BIP151) and a different header.

I think CNetMessage should at one point only carry the payload and the command (and eventually the recv. time). The serialisation as well as optional encryption/decryption plus the checksum should happen outside of CNetMessage.

src/net.cpp

maflcko · 2019-01-31T15:43:35Z

Needs rebase (for sanity checking against the merged 36aeb43)

jonasschnelli · 2019-01-31T19:45:16Z

Rebased (required for now merged #15246).
Adapted test.

Sjors · 2019-02-01T10:20:55Z

p2p_invalid_messages.py test passes on macOS 10.14.3 for the newly rebased e7ab7d7. It failed on one of the Travis machines though.

jonasschnelli · 2019-02-01T23:48:53Z

p2p_invalid_messages.py test passes on macOS 10.14.3 for the newly rebased e7ab7d7. It failed on one of the Travis machines though.

Indeed. I think it is because of the immediate termination of the connection which somehow confuses mininode's getdata. I'll investigate.

gmaxwell · 2019-02-03T04:56:28Z

Should we also worry about this breaking networking for nodes behind idiot firewalls that corrupt certain strings? We know they exist because they occasionally cause nodes to be unable to fetch certain blocks and other such sillyness. Right now if the stupidity that triggers them is in an ephemeral message (like a ping, inv, or an addr message) it'll just get dropped, logged, and move on.

NicolasDorier · 2019-02-03T05:37:34Z

I always wondered. Would the checksum could just be skipped at all?

If I understand what is being said here, the only reason for having the checksum is so people have a clearer message in logs if the message get corrupted by firewall?

gmaxwell · 2019-02-03T19:57:48Z

@NicolasDorier It could be but it would generally be a bad idea, because then nodes would operate with silent corruption in some cases. In particular, a common failure mode (relatively speaking) is for firewalls rewrite addresses and port numbers, corrupting version handshakes and addr messages. It was more of a concern before we started relaxing banning behaviour, but even without that, no one should have to waste countless hours troubleshooting a really dumb issue like a non-transparent firewall. Doubly so because protocol conformance requires encoding it anyways.

(plus, there is simply no reason to do so, the time they take is essentially negligible compared to other activities, and if we wanted to reduce it the authentication in the encrypted transport is faster (and thoroughly not optional)...)

jonasschnelli · 2020-05-07T07:15:53Z

Rebased

maflcko · 2020-05-21T15:08:57Z

Do you think it is possible to do the refactor without the behaviour change? I believe it is only possible to do the behaviour change without the refactor (see #15206 (comment)) but not the other way round.

Happy to be proven wrong

maflcko · 2020-05-21T15:20:02Z

For reference, I had one checksum error on a google cloud test node for one connection. Haven't tested other networks or NAT. Full history for that connection: history_peer_id_1849951.txt (2.5 Mb). (The error happened in an inv message after about 15 minutes of being connected. Total connection time was less than a day)

Note that checksum errors are only logged with -debug or -debug=net.

jnewbery · 2020-05-21T15:24:01Z

Do you think it is possible to do the refactor without the behaviour change?

Why wouldn't it be possible? The checksum can be checked in ReceiveMessageBytes() and the message dropped (with a log) if the checksum is wrong.

fjahr · 2020-05-21T21:01:11Z

Concept ACK on the refactor. After reading the discussion here and in the PR review club I am still undecided if the disconnect is a good idea. We don't seem to have enough data to be confident that it will not be a problem for some users.

Would using the banscore be a viable middle ground in this case? Yes, we would ban the peer but only after multiple offenses and we could tweak the number so that we are confident there is a bigger problem (like the actual attack discussed earlier) in which case a ban seems actually more warranted than a disconnect.

Banscore is generally not considered much these days it seems but I can't see why it would not be a possible solution in this case.

jnewbery · 2020-05-22T00:57:59Z

Banscore is generally not considered much these days it seems but I can't see why it would not be a possible solution in this case.

@fjahr my view on banscore is that it's not particularly useful. If a peer sends us one bad message it's generally safe to assume that it'll probably send us more bad messages. Why wait until it's sent us 10 or 100 before disconnecting or banning? The ban score is a simple counter (and not some protocol violations/time gauge), so even if a peer sends us one bad message every few days, it'll eventually reach the ban limit.

More generally, it's useful to ask why we'd disconnect a peer. I think there are three good reasons:

to protect ourselves from resource-wasting by broken peers. Note that this doesn't really help against adversarial peers if we're accepting incoming connections, since they can always reconnect, and even banning connections from that IP address isn't much protection, since acquiring new IP addresses to connect from is trivial for an adversary.
to free up on of our limited connection slots for a better peer (eg one that is more likely to relay blocks or propagate transactions to us)
3 (maybe) in some cases, disconnecting from peers is the only way to signal to them that they're sending us bad data. If all your peers are disconnecting from you, then you can infer that there's probably something wrong with your connection or implementation. If peers were just dropping your messages, you may never find out.

In all of those cases, I think that if you're going to disconnect, it's better to disconnect immediately, rather than wait for some arbitrary number of protocol violations. Or put differently, if you can tolerate 9 bad messages from a peer, why should you suddenly disconnect on the 10th?

From an implementation standpoint, ban scores exist only in the net processing layer. net.cpp has no concept of a peer's ban score. If we want to move checksum verification to the net layer, then we either can't use ban scores, or we need to do a big change in the way they're handled.

rajarshimaitra · 2020-05-22T11:22:47Z

Ran all the tests. All unit tests and functional tests passing. Verified the test fails on master. Still haven't been able to decide on connection dropping behavior.

Do you think it is possible to do the refactor without the behavior change?

This seems like a good idea as the good refactoring part of the change can get merged while we gather up more data to decide on connection dropping behavior.

Also I was wondering, in the scenario where a firewall mingles with message data causing checksum to fail, do they do it for all data all the time? or they do it intermittently?

3 (maybe) in some cases, disconnecting from peers is the only way to signal to them that they're sending us bad data. If all your peers are disconnecting from you, then you can infer that there's probably something wrong with your connection or implementation. If peers were just dropping your messages, you may never find out.

Is it possible that my own firewall is playing foul on me which then results in me disconnecting all the nodes? Can this also give signal that something is wrong with my end of the connection?

Emzy · 2020-05-22T13:15:04Z

Just for information. I see this errors at my node (about 400 connections).

2020-05-21T00:22:57Z CHECKSUM ERROR (SMBr, 0 bytes), expected 5df6e0e2 was 00000000 2020-05-21T00:23:44Z CHECKSUM ERROR (, 0 bytes), expected 5df6e0e2 was 00000000 2020-05-21T00:23:55Z CHECKSUM ERROR (, 1 bytes), expected f3035c79 was 01000000
About 15 on one day.
Logging net is enabled.

maflcko · 2020-05-22T13:32:26Z

Those are clearly malicious nodes and disconnection is the best we can do

jonasschnelli · 2020-05-26T07:00:08Z

Rebased.

jonatack

re-ACK 9164e1c rebase only, no change since my last review @ 1d9bc6a per git range-diff 24f7029 1d9bc6a 9164e1c. If you retouch or for a follow-up, have at a look at suggestions like #15206 (comment), #15206 (comment), #15206 (comment), and #15206 (comment).

troygiorshev · 2020-05-27T14:40:45Z

In acknowledgement of @MarcoFalke's challenge and @jnewbery's suggestion, this branch holds an example of moving the checksum verification to GetMessage() (called by ReceiveMsgBytes()), without changing the current disconnect/don't disconnect behavior. In agreement with this other comment, a natural continuation could be this branch, which moves all header checks into net, again without changing the current behavior (along the lines of #15197).

Thanks again to @jonasschnelli for being the impetus behind these PRs. These will pave the way for improvements like BIP324 and #18989. I hope these branches can clear up some of the worries expressed both here and in the PR Review Club.

jnewbery · 2020-05-27T15:07:37Z

I like the approach in @troygiorshev's branch https://github.com/troygiorshev/bitcoin/tree/p2p-refactor-header, which moves the checks into net.cpp without changing behaviour. @jonasschnelli : can you take a look at that branch and see what you think? Does it interfere with your BIP324 work?

jonasschnelli · 2020-05-29T06:03:05Z

I don't know how to proceed (if) with this PR.
It seems like that the immediate disconnect on a single checksum is to harsh (by some contributors judgment). I'm personally unsure wether it's too strict or desirable.

Splitting the PR into a pure refactoring is possible. @troygiorshev branch https://github.com/troygiorshev/bitcoin/tree/p2p-refactor-header looks not bad.

If we are not going to do the disconnect (just a refactor), I suggest do close this PR and start with a fresh one for something like @troygiorshev is suggesting.

DrahtBot · 2020-06-04T20:37:22Z

🐙 This pull request conflicts with the target branch and needs rebase.

PastaPastaPasta

Personally, I have no problem disconnecting on invalid checksum.

If we continue with this PR, can you run clang-format-diff?

diff --git a/src/net.cpp b/src/net.cpp
index 15e06ffbe..e41c64e44 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -677,12 +677,11 @@ int V1TransportDeserializer::readData(const char *pch, unsigned int nBytes)
 
     if (Complete()) {
         const uint256& hash = GetMessageHash();
-        if (memcmp(hash.begin(), hdr.pchChecksum, CMessageHeader::CHECKSUM_SIZE) != 0)
-        {
+        if (memcmp(hash.begin(), hdr.pchChecksum, CMessageHeader::CHECKSUM_SIZE) != 0) {
             LogPrint(BCLog::NET, "CHECKSUM ERROR (%s, %u bytes), expected %s was %s, disconnecting\n",
-               SanitizeString(hdr.pchCommand), hdr.nMessageSize,
-               HexStr(hash.begin(), hash.begin()+CMessageHeader::CHECKSUM_SIZE),
-               HexStr(hdr.pchChecksum, hdr.pchChecksum+CMessageHeader::CHECKSUM_SIZE));
+                SanitizeString(hdr.pchCommand), hdr.nMessageSize,
+                HexStr(hash.begin(), hash.begin() + CMessageHeader::CHECKSUM_SIZE),
+                HexStr(hdr.pchChecksum, hdr.pchChecksum + CMessageHeader::CHECKSUM_SIZE));
             return -1;
         }
     }

…r, extend logging deb5271 Remove header checks out of net_processing (Troy Giorshev) 52d4ae4 Give V1TransportDeserializer CChainParams& member (Troy Giorshev) 5bceef6 Change CMessageHeader Constructor (Troy Giorshev) 1ca20c1 Add doxygen comment for ReceiveMsgBytes (Troy Giorshev) 890b1d7 Move checksum check from net_processing to net (Troy Giorshev) 2716647 Give V1TransportDeserializer an m_node_id member (Troy Giorshev) Pull request description: Inspired by #15206 and #15197, this PR moves all message header verification from the message processing layer and into the network/transport layer. In the previous PRs there is a change in behavior, where we would disconnect from peers upon a single failed checksum check. In various discussions there was concern over whether this was the right choice, and some expressed a desire to see how this would look if it was made to be a pure refactor. For more context, see https://bitcoincore.reviews/15206.html#l-81. This PR improves the separation between the p2p layers, helping improvements like [BIP324](#18242) and #18989. ACKs for top commit: ryanofsky: Code review ACK deb5271 just rebase due to conflict on adjacent line jnewbery: Code review ACK deb5271. Tree-SHA512: 1a3b7ae883b020cfee1bef968813e04df651ffdad9dd961a826bd80654f2c98676ce7f4721038a1b78d8790e4cebe8060419e3d8affc97ce2b9b4e4b72e6fa9f

jnewbery · 2020-09-29T10:39:31Z

This is obsoleted by #19107

@jonasschnelli - if you think that there's anything in here that isn't addressed by 19107, feel free to reopen or open another PR.

…rk layer, extend logging deb5271 Remove header checks out of net_processing (Troy Giorshev) 52d4ae4 Give V1TransportDeserializer CChainParams& member (Troy Giorshev) 5bceef6 Change CMessageHeader Constructor (Troy Giorshev) 1ca20c1 Add doxygen comment for ReceiveMsgBytes (Troy Giorshev) 890b1d7 Move checksum check from net_processing to net (Troy Giorshev) 2716647 Give V1TransportDeserializer an m_node_id member (Troy Giorshev) Pull request description: Inspired by bitcoin#15206 and bitcoin#15197, this PR moves all message header verification from the message processing layer and into the network/transport layer. In the previous PRs there is a change in behavior, where we would disconnect from peers upon a single failed checksum check. In various discussions there was concern over whether this was the right choice, and some expressed a desire to see how this would look if it was made to be a pure refactor. For more context, see https://bitcoincore.reviews/15206.html#l-81. This PR improves the separation between the p2p layers, helping improvements like [BIP324](bitcoin#18242) and bitcoin#18989. ACKs for top commit: ryanofsky: Code review ACK deb5271 just rebase due to conflict on adjacent line jnewbery: Code review ACK deb5271. Tree-SHA512: 1a3b7ae883b020cfee1bef968813e04df651ffdad9dd961a826bd80654f2c98676ce7f4721038a1b78d8790e4cebe8060419e3d8affc97ce2b9b4e4b72e6fa9f

jonasschnelli added the P2P label Jan 18, 2019

jonasschnelli changed the title ~~p2p: immediately disconnect on invalid net message checksum~~ Immediately disconnect on invalid net message checksum Jan 18, 2019

Sjors reviewed Jan 23, 2019

View reviewed changes

src/net.cpp Outdated Show resolved Hide resolved

src/net.cpp Show resolved Hide resolved

jonasschnelli force-pushed the 2019/01/netmsg_2 branch from 2bf4bf7 to b9cc94b Compare January 24, 2019 19:34

jonasschnelli force-pushed the 2019/01/netmsg_2 branch from b9cc94b to e7ab7d7 Compare January 31, 2019 19:44

maflcko closed this Feb 3, 2019

maflcko reopened this Feb 3, 2019

jonasschnelli mentioned this pull request Jul 24, 2019

p2p: Refactor network message deserialization #16202

Merged

DrahtBot added the Needs rebase label Oct 9, 2019

rebroad mentioned this pull request Apr 25, 2020

IBD slowed down by failing to deal with a bad checksum block message #18755

Closed

jonasschnelli mentioned this pull request May 6, 2020

Add BIP324 encrypted p2p transport de-/serializer (only used in tests) #18242

Closed

jonasschnelli force-pushed the 2019/01/netmsg_2 branch from e7ab7d7 to 1d9bc6a Compare May 7, 2020 07:11

DrahtBot removed the Needs rebase label May 7, 2020

This was referenced May 7, 2020

Make g_chainman internal to validation #18698

Merged

net: Use mockable time for ping/pong, add tests #18638

Merged

WasanTantiwararom mentioned this pull request May 21, 2020

[WIP] transaction fees in getblock #16083

Closed

DrahtBot mentioned this pull request May 22, 2020

refactor: replace CNode pointers by references within net_processing.{h,cpp} #19053

Merged

DrahtBot added the Needs rebase label May 23, 2020

p2p: immediately disconnect on invalid net message checksum

9164e1c

jonasschnelli force-pushed the 2019/01/netmsg_2 branch from 1d9bc6a to 9164e1c Compare May 26, 2020 07:00

DrahtBot removed the Needs rebase label May 26, 2020

jonatack reviewed May 26, 2020

View reviewed changes

This was referenced May 26, 2020

[BROKEN IGNORE] troygiorshev/bitcoin#1

Closed

p2p: Refactor, move checksum verification into net, without changing behavior troygiorshev/bitcoin#2

Closed

troygiorshev mentioned this pull request May 29, 2020

p2p: Move all header verification into the network layer, extend logging #19107

Merged

DrahtBot added the Needs rebase label Jun 4, 2020

PastaPastaPasta reviewed Jun 28, 2020

View reviewed changes

jnewbery closed this Sep 29, 2020

bitcoin locked as resolved and limited conversation to collaborators Feb 15, 2022

Immediately disconnect on invalid net message checksum #15206

Immediately disconnect on invalid net message checksum #15206

Uh oh!

Conversation

jonasschnelli commented Jan 18, 2019

Uh oh!

DrahtBot commented Jan 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Conflicts

Uh oh!

laanwj commented Jan 19, 2019

Uh oh!

gmaxwell commented Jan 19, 2019

Uh oh!

sipa commented Jan 19, 2019

Uh oh!

gmaxwell commented Jan 19, 2019

Uh oh!

TheBlueMatt commented Jan 19, 2019

Uh oh!

jonasschnelli commented Jan 20, 2019

Uh oh!

Uh oh!

Uh oh!

maflcko commented Jan 31, 2019

Uh oh!

jonasschnelli commented Jan 31, 2019

Uh oh!

Sjors commented Feb 1, 2019

Uh oh!

jonasschnelli commented Feb 1, 2019

Uh oh!

gmaxwell commented Feb 3, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NicolasDorier commented Feb 3, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gmaxwell commented Feb 3, 2019

Uh oh!

jonasschnelli commented May 7, 2020

Uh oh!

maflcko commented May 21, 2020

Uh oh!

maflcko commented May 21, 2020

Uh oh!

jnewbery commented May 21, 2020

Uh oh!

fjahr commented May 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jnewbery commented May 22, 2020

Uh oh!

rajarshimaitra commented May 22, 2020

Uh oh!

Emzy commented May 22, 2020

Uh oh!

maflcko commented May 22, 2020

Uh oh!

jonasschnelli commented May 26, 2020

Uh oh!

jonatack left a comment

Choose a reason for hiding this comment

Uh oh!

troygiorshev commented May 27, 2020

Uh oh!

jnewbery commented May 27, 2020

Uh oh!

jonasschnelli commented May 29, 2020

Uh oh!

DrahtBot commented Jun 4, 2020

Uh oh!

PastaPastaPasta left a comment

Choose a reason for hiding this comment

Uh oh!

jnewbery commented Sep 29, 2020

Uh oh!

Uh oh!

DrahtBot commented Jan 18, 2019 •

edited

Loading

gmaxwell commented Feb 3, 2019 •

edited

Loading

NicolasDorier commented Feb 3, 2019 •

edited

Loading

fjahr commented May 21, 2020 •

edited

Loading