
Conversation

@guggero (Contributor) commented Feb 23, 2022

This is a follow-up PR to fix an issue that #240 seems to have introduced.
When we close the peerQuit channel after receiving an INV message, we don't wait for any reject messages from that peer.

Instead we want to wait a bit longer after an INV message to see whether they respond with a reject message. Otherwise we would never detect a double-spend TX with Neutrino (this was found through a unit test in lnd: lightningnetwork/lnd#6285 (comment)).

Not sure if this proposed solution is any better; I'm definitely open to suggestions, as I'm not too familiar with this code base.

A peer might send an INV (MsgGetData) message multiple times, screwing
up our replies count and causing us to not look at the reject
messages.
By de-duplicating the replies and reject messages, we make sure we
count them correctly.
query.go Outdated
Comment on lines 1109 to 1129
```go
// Peers might send the INV
// request multiple times, we
// need to de-duplicate them
// using a map.
replyChan := make(chan struct{})
replies[sp.ID()] = replyChan

// Okay, so the peer responded
// with an INV message, and we
// sent out the TX. If we never
// hear anything back from the
// peer it means they accepted
// the tx. If we get a reject,
// things are clear as well. But
// what if they are just slow to
// respond? We'll give them
// another bit of time to
// respond.
closeEventually(
	peerQuit, replyChan,
)
```
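
For readers not familiar with the helper referenced above, here is a minimal sketch of what closeEventually could look like, assuming it just delays closing the peer's quit channel until either the reply channel fires (e.g. a reject came in) or a short grace period (something like the QueryRejectTimeout discussed below) elapses. The signature and the rejectTimeout parameter are assumptions for illustration, not the actual implementation in this PR:

```go
package sketch

import "time"

// closeEventuallySketch waits for either a reply from the peer or for the
// grace period to run out before closing the quit channel, so a late
// reject message still gets a chance to be processed. Purely
// illustrative; the real closeEventually in this PR may differ.
func closeEventuallySketch(quit chan struct{}, reply <-chan struct{},
	rejectTimeout time.Duration) {

	go func() {
		select {
		case <-reply:
			// The peer sent something back (e.g. a reject).
		case <-time.After(rejectTimeout):
			// No further response within the grace period.
		}

		// Only now stop waiting on this peer.
		close(quit)
	}()
}
```
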
@chappjc (Contributor) commented Feb 23, 2022

Hmm, sorry if I broke this. I must have been mistaken in thinking that our inv request would be completed after receiving a getdata response (this case) and handling it by queuing the msgtx send.

Is the issue that after this first getdata response case handles it, there may be a subsequent reject that should be handled by this same closure? My (apparently bad) assumption was that further responses to the tx data were handled elsewhere, since that tx send was queued up for asynchronous sending.

Looks like the QueryRejectTimeout is meant to handle any such second response following handling of the first getdata response? 1 second is certainly better than the full 5 second broadcast timeout that was always hit before! What you're doing with replyChan makes sense in that context.

Thanks for the proper resolution to the broadcast timeout issue I was trying to solve. :)

@guggero (Contributor, Author) commented:

> Hmm, sorry if I broke this.

No problem. I guess that's what our unit and integration tests in lnd are there for.

The actual problem is that publishing a TX over the p2p network isn't a synchronous process.
What happens is the following:

  1. The Neutrino client sends out an INV message with just the TXID of the so far unpublished TX ("hey y'all, have you ever heard of this TX?").
  2. Every peer will look at their mempool and, if they don't find it, they'll answer back with a GetData message ("this is news to me, mind sending over that full TX?").
    a) Now it could be that the peer just doesn't respond in time. In that case we just ignore that peer.
    b) It could also be that the peer already knows of the TX (for example we published it before). In that case it won't react at all (I think).
  3. We send the full TX out to each peer that requested it as an MsgTx ("here it is, plz forward!").
  4. The peer now either accepts the TX and forwards it or rejects it.
    a) If the peer accepts, we don't get any ACK or other response. After some time we just need to assume they're okay with it.
    b) If the peer rejects, they send back a MsgReject and we know something's up.

From what I understand, all of this happens in the closure above. I'm not 100% sure my solution works in all cases. But at least it should reduce the time for a normal, successful publish from 5 seconds to 1 second.

Why 1 second and not 5 as before?
The assumption here is that if a peer responds quickly (e.g. before the overall timeout), it should respond with a reject reasonably quickly too. If a peer fails to respond within a second, we still have the threshold detection to recognize invalid transactions.
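
To make step 4 and the shortened timeout concrete, here is a rough sketch of that decision in isolation, using the btcsuite wire types. The function and parameter names (waitForReject, rejectChan, graceTimeout) are hypothetical and this is not the actual neutrino code, just an illustration of the trade-off described above:

```go
package sketch

import (
	"fmt"
	"time"

	"github.com/btcsuite/btcd/wire"
)

// waitForReject illustrates step 4 above: after we've answered a peer's
// getdata with the full TX, we wait a short grace period (e.g. 1 second)
// for a possible reject. Silence is treated as acceptance.
func waitForReject(rejectChan <-chan *wire.MsgReject,
	graceTimeout time.Duration) error {

	select {
	case rej := <-rejectChan:
		// Step 4b: the peer explicitly rejected the TX.
		return fmt.Errorf("peer rejected tx (%v): %s", rej.Code,
			rej.Reason)

	case <-time.After(graceTimeout):
		// Step 4a: no reject within the grace period, assume the peer
		// accepted and relayed the TX.
		return nil
	}
}
```
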

@chappjc (Contributor) commented Feb 23, 2022

Thanks for describing that. It's all in line with my understanding of this particular query up until the last part where I was mistaken:

> 4. The peer now either accepts the TX and forwards it or rejects it.
>    a) If the peer accepts, we don't get any ACK or other response. After some time we just need to assume they're okay with it.
>    b) If the peer rejects, they send back a MsgReject and we know something's up.

This last bit is what I figured wasn't handled in this closure but rather by whatever is sending the tx, but it does make sense that it would need to be handled here to associate this reject with this inv message.

On 2(b), my observation is also no reaction if they already have it. I might expect a response in this already-have-it case, but the peers do stay quiet.

> But at least it should reduce the time for a normal, successful publish from 5 seconds to 1 second.

Grateful for this, and I tend to agree with your logic on why a timeout on any second response after handling the initial getdata can be more brief. It seems to me that receiving a getdata and then getting the outgoing tx message queued is the biggest hurdle since you are potentially querying a stale peer with your inv message, but after that you're pretty sure it's a live and responsive peer. 👍

Member:

> b) If the peer rejects, they send back a MsgReject and we know something's up.

FWIW, the latest bitcoind nodes won't send a reject message, so as a result we can end up in a more murky state where we don't know if they already have it or think it's invalid.

Member:

> This last bit is what I figured wasn't handled in this closure but rather by whatever is sending the tx, but it does make sense that it would need to be handled here to associate this reject with this inv message.

Yeah it ends up happening in this closure. This essentially hooks into every message received by the peer, with us only acting on certain messages.

This section of the code is actually using an older instance of this query code. The newer (re-designed) query interface is currently only used in the block manager: https://github.com/lightninglabs/neutrino/blob/master/blockmanager.go#L1052

We have some issues that track the process of moving over to the new interface uniformly.
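
For illustration, the "hooks into every message" pattern described above roughly has this shape: a callback is invoked for every message received from the peer and only reacts to the two types the broadcast query cares about. The names here (checkResponse, gotData, rejections) are placeholders, not the exact neutrino code:

```go
package sketch

import "github.com/btcsuite/btcd/wire"

// checkResponse sketches the per-peer hook: it sees every message the
// peer sends and only acts on getdata (peer wants the full TX) and reject
// (peer declared the TX invalid). Everything else is ignored.
func checkResponse(resp wire.Message, gotData chan<- struct{},
	rejections chan<- *wire.MsgReject) {

	switch m := resp.(type) {
	case *wire.MsgGetData:
		// The peer asked for the full TX, signal the caller so the
		// MsgTx can be queued for sending.
		select {
		case gotData <- struct{}{}:
		default:
		}

	case *wire.MsgReject:
		// The peer rejected the TX, record it for the final tally.
		select {
		case rejections <- m:
		default:
		}
	}
}
```
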

@Roasbeef (Member) left a comment

Yay integration tests for catching this issue!

Did an initial review of the patch set. The one thing that comes to mind is that, given that most of the network is bitcoind and the newer versions never send a reject message, this may not make much of a difference in practice. Would be interesting to check (based on advertised versions) the % of nodes that actually send a reject message.

In the end, I wager most neutrino nodes are connected to at least one btcd node, given the lineage.

```go
// We can't really get here unless something is totally wrong in
// the above error mapping code.
return fmt.Errorf("invalid error mapping")

// We did get some rejections, but not from all peers. Perhaps that's
```
Member:

This is intended to not return an error if, say, only one of our peers rejects the transaction (since we wait longer now)?

@guggero (Contributor, Author) commented:

This is intended to return an error in slightly more cases. Before, we'd only return an error if all the peers responded with an error. Now we also return an error if only 50% of the peers reject.

That doesn't handle the case where only one peer rejects (the case where we're connected to a bunch of bitcoind nodes and just one btcd). Not sure what to do there, since we also don't want to open ourselves up to a malicious peer always rejecting and us failing because of that.
Perhaps we need to take an even closer look at the type of error we get? Like if we get one "input already spent" reject, we give that 100% weight, while for other rejects we use a threshold?
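
Purely as an illustration of the weighting idea floated here (the names, the 60% default taken from the commit message, and the assumption that btcd signals a mempool double spend with wire.RejectDuplicate are all mine, not the code as merged):

```go
package sketch

import "github.com/btcsuite/btcd/wire"

// rejectThreshold is the share of responding peers that must reject the
// TX before we treat the broadcast as failed (60% per the commit
// message).
const rejectThreshold = 0.6

// broadcastFailed sketches the suggested weighting: a single double-spend
// style reject is decisive, while other reject codes only count towards
// the percentage threshold.
func broadcastFailed(numResponding int, rejections []*wire.MsgReject) bool {
	if numResponding == 0 {
		return false
	}

	for _, rej := range rejections {
		// One "input already spent" style reject is enough: a
		// conflicting input means this TX can never confirm.
		if rej.Code == wire.RejectDuplicate {
			return true
		}
	}

	// Otherwise require a configurable share of the responding peers
	// to have rejected the TX.
	return float64(len(rejections))/float64(numResponding) >= rejectThreshold
}
```
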

Member:

> Not sure what to do there, since we also don't want to open ourselves up to a malicious peer always rejecting and us failing because of that.

Yeah, there's not really a great solution here... which I think was part of the rationale that the bitcoind devs used.

Looking at the type of error would certainly help here. Ultimately, I think our rebroadcaster would help to fill the gap here: as long as we keep rebroadcasting until the transaction has been confirmed, then things should eventually be mined.

@lightninglabs-deploy

@ellemouton: review reminder
@guggero, remember to re-request review from reviewers when ready

@guggero requested a review from @Roasbeef on April 20, 2022 09:04
@Roasbeef (Member) left a comment

LGTM ⛩

guggero added 2 commits April 21, 2022 09:43
To make sure we don't ignore any incoming reject messages after
answering an INV call, we give a responsive peer another second to
respond before we tally up the responses.

If we didn't get a reject message from every peer that replied to our
INV message, we previously didn't look at the responses at all. With
this commit we check whether at least 60% (by default) of our responding
peers declared the TX as invalid. If that's the case, we return the
error, since the transaction will probably never be accepted in the
future.