
Conversation

@guggero (Contributor) commented Feb 23, 2022

This is a follow-up PR to fix an issue that #240 seems to have introduced.
When we close the peerQuit channel after receiving an INV message, we don't wait for any reject messages from that peer.

Instead we want to wait a bit longer after an INV message to see whether they respond with a reject message. Otherwise we would never detect a double-spend TX with Neutrino (this was found through a unit test in lnd: lightningnetwork/lnd#6285 (comment)).

Not sure if this proposed solution is any better; I'm definitely open to suggestions, as I'm not too familiar with this code base.

A peer might send an INV (MsgGetData) message multiple times, screwing
up our replies count and causing us to not look at the reject
messages.
By de-duplicating the replies and reject messages, we make sure we
count them correctly.
query.go Outdated
Comment on lines 1109 to 1129
```go
// Peers might send the INV
// request multiple times, we
// need to de-duplicate them
// using a map.
replyChan := make(chan struct{})
replies[sp.ID()] = replyChan

// Okay, so the peer responded
// with an INV message, and we
// sent out the TX. If we never
// hear anything back from the
// peer it means they accepted
// the tx. If we get a reject,
// things are clear as well. But
// what if they are just slow to
// respond? We'll give them
// another bit of time to
// respond.
closeEventually(
	peerQuit, replyChan,
)
```
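
For readers not familiar with the helper referenced above, here is a minimal sketch of what closeEventually could look like, assuming it just delays closing the peer's quit channel until either the reply channel fires (e.g. a reject came in) or a short grace period (something like the QueryRejectTimeout discussed below) elapses. The signature and the rejectTimeout parameter are assumptions for illustration, not the actual implementation in this PR:

```go
package sketch

import "time"

// closeEventuallySketch waits for either a reply from the peer or for the
// grace period to run out before closing the quit channel, so a late
// reject message still gets a chance to be processed. Purely
// illustrative; the real closeEventually in this PR may differ.
func closeEventuallySketch(quit chan struct{}, reply <-chan struct{},
	rejectTimeout time.Duration) {

	go func() {
		select {
		case <-reply:
			// The peer sent something back (e.g. a reject).
		case <-time.After(rejectTimeout):
			// No further response within the grace period.
		}

		// Only now stop waiting on this peer.
		close(quit)
	}()
}
```
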
@chappjc (Contributor) commented Feb 23, 2022

Hmm, sorry if I broke this. I must have been mistaken in thinking that our inv request would be completed after receiving a getdata response (this case) and handling it by queuing the msgtx send.

Is the issue that after this first getdata response case handles it, there may be a subsequent reject that should be handled by this same closure? My (apparently bad) assumption was that further responses to the tx data were handled elsewhere, since that tx send was queued up for asynchronous sending.

Looks like the QueryRejectTimeout is meant to handle any such second response following handling of the first getdata response? 1 second is certainly better than the full 5 second broadcast timeout that was always hit before! What you're doing with replyChan makes sense in that context.

Thanks for the proper resolution to the broadcast timeout issue I was trying to solve. :)

@guggero (Contributor, Author) commented:

> Hmm, sorry if I broke this.

No problem. I guess that's what our unit and integration tests in lnd are there for.

The actual problem is that publishing a TX over the p2p network isn't a synchronous process.
What happens is the following:

  1. The Neutrino client sends out an INV message with just the TXID of the so far unpublished TX ("hey y'all, have you ever heard of this TX?").
  2. Every peer will look at their mempool and, if they don't find it, they'll answer back with a GetData message ("this is news to me, mind sending over that full TX?").
    a) Now it could be that the peer just doesn't respond in time. In that case we just ignore that peer.
    b) It could also be that the peer already knows of the TX (for example we published it before). In that case it won't react at all (I think).
  3. We send the full TX out to each peer that requested it as an MsgTx ("here it is, plz forward!").
  4. The peer now either accepts the TX and forwards it or rejects it.
    a) If the peer accepts, we don't get any ACK or other response. After some time we just need to assume they're okay with it.
    b) If the peer rejects, they send back a MsgReject and we know something's up.

From what I understand, all of this happens in the closure above. I'm not 100% sure my solution works in all cases. But at least it should reduce the time for a normal, successful publish from 5 seconds to 1 second.

Why 1 second and not 5 as before?
The assumption here is that if a peer responds quickly (e.g. before the overall timeout), it should respond with a reject reasonably quickly too. If a peer fails to respond within a second, we still have the threshold detection to recognize invalid transactions.
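
To make step 4 and the shortened timeout concrete, here is a rough sketch of that decision in isolation, using the btcsuite wire types. The function and parameter names (waitForReject, rejectChan, graceTimeout) are hypothetical and this is not the actual neutrino code, just an illustration of the trade-off described above:

```go
package sketch

import (
	"fmt"
	"time"

	"github.com/btcsuite/btcd/wire"
)

// waitForReject illustrates step 4 above: after we've answered a peer's
// getdata with the full TX, we wait a short grace period (e.g. 1 second)
// for a possible reject. Silence is treated as acceptance.
func waitForReject(rejectChan <-chan *wire.MsgReject,
	graceTimeout time.Duration) error {

	select {
	case rej := <-rejectChan:
		// Step 4b: the peer explicitly rejected the TX.
		return fmt.Errorf("peer rejected tx (%v): %s", rej.Code,
			rej.Reason)

	case <-time.After(graceTimeout):
		// Step 4a: no reject within the grace period, assume the peer
		// accepted and relayed the TX.
		return nil
	}
}
```
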

@chappjc (Contributor) commented Feb 23, 2022

Thanks for describing that. It's all in line with my understanding of this particular query up until the last part where I was mistaken:

> 4. The peer now either accepts the TX and forwards it or rejects it.
>    a) If the peer accepts, we don't get any ACK or other response. After some time we just need to assume they're okay with it.
>    b) If the peer rejects, they send back a MsgReject and we know something's up.

This last bit is what I figured wasn't handled in this closure but rather by whatever is sending the tx, but it does make sense that it would need to be handled here to associate this reject with this inv message.

On 2(b), my observation is also no reaction if they already have it. I might expect a response in this already-have-it case, but the peers do stay quiet.

> But at least it should reduce the time for a normal, successful publish from 5 seconds to 1 second.

Grateful for this, and I tend to agree with your logic on why a timeout on any second response after handling the initial getdata can be more brief. It seems to me that receiving a getdata and then getting the outgoing tx message queued is the biggest hurdle since you are potentially querying a stale peer with your inv message, but after that you're pretty sure it's a live and responsive peer. 👍

Member:

> b) If the peer rejects, they send back a MsgReject and we know something's up.

FWIW, the latest bitcoind nodes won't send a reject message, so as a result we can end up in a more murky state where we don't know if they already have it or think it's invalid.

Member:

> This last bit is what I figured wasn't handled in this closure but rather by whatever is sending the tx, but it does make sense that it would need to be handled here to associate this reject with this inv message.

Yeah it ends up happening in this closure. This essentially hooks into every message received by the peer, with us only acting on certain messages.

This section of the code is actually using an older instance of this query code. The newer (re-designed) query interface is currently only used in the block manager: https://github.com/lightninglabs/neutrino/blob/master/blockmanager.go#L1052

We have some issues that track the process of moving over to the new interface uniformly.
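
For illustration, the "hooks into every message" pattern described above roughly has this shape: a callback is invoked for every message received from the peer and only reacts to the two types the broadcast query cares about. The names here (checkResponse, gotData, rejections) are placeholders, not the exact neutrino code:

```go
package sketch

import "github.com/btcsuite/btcd/wire"

// checkResponse sketches the per-peer hook: it sees every message the
// peer sends and only acts on getdata (peer wants the full TX) and reject
// (peer declared the TX invalid). Everything else is ignored.
func checkResponse(resp wire.Message, gotData chan<- struct{},
	rejections chan<- *wire.MsgReject) {

	switch m := resp.(type) {
	case *wire.MsgGetData:
		// The peer asked for the full TX, signal the caller so the
		// MsgTx can be queued for sending.
		select {
		case gotData <- struct{}{}:
		default:
		}

	case *wire.MsgReject:
		// The peer rejected the TX, record it for the final tally.
		select {
		case rejections <- m:
		default:
		}
	}
}
```
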

@Roasbeef (Member) left a comment

Yay integration tests for catching this issue!

Did an initial review of the patch set. The one thing that comes to mind is that, given that most of the network is bitcoind and the newer versions never send a reject message, this may not make much of a difference in practice. Would be interesting to check (based on advertised versions) the % of nodes that actually send a reject message.

In the end, I wager most neutrino nodes are connected to at least one btcd node, given the lineage.

```go
// We can't really get here unless something is totally wrong in
// the above error mapping code.
return fmt.Errorf("invalid error mapping")

// We did get some rejections, but not from all peers. Perhaps that's
```
Member:

This is intended to not return an error if, say, only one of our peers rejects the transaction (since we wait longer now)?

@guggero (Contributor, Author) commented:

This is intended to return an error in slightly more cases. Before, we'd only return an error if all the peers responded with an error. Now we also return an error if only 50% of the peers reject.

That doesn't handle the case where only one peer rejects (the case where we're connected to a bunch of bitcoind nodes and just one btcd). Not sure what to do there, since we also don't want to open ourselves up to a malicious peer always rejecting and us failing because of that.
Perhaps we need to take an even closer look at the type of error we get? Like if we get one "input already spent" reject, we give that 100% weight, while for other rejects we use a threshold?
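
Purely as an illustration of the weighting idea floated here (the names, the 60% default taken from the commit message, and the assumption that btcd signals a mempool double spend with wire.RejectDuplicate are all mine, not the code as merged):

```go
package sketch

import "github.com/btcsuite/btcd/wire"

// rejectThreshold is the share of responding peers that must reject the
// TX before we treat the broadcast as failed (60% per the commit
// message).
const rejectThreshold = 0.6

// broadcastFailed sketches the suggested weighting: a single double-spend
// style reject is decisive, while other reject codes only count towards
// the percentage threshold.
func broadcastFailed(numResponding int, rejections []*wire.MsgReject) bool {
	if numResponding == 0 {
		return false
	}

	for _, rej := range rejections {
		// One "input already spent" style reject is enough: a
		// conflicting input means this TX can never confirm.
		if rej.Code == wire.RejectDuplicate {
			return true
		}
	}

	// Otherwise require a configurable share of the responding peers
	// to have rejected the TX.
	return float64(len(rejections))/float64(numResponding) >= rejectThreshold
}
```
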

Member:

> Not sure what to do there, since we also don't want to open ourselves up to a malicious peer always rejecting and us failing because of that.

Yeah, there's not really a great solution here... which I think was part of the rationale that the bitcoind devs used.

Looking at the type of error would certainly help here. Ultimately, I think our rebroadcaster would help to fill the gap here: as long as we keep rebroadcasting until the transaction has been confirmed, then things should eventually be mined.

@lightninglabs-deploy

@ellemouton: review reminder
@guggero, remember to re-request review from reviewers when ready

@guggero requested a review from @Roasbeef on April 20, 2022 09:04
@Roasbeef (Member) left a comment

LGTM ⛩

guggero added 2 commits April 21, 2022 09:43
To make sure we don't ignore any incoming reject messages after
answering an INV call, we give a responsive peer another second to
respond before we tally up the responses.

If we didn't get a reject message from every peer that replied to our
INV message, we previously didn't look at the responses at all. With
this commit we check whether at least 60% (by default) of our responding
peers declared the TX as invalid. If that's the case, we return the
error, since the transaction will probably never be accepted in the
future.