Fix TX reject messages not being counted anymore #247
Conversation
A peer might send an INV (MsgGetData) message multiple times, throwing off our reply count and causing us to never look at the reject messages. By de-duplicating the replies and reject messages, we make sure we count them correctly.
query.go
Outdated
```go
// Peers might send the INV
// request multiple times, we
// need to de-duplicate them
// using a map.
replyChan := make(chan struct{})
replies[sp.ID()] = replyChan
```

```go
// Okay, so the peer responded
// with an INV message, and we
// sent out the TX. If we never
// hear anything back from the
// peer it means they accepted
// the tx. If we get a reject,
// things are clear as well. But
// what if they are just slow to
// respond? We'll give them
// another bit of time to
// respond.
closeEventually(
	peerQuit, replyChan,
)
```
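For context, here is a minimal sketch of what a helper like closeEventually might do, as I read it from this thread: delay closing the peer's quit channel so a late reject still gets seen. The exact signature, the goroutine layout and the one-second QueryRejectTimeout value are assumptions, not the actual neutrino code.

```go
package sketch

import "time"

// QueryRejectTimeout is the extra grace period discussed in this PR. The
// one-second value is taken from the conversation and may differ in the
// real code.
const QueryRejectTimeout = time.Second

// closeEventually keeps the peer's quit channel open a little longer so a
// late MsgReject can still reach the response closure, then shuts the
// per-peer subscription down.
func closeEventually(peerQuit, replyChan chan struct{}) {
	go func() {
		select {
		case <-replyChan:
			// The peer already sent a final reply (e.g. a reject),
			// so there is nothing left to wait for.
		case <-time.After(QueryRejectTimeout):
			// No reject arrived within the grace period; assume
			// the peer silently accepted the transaction.
		}
		close(peerQuit)
	}()
}
```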
Hmm, sorry if I broke this. I must have been mistaken in thinking that our inv request would be completed after receiving a getdata response (this case) and handling it by queuing the msgtx send.
Is the issue that after this first getdata response case handles it, there may be a subsequent reject that should be handled by this same closure? My (apparently bad) assumption was that further responses to the tx data were handled elsewhere, since that tx send was queued up for asynchronous sending.
Looks like the QueryRejectTimeout is meant to handle any such second response following the handling of the first getdata response? 1 second is certainly better than the full 5 second broadcast timeout that was always hit before! What you're doing with replyChan makes sense in that context.
Thanks for the proper resolution to the broadcast timeout issue I was trying to solve. :)
Hmm, sorry if I broke this.
No problem. I guess that's what our unit and integration tests in lnd are there for.
The actual problem is that publishing a TX over the p2p network isn't a synchronous process.
What happens is the following:
1. The Neutrino client sends out an INV message with just the TXID of the so far unpublished TX ("hey y'all, have you ever heard of this TX?").
2. Every peer will look at their mempool and, if they don't find it, answer back with a GetData message ("this is news to me, mind sending over that full TX?").
   a) Now it could be that the peer just doesn't respond in time. In that case we just ignore that peer.
   b) It could also be that the peer already knows of the TX (for example, we published it before). In that case it won't react at all (I think).
3. We send the full TX out to each peer that requested it as an MsgTx ("here it is, plz forward!").
4. The peer now either accepts the TX and forwards it or rejects it.
   a) If the peer accepts, we don't get any ACK or other response. After some time we just need to assume they're okay with it.
   b) If the peer rejects, they send back a MsgReject and we know something's up.
From what I understand, all of this happens in the closure above. I'm not 100% sure my solution works in all cases. But at least it should reduce the time for a normal, successful publish from 5 seconds to 1 second.
Why 1 second and not 5 as before?
The assumption here is that if a peer responds quickly (e.g. before the overall timeout), it should respond with a reject reasonably quickly too. If a peer fails to respond within a second, we still have the threshold detection to recognize invalid transactions.
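For illustration, the four steps above could map onto the per-peer response closure roughly as sketched below, building on the closeEventually sketch earlier in this thread. Everything here (newCheckResponse, sendToPeer, the replies and rejections maps, the peer ID plumbing) is made up for the example and is not the actual neutrino code.

```go
package sketch

import (
	"sync"

	"github.com/btcsuite/btcd/wire"
)

// newCheckResponse builds the closure that is invoked for every message a
// peer sends us while the broadcast query is running. Only two message
// types matter here; everything else is ignored.
func newCheckResponse(tx *wire.MsgTx, sendToPeer func(peerID int32, msg wire.Message),
	replies map[int32]chan struct{}, rejections map[int32]*wire.MsgReject,
	mtx *sync.Mutex) func(peerID int32, resp wire.Message, peerQuit chan struct{}) {

	txHash := tx.TxHash()

	return func(peerID int32, resp wire.Message, peerQuit chan struct{}) {
		mtx.Lock()
		defer mtx.Unlock()

		switch m := resp.(type) {
		// Step 2: the peer asked for the full transaction. Peers
		// might ask more than once, so the reply is de-duplicated
		// via the replies map before we queue the MsgTx (step 3).
		case *wire.MsgGetData:
			for _, inv := range m.InvList {
				if inv.Hash != txHash {
					continue
				}
				if _, ok := replies[peerID]; ok {
					break
				}
				replyChan := make(chan struct{})
				replies[peerID] = replyChan
				sendToPeer(peerID, tx)

				// Step 4: give the peer a bit more time to
				// come back with a reject before we stop
				// listening (see closeEventually above).
				closeEventually(peerQuit, replyChan)
				break
			}

		// Step 4b: the peer rejected the transaction. Record the
		// reject once and unblock the delayed close.
		case *wire.MsgReject:
			if m.Hash != txHash {
				return
			}
			if _, ok := rejections[peerID]; !ok {
				rejections[peerID] = m
				if replyChan, ok := replies[peerID]; ok {
					close(replyChan)
				}
			}

		// Pings, invs for other data, etc. are simply ignored.
		default:
		}
	}
}
```

The two important bits are that the MsgGetData reply is de-duplicated through the replies map, and that a MsgReject unblocks the delayed close, so it is counted before the peer subscription is torn down.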
Thanks for describing that. It's all in line with my understanding of this particular query up until the last part where I was mistaken:
4. The peer now either accepts the TX and forwards it or rejects it.
a) If the peer accepts, we don't get any ACK or other response. After some time we just need to assume they're okay with it
b) If the peer rejects, they send back a MsgReject and we know something's up
This last bit is what I figured wasn't handled in this closure but rather by whatever is sending the tx, but it does make sense that it would need to be, in order to associate this reject with this inv message.
On 2(b), my observation is also that there's no reaction if they already have it. I might expect a response in this already-have-it case, but the peers do stay quiet.
But at least it should reduce the time for a normal, successful publish from 5 seconds to 1 second.
Grateful for this, and I tend to agree with your logic on why a timeout on any second response after handling the initial getdata can be more brief. It seems to me that receiving a getdata and then getting the outgoing tx message queued is the biggest hurdle, since you are potentially querying a stale peer with your inv message, but after that you're pretty sure it's a live and responsive peer. 👍
b) If the peer rejects, they send back a MsgReject and we know something's up.
FWIW, the latest bitcoind nodes won't send a reject message, so as a result we can end up in a more murky state where we don't know if they already have it or think it's invalid.
This last bit is what I figured wasn't handled in this closure but rather by whatever is sending the tx, but it does make sense that it would need to be, in order to associate this reject with this inv message.
Yeah it ends up happening in this closure. This essentially hooks into every message received by the peer, with us only acting on certain messages.
This section of the code is actually using an older instance of this query code. The newer (re-designed) query interface is currently only used in the block manager: https://github.com/lightninglabs/neutrino/blob/master/blockmanager.go#L1052
We have some issues that track the process of moving over to the new interface uniformly.
Yay integration tests for catching this issue!
Did an initial review of the patch set. The one thing that comes to mind is that, given that most of the network is bitcoind and the newer versions never send a reject message, this may not make much of a difference in practice. Would be interesting to check (based on advertised versions) the % of nodes that actually send a reject message.
In the end, I wager most neutrino nodes are connected to at least one btcd node, given the lineage.
```go
// We can't really get here unless something is totally wrong in
// the above error mapping code.
return fmt.Errorf("invalid error mapping")
```

```go
// We did get some rejections, but not from all peers. Perhaps that's
```
Is this intended to not return an error if, say, only one of our peers rejects the transaction (since we wait longer now)?
This is intended to return an error in slightly more cases. Before, we'd only return an error if all the peers responded with an error. Now we also return an error if only 50% of the peers reject.
That doesn't handle the case where only one peer rejects (the case where we're connected to a bunch of bitcoind nodes and just one btcd). Not sure what to do there, since we also don't want to open ourselves up to a malicious peer always rejecting and us failing because of that.
Perhaps we need to take an even closer look at the type of error we get? Like if we get one "input already spent" reject, we give that 100% weight, whereas for other rejects we use a threshold?
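To make the threshold idea concrete, here is a sketch of the tally step. The rejectThreshold value (60% is the default mentioned in the commit message below), the function name and the parameters are illustrative, and the "give certain reject codes full weight" idea from the previous comment is deliberately left out.

```go
package sketch

import (
	"fmt"

	"github.com/btcsuite/btcd/wire"
)

// rejectThreshold is the fraction of responding peers that must reject the
// transaction before we treat it as invalid. The value is illustrative.
const rejectThreshold = 0.6

// evaluateRejections tallies the rejects collected by the response closure:
// if enough of the peers that actually replied to our INV sent a MsgReject,
// we surface that as an error, otherwise we assume the broadcast went
// through.
func evaluateRejections(numReplied int, rejections map[int32]*wire.MsgReject) error {
	if numReplied == 0 || len(rejections) == 0 {
		// Nobody asked for the TX or nobody complained; nothing we
		// can conclude beyond "no peer told us it's invalid".
		return nil
	}

	rejectRatio := float64(len(rejections)) / float64(numReplied)
	if rejectRatio < rejectThreshold {
		// Only a minority rejected; this could just be a single
		// out-of-date or misbehaving peer, so don't fail the publish.
		return nil
	}

	// Enough peers rejected that the TX will likely never confirm.
	// Surface one of the reject reasons to the caller.
	for _, r := range rejections {
		return fmt.Errorf("transaction rejected by %d of %d responding "+
			"peers: %v (%s)", len(rejections), numReplied, r.Code,
			r.Reason)
	}
	return nil
}
```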
Not sure what to do there, since we also don't want to open ourselves up to a malicious peer always rejecting and us failing because of that.
Yeah, there's not really a great solution here... which I think was part of the rationale the bitcoind devs used.
Looking at the type of error would certainly help. Ultimately, I think our rebroadcaster would help fill the gap here: as long as we keep rebroadcasting until the transaction has been confirmed, things should eventually be mined.
@ellemouton: review reminder
Force-pushed 87789b0 to b3b8e1f
LGTM ⛩
To make sure we don't ignore any incoming reject messages after answering an INV call, we give a responsive peer another second of time to respond before we tally up the responses.
If we didn't get a reject message from every peer that replied to our INV call, we didn't look at the responses at all. With this commit we check whether at least 60% (by default) of our responding peers declared the TX as invalid. If that's the case, we return the error, since the transaction will probably never be accepted in the future.
Force-pushed b3b8e1f to 2e3518d
This is a follow-up PR to fix an issue that #240 seems to have introduced.
When we close the peerQuit channel after receiving an INV message, we don't wait for any reject messages from that peer. Instead we want to wait a bit longer after an INV message to make sure they don't respond with a reject message. Otherwise we would never detect a double-spend TX with Neutrino (this was found through a unit test in lnd: lightningnetwork/lnd#6285 (comment)).
Not sure if this proposed solution is any better; I'm definitely open to suggestions as I'm not too familiar with this code base.