persist IBD latch across restarts #21106
Conversation
Force-pushed from 516a106 to 22c27cc
I think this is essentially saying that if the node can't get out of IBD within two days, then just pretend we've completed IBD anyway. I don't think that's the right behaviour - if we've stalled and aren't making any progress towards getting out of IBD, we'd be better off warning the user very loudly and then perhaps shutting down.
Wouldn't that mean there hasn't been a block in 24 hours? In that case, waiting another two days for the next block doesn't seem like a fallback that will be needed.
@MarcoFalke No, it doesn't; it just means the listening nodes haven't seen the block yet.
Force-pushed from 22c27cc to 99283bf
If the listening nodes haven't seen a block from 24 hours ago, something is going wrong and needs human attention to fix manually. An automated fallback that kicks in after two days doesn't seem needed when manual intervention is required anyway.
I fully agree that the node should try to recover (to avoid manual intervention as much as possible) when it's both reasonable and safe to do so. Though in this particular case, the users who are eager enough to corrupt their datadir won't be patient enough to wait three full days for this workaround to take effect. And similarly, the patient users won't see the benefit either, because some eager users will have successfully poked their node on the first day of issues.
Interesting scenario. I wonder if latching IBD to false is a sufficiently fine-grained way to do this; maybe a separate timeout latch could be added, as you mention: "The listening nodes need to request headers from inbound connections, either by exiting IBD or special-casing that limit while in IBD." I'm not sure about the default -ibdtimeout value of 48 hours (which seems long), what other durations users would want to provide, or what guidance could be given to users. This probably needs a functional test if it goes forward.
Yes.
If this were a boolean option (defaulting to false), it would be easier to review than this time-based threshold. Obviously, it would also lose the node-fixes-itself-after-three-days feature. Though I am still sceptical of its use: it assumes that all of your connections (inbound, as well as block-feeler, outbound block-relay, ...) are out of IBD for three days without a single one of them setting the flag manually. (Edit: I wrote this before the previous comment, so it is not a reply to that)
@MarcoFalke In IBD we do not request headers from inbound peers - literally only from a single outbound peer, with a timeout. A setting to disable IBD for your own node might fix your node if it is listening and receiving inbound connections from nodes which are synced with miners; relying on that would be extremely fragile. Even if you were able to fix your own node, you couldn't then heal the rest of the network without other nodes connecting to you as their single initial sync peer, the odds of which are extremely low. The timeout prevents the network from being stuck in this state, with a bunch of nodes in IBD, for more than 48 hours. I'm assuming that the vast majority of listening nodes are not actively maintained and have no wallet, but are rather used as a kind of gateway, so being down for 48 hours is a nuisance to the rest of the network but is probably not going to directly cause their operators to take action.
@gmaxwell Yes, just writing the latch to disk is probably the best solution.
Yes, if we can avoid using time, that would be a real win. Easier to test, too.
Force-pushed from 6f4e5dd to cf742ec
Force-pushed from f02469b to 4da5396
This prevents a network-wide failure in the event that all or most listening nodes are restarted at approximately the same time.
Force-pushed from 4da5396 to a3e713a
The CI failures are all random zmq issues.
Concept ACK
@@ -1294,15 +1295,23 @@ void CChainState::InitCoinsCache(size_t cache_size_bytes)
// `const` so that `CValidationInterface` clients (which are given a `const CChainState*`)
// can call it.
//
// The file .ibd_complete is used to latch the datadir to IBD false across restarts
If this moves forward, noting that a file description should be added to doc/files.md (alternatively, perhaps persist the latch in settings.json, which is also chain-specific).
I did this as a separate file so that users can easily add the file themselves as an override; with GetDataDir() it should be chain-specific, no?
Yes, I think they should both be.
Users can easily override settings.json as well - entries there are just interpreted as command-line options. So something like --ibd_complete=0/1 would work.
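As a rough illustration (not code from this PR, and using the hypothetical -ibd_complete option name from the comment above): a value persisted in settings.json would be read through ArgsManager exactly like a command-line option, so a single lookup covers both sources.

```cpp
// Hypothetical sketch: a persisted {"ibd_complete": true} in settings.json is
// merged into ArgsManager just like -ibd_complete=1 on the command line.
// The option name and helper function are illustrative, not part of this PR.
#include <util/system.h>

bool IbdLatchedComplete()
{
    // Default to false so behaviour is unchanged unless the user (or the node
    // itself, after finishing IBD) has persisted the override.
    return gArgs.GetBoolArg("-ibd_complete", false);
}
```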
Is there anything I need to do here?
Convince other contributors that this is a good idea. Currently, if a node is switched off and then switched back on some time later, it'll start up in IBD until it catches up to the tip. This seems like good behaviour - we don't want to download transactions, relay our address, respond to getheaders, etc. until we've caught back up. If you want other contributors to ACK this PR, you'll need to convince them that changing that behaviour is an improvement and does not introduce any regressions/vulnerabilities/etc.
if (fs::exists(latch_filename)) {
    LogPrintf("Leaving InitialBlockDownload (latching to false)\n");
    m_cached_finished_ibd.store(true, std::memory_order_relaxed);
    return false;
This will set IBD to false when -reindex? If yes, that seems wrong.
Yes, it is wrong.
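For context, here is a minimal sketch of what the write side of such a file latch could look like, with a guard so it is not created during -reindex (the concern raised above). This is an illustration of the idea only, not the code in this PR, and the helper name is made up.

```cpp
// Illustrative sketch only: persist the "IBD finished" latch by touching a
// marker file in the datadir once IBD completes, and skip this while
// reindexing so a -reindex run cannot incorrectly latch IBD to false.
#include <fs.h>
#include <util/system.h>

void MaybePersistIbdLatch()
{
    if (gArgs.GetBoolArg("-reindex", false)) return; // never latch during reindex

    const fs::path latch_filename = GetDataDir() / ".ibd_complete";
    if (!fs::exists(latch_filename)) {
        fsbridge::ofstream file(latch_filename); // create an empty marker file
    }
}
```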
TL;DR: I think we should just drop the "don't respond to GETHEADERS while in IBD" behavior, rather than keeping the IsInitialBlockDownload() latch across restarts. Possibly we can also improve the find-another-peer-to-sync-headers-from logic (and, if that's the real problem, this PR doesn't currently address it). These are the things that IsInitialBlockDownload() affects (there are more, but these are probably the most important ones).
There is also a related behavior that is not controlled by IsInitialBlockDownload(), but by
I believe it is mostly (1) and (6) we're concerned about here? I don't actually know why (1) exists in the first place. Perhaps it's trying to minimize the time to get synced with the network before providing services to others? I think that only has a minimal effect, especially when combined with (2). It seems to me that just removing (1) would address most of the concern here. It seems possible to improve on (6) as well, but it's fundamentally a tradeoff between bandwidth and (a high probability of) fast header syncing. The headers-sync timeout is ~15 minutes though, which can be annoying, but is probably short enough that (6) doesn't actually matter for the scenario under consideration here. A way to improve (6) (especially useful when combined with removing (1)) would be to treat "less than 2000 headers received at once" while the last one doesn't get us out of IBD as "done syncing with this peer, pick another one". These avoid the need to permanently latch IBD to false, which seems like a very heavy hammer. Especially the transaction-related (4) and compact-block-related (5) behaviours we probably want to keep for nodes that were once out of IBD but have been offline for a substantial period of time. (3) does look fairly scary if nodes can actually get stuck in IBD - but it's also a sign of a deeper problem. I'm not sure if that warrants addressing beyond just trying to make sure we don't get stuck in IBD.
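A rough sketch of the "done syncing with this peer, pick another one" heuristic described above; the types and helpers here are stand-ins rather than actual net_processing code.

```cpp
// Illustrative pseudocode: if a peer sends a non-full HEADERS message
// (< 2000 entries) and its best header still leaves us in IBD, treat headers
// sync with that peer as finished and pick another sync peer instead of
// waiting for the ~15 minute headers-sync timeout.
#include <cstddef>

constexpr std::size_t MAX_HEADERS_RESULTS = 2000; // protocol limit per HEADERS message

struct Peer { bool headers_sync_done = false; };  // stand-in for the real peer state

bool StillInIbd();                // placeholder for IsInitialBlockDownload()
void PickNewHeadersSyncPeer();    // placeholder for the sync-peer selection logic

void OnHeadersMessage(Peer& peer, std::size_t headers_count)
{
    const bool full_batch = (headers_count == MAX_HEADERS_RESULTS);
    if (!full_batch && StillInIbd()) {
        // The peer has nothing more to offer and we are still behind: stop
        // treating it as our headers-sync peer and try a different one.
        peer.headers_sync_done = true;
        PickNewHeadersSyncPeer();
    }
}
```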
@pstratem So I suspect that what happened in $ALTCOIN is that the headers-fetching logic is approximately doing this:
I don't believe the current PR actually addresses this, because none of the logic above depends on IsInitialBlockDownload. If all reachable nodes on the network are actually in the "headers 24 hours behind" state, then making them IsIBD()==false would make them respond to headers requests, but never get close enough to $NOW to get the network unstuck. Here are a couple of alternative ideas (not all necessary, but probably most powerful if all 3 are done):
@sipa Originally the logic to not respond to GETHEADERS while in IBD was added because we relaxed the checkpoint logic so that we could accept a non-checkpointed chain during initial sync. If a node was in IBD and fed a bogus initial headers chain, then a peer on the checkpointed chain would ban us if we gave them a checkpoint-violating header. Since then, our logic has changed a lot, both around banning and with the addition of minimum chain work. An easy improvement would be to respond to getheaders messages as long as our tip is past the minimum chain work that we're configured with.
@pstratem My understanding of block relay is that the scenario I think you have in mind should not really exist, but perhaps I'm not understanding the scenario. Let's say the whole network is in IBD for some reason, because no blocks have been found in several days and then everyone restarts their software. At this point, initial headers sync with every peer is timing out every 20 minutes or whatever, causing some peer churn (this all sounds bad, but not yet disastrous). (1) Then a miner finds a block and submits it to their node. That node is not a listening node, but (2) with the new block it will leave IBD. It will also (3) announce that block (via INV) to all its peers, which will in turn immediately (4) send a getheaders to the node, which it will (5) respond to because it's out of IBD. That will (6) trigger block relay to all those peers via a getdata, which in turn will cause those nodes to (7) leave IBD and then (8) announce the block to their peers. Things should progress fine from there, in the same way, I think -- what do you think is breaking down exactly? I guess (1) could break down because miners have to manually cause their own node to leave IBD in order for
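A small sketch of the "respond to GETHEADERS once past minimum chain work" improvement suggested above. nMinimumChainWork and the chainstate members are the Bitcoin Core names from around this time; the helper itself is hypothetical.

```cpp
// Hypothetical helper: even while IsInitialBlockDownload() is still true, it
// should be safe to answer GETHEADERS once our tip has at least the configured
// minimum chain work, since such a headers chain cannot be produced cheaply.
#include <chain.h>
#include <validation.h>

bool ShouldServeHeaders(const CChainState& chainstate)
{
    const CBlockIndex* tip = chainstate.m_chain.Tip();
    if (tip == nullptr) return false;
    return tip->nChainWork >= nMinimumChainWork;
}
```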
Aha, on further investigation, I think (6) is where things break down: when a node is in IBD because its current tip is too old, if we learn of a new block header from an inbound peer, we will not fetch the block when we get the header via direct-fetch (because
My recollection is that we don't want to do full-on IBD from inbound peers, because users running bitcoind at home might not want to be used for some outbound peer's initial sync. Assuming we think that is still a reasonable concern, a simple fix might be to try fetching blocks via
(To be clear, I don't believe that the headers-sync behaviors with inbound peers are really a problem, because if an inbound peer is the only one finding new blocks, we will get their headers via our inv-processing behavior. I think this is just a concern around block fetching.)
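The exact condition sdaftuar refers to is elided above, but the general shape of the fix he sketches might be along these lines (purely illustrative placeholders, not net_processing code): while in IBD, still direct-fetch a block announced by an inbound peer if its header represents more work than our current tip.

```cpp
// Purely illustrative: a simplified stand-in for the direct-fetch decision.
// The idea is that an IBD node should still fetch a newly announced block from
// an inbound peer when the header carries more work than our current tip; the
// "connects to our chain" part of the real check is omitted for brevity.
#include <chain.h>

bool ShouldDirectFetchWhileInIbd(const CBlockIndex& announced_header, const CBlockIndex& tip)
{
    return announced_header.nChainWork > tip.nChainWork;
}
```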
Noting that I was operating a signet for a while which crashed, and by the time I rebooted it (life happens) it was a few months later. Now, after restarting, I think the IBD latch is keeping my seed/mining node in IBD since its tip is in the past, even though I'm producing new blocks. This also prevents the node from relaying the new blocks to peers. This patch might fix this (pathological) case. h/t @ajtowns for pointing this out.
Can this be closed now that #24171 was opened?
Staying in IBD while producing a block with a recent enough timestamp shouldn't be possible. It may only happen that
Going to close this for now, since there hasn't been any activity by the author for about a year after the author acknowledged that the current patch has a bug (#21106 (comment)). Discussion can continue and this can be reopened any time.
Once started, the signet miner defaults to calculating timestamps as "previous block time + 10 minutes" with optional randomisation, so if the node was down for weeks or more it could take a while to catch up with real time and hence leave the miner's node in IBD.
Ok, I see. So this pull could indeed make sense for signet. Alternatively, the signet mining script could be modified to mine a block with a more recent timestamp, or to somehow mine the "missing" blocks faster if it is important to have them. For mainnet, my understanding is that #24171 will be sufficient to address the issue, as explained in #21106 (comment) and the following comment.
It's just a default: you can run the miner with
In the unlikely event that all or most of the listening nodes are restarted
into Initial Block Download mode, the network might get stuck with few or
no listening nodes providing headers or blocks.
The timeout is intended to move nodes out of IBD so that they will request
and serve blocks and headers to inbound peers.
Edit: By latching IBD across restarts we prevent the network from getting into this state at all.