Conversation

pstratem
Contributor

@pstratem pstratem commented Feb 7, 2021

In the unlikely event that all or most of the listening nodes are restarted
into Initial Block Download mode, the network might get stuck with few or
no listening nodes providing headers or blocks.

The timeout is intended to move nodes out of IBD so they will request and
serve blocks and headers to inbound peers.

Edit: By latching IBD across restarts we prevent the network from getting into this state at all.
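For illustration, a minimal sketch of the latch-to-disk idea (the helper names here are only illustrative; the actual change keys off an empty .ibd_complete marker file in the datadir, as shown in the diff excerpt below):

    #include <filesystem>
    #include <fstream>

    namespace fs = std::filesystem;

    // Called the first time the node leaves IBD: drop an empty marker file in the datadir.
    void WriteIbdCompleteLatch(const fs::path& datadir)
    {
        std::ofstream marker(datadir / ".ibd_complete"); // the file's presence is the signal
    }

    // Called from IsInitialBlockDownload(): if the marker exists, report IBD as already finished.
    bool ReadIbdCompleteLatch(const fs::path& datadir)
    {
        return fs::exists(datadir / ".ibd_complete");
    }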

@pstratem pstratem force-pushed the 2021-02-07-isinitialblockdownload-timeout branch from 516a106 to 22c27cc on February 7, 2021 at 22:32
Contributor

@jnewbery jnewbery left a comment


I think this is essentially saying that if the node can't get out of IBD within two days, then just pretend we've completed IBD anyway. I don't think that's the right behaviour - if we've stalled and aren't making any progress towards getting out of IBD, we'd be better off warning the user very loudly and then perhaps shutting down.

@maflcko
Member

maflcko commented Feb 8, 2021

all or most of the listening nodes are restarted into Initial Block Download mode

Wouldn't that mean that there hasn't been a block in 24 hours? Then, waiting another two days for the next block doesn't seem like a fallback that will be needed.

@pstratem
Contributor Author

pstratem commented Feb 9, 2021

@MarcoFalke no it doesn't, just that the listening nodes haven't seen the block yet.

@pstratem pstratem force-pushed the 2021-02-07-isinitialblockdownload-timeout branch from 22c27cc to 99283bf on February 9, 2021 at 01:12
@maflcko
Member

maflcko commented Feb 9, 2021

If the listening nodes haven't seen a block from 24h ago, something is going wrong and needs human attention to fix up manually. Having an automated fallback that kicks in after two days doesn't seem necessary when manual intervention is needed anyway.

@maflcko
Member

maflcko commented Feb 10, 2021

I fully agree that the node should try to recover (to avoid manual intervention as much as possible) when it's both reasonable and safe to do so. Though in this particular case, the users that are eager enough to corrupt their datadir won't be patient enough to wait three full days for this workaround to take effect. And similarly, the patient users won't see the benefits either, because some eager users successfully poked their node on the first day of issues.

Member

@jonatack jonatack left a comment


Interesting scenario. I wonder if latching IBD to false is a sufficiently fine-grained way to do this; maybe a separate timeout latch could be added, as you mention, "The listening nodes need to request headers from inbound connections, either by exiting IBD or special casing that limit while in IBD." I'm not sure about the default -ibdtimeout value of 48 hours, which seems long, what other durations users would want to provide, or what guidance can be given to users. This probably needs a functional test if it goes forward.

@jonatack
Member

the tricky part is being equally confident that the various attacks that the IBD specific behaviour is intended to protect against are still protected against-- esp because no one reviewing this work may remember all of them

Yes.

@maflcko
Member

maflcko commented Feb 10, 2021

If this was a boolean option (defaulting to false), it would be easier to review than this time-based threshold. Obviously, it would also lose the node-fixes-itself-after-three-days feature. Though, I am still sceptical of its use. It assumes that all of your connections (inbound, as well as block-feeler, outbound block-relay, ...) are out of IBD for three days without a single one of them setting the flag manually.

(Edit: I wrote this before the previous comment, so it is not a reply to that)

@pstratem
Contributor Author

@MarcoFalke In IBD we do not request headers from inbound peers. Literally only from a single outbound peer with a timeout.

A setting to disable IBD for your own node might fix your node if your node is listening and receiving inbound connections from nodes which are sync'd with miners. Relying on that would be extremely fragile.

Even if you were able to fix your own node, you can't then heal the rest of the network without them connecting to you as the single initial sync peer, the odds of which are extremely low.

The timeout prevents the network from being in this state for more than 48 hours with a bunch of nodes stuck in IBD.

I'm assuming that the vast majority of listening nodes are not actively maintained and have no wallet, but are rather used as a kind of gateway, so being down for 48 hours is a nuisance to the rest of the network, but is probably not directly going to cause them to take action.

@pstratem
Contributor Author

@gmaxwell Yes just writing the latch to disk is probably the best solution.

@jonatack
Member

Yes, if we can avoid using time that would be a real win. Easier to test too.

@pstratem pstratem force-pushed the 2021-02-07-isinitialblockdownload-timeout branch 2 times, most recently from 6f4e5dd to cf742ec on February 10, 2021 at 21:15
@pstratem pstratem force-pushed the 2021-02-07-isinitialblockdownload-timeout branch 5 times, most recently from f02469b to 4da5396 on February 10, 2021 at 22:31
this prevents a network wide failure in the event that all or most listening
nodes are restarted at approximately the same time
@pstratem pstratem force-pushed the 2021-02-07-isinitialblockdownload-timeout branch from 4da5396 to a3e713a on February 11, 2021 at 01:20
@pstratem
Contributor Author

The CI failures are all random zmq issues.

@decryp2kanon
Contributor

Concept ACK

@pstratem pstratem changed the title from "add timeout to initial block download" to "persist IBD latch across restarts" Feb 12, 2021
@@ -1294,15 +1295,23 @@ void CChainState::InitCoinsCache(size_t cache_size_bytes)
// `const` so that `CValidationInterface` clients (which are given a `const CChainState*`)
// can call it.
//
// The file .ibd_complete is used to latch the datadir to IBD false across restarts
Member

@jonatack jonatack Feb 13, 2021


If this moves forward, noting that a file description should be added to doc/files.md (alternatively, perhaps persist the latch in settings.json, which is also chain-specific).

Contributor Author


I did this as a separate file so that users can easily add the file to override it. With GetDataDir() it should be chain-specific, no?

Member


Yes, I think they should both be.

Member


Users can easily override settings.json as well - they're just interpreted as command-line options.

So something like --ibd_complete=0/1 would work
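As a rough sketch (the -ibd_complete name is only the "something like" suggestion above, not an existing option): because settings.json entries go through the same ArgsManager path as command-line options, the latch could then be queried with the usual accessor:

    #include <util/system.h> // ArgsManager / gArgs

    bool IbdLatchedComplete()
    {
        // True if the user, or the node itself by persisting the setting once IBD
        // first completes, has recorded -ibd_complete=1; false otherwise.
        return gArgs.GetBoolArg("-ibd_complete", false);
    }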

@pstratem
Contributor Author

Is there anything I need to do here?

@jnewbery
Contributor

Is there anything I need to do here?

Convince other contributors that this is a good idea. Currently, if a node is switched off and then switched back on some time later, it'll start up in IBD until it catches up to the tip. This seems like good behaviour - we don't want to download transactions, relay our address, respond to getheaders, etc until we've caught back up. If you want other contributors to ACK this PR, you'll need to convince them that changing that behaviour is an improvement, and does not introduce any regressions/vulnerabilities/etc.

if (fs::exists(latch_filename)) {
LogPrintf("Leaving InitialBlockDownload (latching to false)\n");
m_cached_finished_ibd.store(true, std::memory_order_relaxed);
return false;
Member


This will set IBD to false when -reindex is set? If so, that seems wrong

Contributor Author


yes, it is wrong
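One way to avoid that (a rough sketch of the shape of the fix, not code from this branch) is to keep the existing importing/reindex checks ahead of the latch lookup, so -reindex still reports IBD:

    // Simplified sketch: the inputs are plain bools instead of the real globals.
    bool IsInitialBlockDownloadSketch(bool importing, bool reindexing, bool latch_file_exists)
    {
        if (importing || reindexing) return true; // reindex/import always count as IBD
        if (latch_file_exists) return false;      // otherwise honour the persisted latch
        // ... fall through to the usual minimum-chain-work and tip-age checks ...
        return true;
    }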

@sipa
Member

sipa commented Feb 22, 2021

TL;DR: I think we should just drop the "don't respond to GETHEADERS while in IBD" behavior, rather than keeping the IsInitialBlockDownload() latch across restarts. Possibly we can also improve the find-another-peer-to-sync-headers-from logic (and, if that's the real problem, this PR doesn't currently address it).

These are the things that IsInitialBlockDownload() affects (there are more, but these are probably the most important ones).

  1. We don't respond to GETHEADERS while in IBD.
  2. We don't announce our local address while in IBD.
  3. The getblocktemplate RPC doesn't work while in IBD.
  4. We don't request transactions in IBD.
  5. Various compact block related things are disabled in IBD.

There is also a related behavior that is not controlled by IsInitialBlockDownload(), but by pindexBestHeader->GetBlockTime() > GetAdjustedTime() - 24 * 60 * 60 instead:

  6. Only request headers from one peer at a time (however, if it times out, another peer will be selected).
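For concreteness, a standalone sketch of that condition (simplified names, not the actual net_processing code):

    #include <cstdint>

    constexpr int64_t MAX_BEST_HEADER_AGE = 24 * 60 * 60; // one day, in seconds

    // While the best known header is more than a day old, headers are requested from a
    // single sync peer at a time; once within a day of "now", requests can fan out.
    bool RestrictHeadersSyncToOnePeer(int64_t best_header_time, int64_t adjusted_time)
    {
        return best_header_time <= adjusted_time - MAX_BEST_HEADER_AGE;
    }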

I believe it is mostly (1) and (6) we're concerned about here?

I don't actually know why (1) exists in the first place. Perhaps it's trying to minimize time to be synced with the network first before providing services to others? I think that only has a minimal effect, especially when combined with (2). It seems to me that just removing (1) would address most of the concern here.

It seems possible to improve on (6) as well, but it's fundamentally a tradeoff between bandwidth and (high probability of) fast header syncing. The headers sync timeout is ~15 minutes though, which can be annoying, but is probably short enough that (6) doesn't actually matter for the scenario under consideration here.

A way to improve (6) (especially useful when combined with removing (1)) would be to treat a "less than 2000 headers received at once" while the last one doesn't get us out of IBD as "done syncing with this peer, pick another one".

These avoid the need to permanently latch to IBD, which seems like a very heavy hammer. Especially the transaction (4) and compact block related behaviours (5) we probably want to keep for nodes that were once out of IBD, but have been offline for a substantial period of time.

(3) does look fairly scary if nodes can actually get stuck in IBD - but it's also a sign of a deeper problem. I'm not sure if that warrants addressing beyond just trying to make sure we don't get stuck in IBD.

@sipa
Member

sipa commented Feb 27, 2021

@pstratem So I suspect that what happened in $ALTCOIN is the headers fetching logic is approximately doing this:

  • until the tip header is less than a day old, we pick a single peer to fetch headers from. If we have at least one outbound peer, only outbounds are considered.
  • headers fetching times out after 15 minutes, and after that a new peer is picked (but, in a normal situation, again an outbound one, and it'll never actually get to asking an inbound peer)

I don't believe the current PR actually addresses this, because none of the logic above is dependent on IsInitialBlockDownload. If actually all reachable nodes on the network are in "headers 24 hours behind" state, then making them IsIBD()==false would make them respond to GETHEADERS, but never with headers close enough to $NOW to get the network unstuck.

Here are a couple alternative ideas (not all necessary, but probably most powerful if all 3 are done):

  • When headers fetch reaches timeout, instead of disconnecting a peer, it is marked "failed headers sync", and another peer is tried. Perhaps such "failed headers sync" peers can be preferred for eviction/replacement.
  • If we can't respond to a GETHEADERS request (for whatever reason), just echo the reported tip back to the requester in a HEADERS. That has no they-might-ban-us risk, because clearly that tip is valid to them, and can be used as a "done sending you headers" signal (see next point).
  • If a HEADERS message arrives from the headers sync peer with less than MAX_HEADERS entries, which doesn't get us within 24h of $NOW, treat that as a "failed headers sync" as well. That means that headers sync peer cycling can go almost instantly, rather than needing 15-minute timeouts.
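As a standalone illustration of the third idea (assumed helper name; MAX_HEADERS_RESULTS is the existing 2000-header limit):

    #include <cstddef>
    #include <cstdint>

    constexpr size_t MAX_HEADERS_RESULTS = 2000;

    // A HEADERS message that is not full and still leaves our best header more than a
    // day behind "now" means this peer has nothing further for us: pick another one.
    bool HeadersSyncLooksDone(size_t headers_received, int64_t best_header_time, int64_t adjusted_time)
    {
        return headers_received < MAX_HEADERS_RESULTS
            && best_header_time <= adjusted_time - 24 * 60 * 60;
    }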

@sdaftuar
Member

sdaftuar commented Mar 4, 2021

@sipa Originally the logic was added to not respond to GETHEADERS while in IBD because we relaxed the checkpoint logic, so that we could accept a non-checkpointed chain during initial sync. If a node was in IBD and fed a bogus initial headers chain, then a peer that is on the checkpointed chain would ban us if we gave them a checkpoint-violating header.

Since then, our logic has changed a lot, both around banning and with the addition of minimum-chain-work. An easy improvement would be to respond to getheaders messages as long as our tip is past the minimum chain work that we're configured with.
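A standalone sketch of that relaxation (illustrative names, not the actual GETHEADERS handling; chain work is really a 256-bit quantity, simplified here):

    #include <cstdint>

    bool ShouldServeGetHeaders(uint64_t tip_chain_work, uint64_t minimum_chain_work)
    {
        // Once the tip has at least the configured minimum chain work, any headers we
        // serve are backed by enough work that handing out a bogus low-work chain (the
        // original reason for staying silent during IBD) is no longer a concern.
        return tip_chain_work >= minimum_chain_work;
    }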

@pstratem My understanding of block relay is that the scenario that I think you have in mind should not really exist, but perhaps I'm not understanding the scenario. Let's say the whole network is in IBD for some reason, because no blocks have been found in several days and then everyone restarts their software. At this point, initial headers sync with every peer is timing out every 20 minutes or whatever causing some peer churn (this all sounds bad, but not yet disastrous).

(1) Then a miner finds a block and submits it to their node. That node is not a listening node, but (2) with the new block it will leave IBD. It will also (3) announce that block (via INV) to all its peers, which will in turn immediately (4) send a getheaders to the node, which it will (5) respond to because it's out of IBD.

That will (6) trigger block relay to all those peers via a getdata, which in turn will cause those nodes to (7) leave IBD and then (8) announce the block to their peers. Things should progress fine from there, in the same way, I think -- what do you think is breaking down exactly?

I guess (1) could break down because miners have to manually cause their own node to leave IBD in order for getblocktemplate to work, via the -maxtipage option. But if you assume miners will quickly figure that out to not waste their hashpower, I think everything should work after that.

@sdaftuar
Member

sdaftuar commented Mar 4, 2021

That will (6) trigger block relay to all those peers via a getdata, which in turn will cause those nodes to (7) leave IBD and then (8) announce the block to their peers. Things should progress fine from there, in the same way, I think -- what do you think is breaking down exactly?

Aha, on further investigation, I think (6) is where things break down: when a node is in IBD because its current tip is too old, if we learn of a new block header from an inbound peer, we will not fetch the block when we get the header via direct-fetch (because CanDirectFetch will return false due to our old tip), and we won't fetch the block through parallel block fetch (via FindNextBlocksToDownload) if we're in IBD and we have any outbound peers. I was able to replicate this behavior in a simple regtest setting.

My recollection is that we don't want to do full-on IBD from inbound peers because users running bitcoind at home might not want to be used for some outbound peer's initial sync. Assuming we think that is still a reasonable concern, a simple fix might be to try fetching blocks via FindNextBlocksToDownload from inbound peers if we have no blocks in flight, as presumably that means our outbound peers have nothing for us, so we might as well treat that scenario the same as if we have no outbounds at all.
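Sketching the suggested condition in isolation (illustrative only, not the actual FindNextBlocksToDownload logic):

    // Whether a given peer may be used for parallel block download under the suggested
    // tweak: inbound peers become eligible during IBD once outbound peers have given us
    // nothing to fetch.
    bool MayDownloadBlocksFromPeer(bool peer_is_inbound, bool in_ibd,
                                   bool have_outbound_peers, int blocks_in_flight)
    {
        if (!peer_is_inbound) return true;      // outbound peers are always eligible
        if (!in_ibd) return true;               // outside IBD, inbound peers may serve blocks too
        if (!have_outbound_peers) return true;  // no outbounds at all: use inbounds, as today
        return blocks_in_flight == 0;           // proposed: outbounds are idle, so fall back to inbounds
    }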

(To be clear, I don't believe that the headers-sync behaviors with inbound peers are really a problem, because if an inbound peer is the only one to be finding new blocks, we will get their headers via our inv-processing behavior. I think this is just a concern around block fetching.)

@JeremyRubin
Contributor

Noting that I was operating a signet for a while which crashed, and by the time I rebooted it (life happens) it was a few months later. Now after restarting, I think the IBD latch is keeping my seed/mining node in IBD since its tip is in the past, even though I'm producing new blocks. This also prevents this node from relaying the new blocks to peers.

This patch might fix this (pathological) case.

h/t @ajtowns for pointing this out.

@maflcko
Member

maflcko commented Jan 27, 2022

Can this be closed now that #24171 was opened?

@maflcko
Member

maflcko commented Jan 27, 2022

Now after restarting, I think the IBD latch is keeping my seed/mining node in IBD since it's in the past, even though I'm producing new blocks.

Staying in IBD while producing a block with a recent enough timestamp shouldn't be possible. It may only happen that getblocktemplate refuses to give you a reply due to IBD, but since you produced a block this is probably not something you ran into. (If you were, you could use -maxtipage)

@maflcko
Member

maflcko commented Jan 27, 2022

Going to close this for now, since there hasn't been any activity by the author for about a year after the author acknowledged that the current patch has a bug (#21106 (comment)).

Discussion can continue and this can be reopened any time.

@maflcko maflcko closed this Jan 27, 2022
@ajtowns
Contributor

ajtowns commented Jan 27, 2022

Staying in IBD while producing a block with a recent enough timestamp shouldn't be possible.

Once started, the signet miner defaults to calculating timestamps as "previous block time + 10 minutes" with optional randomisation, so if the node was down for weeks or more it could take a while to catch up with realtime and hence leave the miner's node in IBD.

@maflcko
Member

maflcko commented Jan 27, 2022

Ok, I see. So this pull could indeed make sense for signet. Or, as an alternative, the signet miner could modify the mine script to mine a block with a more recent timestamp, or somehow mine the "missing" blocks faster if it is important to have them.

For mainnet, my understanding is that #24171 will be sufficient to address the issue, as explained in #21106 (comment) and the following comment.

@ajtowns
Contributor

ajtowns commented Jan 27, 2022

Ok, I see. So this pull could indeed make sense for signet. Or, as an alternative the signet miner could modify the mine script to mine a block with a more recent timestamp or somehow mine the "missing" blocks faster if it is important to have them.

It's just a default: you can run the miner with --set-block-time=-1 --max-blocks=1 to mine a single block at the current time, then restart with --ongoing, which will continue from there. (Not sure if that had been implemented when Jeremy was having problems.) So I don't think it's too important to optimise for that scenario.

@bitcoin bitcoin locked and limited conversation to collaborators Jan 27, 2023