Description
A user in Discord recently posted the following log:
```
Apr 28 19:09:08.447 CRIT Beacon block processing error error: ValidatorPubkeyCacheLockTimeout, service: beacon
Apr 28 19:09:08.447 WARN BlockProcessingFailure outcome: ValidatorPubkeyCacheLockTimeout, msg: unexpected condition in processing block.
```
This user was not the first; I've personally seen quite a few users report this same error.
The `validator_pubkey_cache` is protected by a `TimeoutRwLock`. This means there can only be lock contention when something holds the write lock.
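To illustrate the failure mode, here is a minimal sketch of the timeout semantics using a polling `try_write` loop over `std::sync::RwLock`. This is not Lighthouse's actual `TimeoutRwLock` implementation; the `try_write_for` function here is a stand-in with (roughly) the same behaviour:

```rust
use std::sync::{RwLock, RwLockWriteGuard};
use std::time::{Duration, Instant};

/// Stand-in for `TimeoutRwLock::try_write_for`: poll `try_write` until the
/// lock is acquired or the deadline passes.
fn try_write_for<'a, T>(
    lock: &'a RwLock<T>,
    timeout: Duration,
) -> Option<RwLockWriteGuard<'a, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(guard) = lock.try_write() {
            return Some(guard);
        }
        if Instant::now() >= deadline {
            return None;
        }
        std::thread::yield_now();
    }
}

fn main() {
    let cache = RwLock::new(vec![1u64, 2, 3]);

    // Uncontended: the write lock is acquired immediately.
    assert!(try_write_for(&cache, Duration::from_millis(10)).is_some());

    // Contended: while a writer (e.g. a slow `import_new_pubkeys`) holds the
    // lock, a second timed write attempt returns `None` -- the analogue of a
    // `ValidatorPubkeyCacheLockTimeout` error.
    let _write_guard = cache.write().unwrap();
    assert!(try_write_for(&cache, Duration::from_millis(10)).is_none());
    println!("timeout behaviour demonstrated");
}
```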
I think (it's worth confirming) this is the only place we take a write-lock on the pubkey cache:
lighthouse/beacon_node/beacon_chain/src/beacon_chain.rs
Lines 1633 to 1641 in 3a24ca5
```rust
// If there are new validators in this block, update our pubkey cache.
//
// We perform this _before_ adding the block to fork choice because the pubkey cache is
// used by attestation processing which will only process an attestation if the block is
// known to fork choice. This ordering ensure that the pubkey cache is always up-to-date.
self.validator_pubkey_cache
    .try_write_for(VALIDATOR_PUBKEY_CACHE_LOCK_TIMEOUT)
    .ok_or(Error::ValidatorPubkeyCacheLockTimeout)?
    .import_new_pubkeys(&state)?;
```
I think (also worth confirming) that the bulk of the time holding the write-lock is spent inside this function:
lighthouse/beacon_node/beacon_chain/src/validator_pubkey_cache.rs
Lines 124 to 169 in 3a24ca5
```rust
/// Adds zero or more validators to `self`.
fn import<I>(&mut self, validator_keys: I) -> Result<(), BeaconChainError>
where
    I: Iterator<Item = PublicKeyBytes> + ExactSizeIterator,
{
    self.pubkey_bytes.reserve(validator_keys.len());
    self.pubkeys.reserve(validator_keys.len());
    self.indices.reserve(validator_keys.len());

    for pubkey in validator_keys {
        let i = self.pubkeys.len();

        if self.indices.contains_key(&pubkey) {
            return Err(BeaconChainError::DuplicateValidatorPublicKey);
        }

        // The item is written to disk _before_ it is written into
        // the local struct.
        //
        // This means that a pubkey cache read from disk will always be equivalent to or
        // _later than_ the cache that was running in the previous instance of Lighthouse.
        //
        // The motivation behind this ordering is that we do not want to have states that
        // reference a pubkey that is not in our cache. However, it's fine to have pubkeys
        // that are never referenced in a state.
        match &mut self.backing {
            PubkeyCacheBacking::File(persistence_file) => {
                persistence_file.append(i, &pubkey)?;
            }
            PubkeyCacheBacking::Database(store) => {
                store.put_item(&DatabasePubkey::key_for_index(i), &DatabasePubkey(pubkey))?;
            }
        }

        self.pubkeys.push(
            (&pubkey)
                .try_into()
                .map_err(BeaconChainError::InvalidValidatorPubkeyBytes)?,
        );
        self.pubkey_bytes.push(pubkey);
        self.indices.insert(pubkey, i);
    }

    Ok(())
}
```
So, it stands to reason that if we can optimize away some of the time spent in `ValidatorPubkeyCache::import`, then we can avoid `ValidatorPubkeyCacheLockTimeout` errors.
I think the first step should be to add metrics to `ValidatorPubkeyCache::import` and determine what's taking the longest.
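As a starting point, the import could be split into timed phases. The sketch below uses a tiny `timed` helper and plain `Vec`s as stand-ins for the persistence file and the in-memory structures; the phase split mirrors the shape of `import`, but none of these names are actual Lighthouse metrics:

```rust
use std::time::{Duration, Instant};

/// Time a closure and return its result alongside the elapsed duration.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() {
    // Stand-ins: `disk` for the persistence file, `pubkeys` for the
    // decompressed in-memory keys.
    let mut disk: Vec<u64> = Vec::new();
    let mut pubkeys: Vec<u64> = Vec::new();
    let keys = [1u64, 2, 3];

    // Phase 1: the per-key disk append (the suspected hot spot).
    let ((), disk_write) = timed(|| {
        for k in &keys {
            disk.push(*k);
        }
    });

    // Phase 2: the in-memory updates (`pubkeys`/`pubkey_bytes`/`indices`).
    let ((), mem_update) = timed(|| {
        for k in &keys {
            pubkeys.push(*k);
        }
    });

    println!("disk_write={:?} mem_update={:?}", disk_write, mem_update);
    assert_eq!(disk.len(), 3);
    assert_eq!(pubkeys.len(), 3);
}
```

In the real code these durations would be reported via Lighthouse's existing metrics machinery rather than printed, so they can be compared across many block imports.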
Potential Optimization
I'm guessing that the write-to-disk part is taking the longest.
It's really important that we retain the current behaviour where the on-disk "backing" is updated somewhat-atomically with the internal `Vec`s and `HashMap`.
One potential option is to have a "prepare" step where we can copy the backing to a new random file and add the new keys to it. Then, we have a "commit" phase where we replace the old backing with the new (only a single file copy operation) and update the internal structures.
The "prepare" step only needs a read-lock, whilst the "commit" still needs a write-lock (but is hopefully much faster).
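The scheme above can be sketched as follows, with a `Vec` standing in for the on-disk backing file and a `HashMap` for the index map. All names here (`Cache`, `prepare`, `commit`) are illustrative, not Lighthouse's:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

struct Cache {
    backing: Vec<u64>,            // stand-in for the on-disk persistence file
    indices: HashMap<u64, usize>, // in-memory pubkey -> index map
}

/// "Prepare": under a read-lock, copy the backing and append the new keys.
/// This is the presumed slow part, now done without blocking other readers.
fn prepare(lock: &RwLock<Cache>, new_keys: &[u64]) -> Vec<u64> {
    let cache = lock.read().unwrap();
    let mut new_backing = cache.backing.clone(); // "copy to a new random file"
    new_backing.extend_from_slice(new_keys);
    new_backing
}

/// "Commit": under a write-lock, swap in the new backing and update the
/// internal structures. Only cheap operations happen here.
fn commit(lock: &RwLock<Cache>, new_backing: Vec<u64>, new_keys: &[u64]) {
    let mut cache = lock.write().unwrap();
    let base = cache.backing.len();
    cache.backing = new_backing; // "replace the old backing with the new"
    for (offset, key) in new_keys.iter().enumerate() {
        cache.indices.insert(*key, base + offset);
    }
}

fn main() {
    let lock = RwLock::new(Cache {
        backing: vec![10, 20],
        indices: HashMap::new(),
    });
    let new_keys = [30, 40];

    let staged = prepare(&lock, &new_keys); // read-lock only
    commit(&lock, staged, &new_keys);       // brief write-lock

    let cache = lock.read().unwrap();
    assert_eq!(cache.backing, vec![10, 20, 30, 40]);
    assert_eq!(cache.indices[&30], 2);
    println!("prepare/commit ok");
}
```

One caveat this sketch glosses over: if another import could land between "prepare" and "commit", the commit must detect the stale base (or the two phases must be serialised by a separate import mutex) to keep the backing and the maps consistent.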
Once we determine that it is indeed the disk write that's slow, we can expand upon this.