
Optimize against ValidatorPubkeyCache timeouts #2327

@paulhauner

Description


A user in Discord recently posted the following log:

Apr 28 19:09:08.447 CRIT Beacon block processing error           error: ValidatorPubkeyCacheLockTimeout, service: beacon
Apr 28 19:09:08.447 WARN BlockProcessingFailure                  outcome: ValidatorPubkeyCacheLockTimeout, msg: unexpected condition in processing block.

This user was not the first; I've personally seen quite a few users report this same error.

The validator_pubkey_cache is protected by a TimeoutRwLock. This means there can only be lock contention when something holds the write lock.
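To illustrate the failure mode, here is a minimal stdlib sketch of a timeout-guarded write-lock acquisition (the `try_write_for` helper is hypothetical; Lighthouse's actual `TimeoutRwLock` is a different implementation). The point is the same: if something holds the lock past the deadline, the caller gets a "timeout" instead of blocking forever.

```rust
use std::sync::{RwLock, RwLockWriteGuard};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for TimeoutRwLock: spin on `try_write` until a deadline.
fn try_write_for<T>(
    lock: &RwLock<T>,
    timeout: Duration,
) -> Option<RwLockWriteGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(guard) = lock.try_write() {
            return Some(guard);
        }
        if Instant::now() >= deadline {
            // This is the moment a ValidatorPubkeyCacheLockTimeout would be returned.
            return None;
        }
        std::thread::yield_now();
    }
}

fn main() {
    // Uncontended: the write lock is acquired immediately.
    let cache = RwLock::new(vec![1u64, 2, 3]);
    assert!(try_write_for(&cache, Duration::from_millis(50)).is_some());

    // While a read guard is held, the write attempt times out instead of blocking.
    let lock = RwLock::new(0u64);
    let _reader = lock.read().unwrap();
    assert!(try_write_for(&lock, Duration::from_millis(10)).is_none());
}
```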

I think (it's worth confirming) this is the only place we take a write-lock on the pubkey cache:

// If there are new validators in this block, update our pubkey cache.
//
// We perform this _before_ adding the block to fork choice because the pubkey cache is
// used by attestation processing which will only process an attestation if the block is
// known to fork choice. This ordering ensures that the pubkey cache is always up-to-date.
self.validator_pubkey_cache
    .try_write_for(VALIDATOR_PUBKEY_CACHE_LOCK_TIMEOUT)
    .ok_or(Error::ValidatorPubkeyCacheLockTimeout)?
    .import_new_pubkeys(&state)?;

I think (also worth confirming) that the bulk of the time holding the write-lock is spent inside this function:

/// Adds zero or more validators to `self`.
fn import<I>(&mut self, validator_keys: I) -> Result<(), BeaconChainError>
where
    I: Iterator<Item = PublicKeyBytes> + ExactSizeIterator,
{
    self.pubkey_bytes.reserve(validator_keys.len());
    self.pubkeys.reserve(validator_keys.len());
    self.indices.reserve(validator_keys.len());

    for pubkey in validator_keys {
        let i = self.pubkeys.len();

        if self.indices.contains_key(&pubkey) {
            return Err(BeaconChainError::DuplicateValidatorPublicKey);
        }

        // The item is written to disk _before_ it is written into
        // the local struct.
        //
        // This means that a pubkey cache read from disk will always be equivalent to or
        // _later than_ the cache that was running in the previous instance of Lighthouse.
        //
        // The motivation behind this ordering is that we do not want to have states that
        // reference a pubkey that is not in our cache. However, it's fine to have pubkeys
        // that are never referenced in a state.
        match &mut self.backing {
            PubkeyCacheBacking::File(persistence_file) => {
                persistence_file.append(i, &pubkey)?;
            }
            PubkeyCacheBacking::Database(store) => {
                store.put_item(&DatabasePubkey::key_for_index(i), &DatabasePubkey(pubkey))?;
            }
        }

        self.pubkeys.push(
            (&pubkey)
                .try_into()
                .map_err(BeaconChainError::InvalidValidatorPubkeyBytes)?,
        );
        self.pubkey_bytes.push(pubkey);
        self.indices.insert(pubkey, i);
    }

    Ok(())
}

So, it stands to reason that if we can reduce the time spent in ValidatorPubkeyCache::import, we can avoid ValidatorPubkeyCacheLockTimeout errors.

I think the first step should be to add metrics to ValidatorPubkeyCache::import and determine what's taking the longest.
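As a rough illustration of the instrumentation meant here, the loop body splits naturally into a disk-write phase and an in-memory phase, and each can be timed separately. This is a hedged sketch using std::time::Instant with a hypothetical ImportTimings struct; Lighthouse's real metrics machinery would record these as histogram observations instead.

```rust
use std::time::Instant;

/// Hypothetical per-phase timings for `import`: one bucket for the disk write,
/// one for the in-memory Vec/HashMap updates.
#[derive(Default, Debug)]
struct ImportTimings {
    disk_write_secs: f64,
    memory_update_secs: f64,
}

fn import_with_timings(keys: &[u64], timings: &mut ImportTimings) {
    for key in keys {
        let t = Instant::now();
        // ... persist the key to the file/database backing here ...
        timings.disk_write_secs += t.elapsed().as_secs_f64();

        let t = Instant::now();
        // ... push into pubkeys / pubkey_bytes / indices here ...
        let _ = key;
        timings.memory_update_secs += t.elapsed().as_secs_f64();
    }
}

fn main() {
    let mut timings = ImportTimings::default();
    import_with_timings(&[1, 2, 3], &mut timings);
    // Comparing the two buckets shows which phase dominates the write-lock hold time.
    println!("{timings:?}");
    assert!(timings.disk_write_secs >= 0.0 && timings.memory_update_secs >= 0.0);
}
```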

Potential Optimization

I'm guessing that the write-to-disk part is taking the longest.

It's really important that we retain the current behaviour where the on-disk "backing" is updated somewhat-atomically with the internal Vecs and HashMap.

One potential option is to have a "prepare" step where we can copy the backing to a new random file and add the new keys to it. Then, we have a "commit" phase where we replace the old backing with the new (only a single file copy operation) and update the internal structures.

The "prepare" step only needs a read-lock, whilst the "commit" still needs a write-lock (but is hopefully much faster).
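The prepare/commit idea can be sketched with plain files: copy the backing to a staging file, append the new keys to the copy, then atomically rename it over the original. This is a simplified illustration (function names, the line-per-key format, and the staging path are all hypothetical), but the key property holds: only the final rename needs the write-lock.

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::{Path, PathBuf};

/// "Prepare": copy the existing backing file and append the new keys to the copy.
/// The live backing is untouched, so this only needs a read-lock on the cache.
fn prepare(backing: &Path, new_keys: &[String]) -> std::io::Result<PathBuf> {
    let staged = backing.with_extension("staged");
    fs::copy(backing, &staged)?;
    let mut f = fs::OpenOptions::new().append(true).open(&staged)?;
    for key in new_keys {
        writeln!(f, "{key}")?;
    }
    Ok(staged)
}

/// "Commit": replace the old backing with the staged file in a single rename
/// (atomic on the same filesystem). This is the only step that needs the write-lock.
fn commit(staged: &Path, backing: &Path) -> std::io::Result<()> {
    fs::rename(staged, backing)
}

fn main() -> std::io::Result<()> {
    let backing = std::env::temp_dir().join("pubkey_cache_demo");
    File::create(&backing)?.write_all(b"key0\n")?;

    let staged = prepare(&backing, &["key1".to_string()])?;
    commit(&staged, &backing)?;

    assert_eq!(fs::read_to_string(&backing)?, "key0\nkey1\n");
    fs::remove_file(&backing)?;
    Ok(())
}
```

Note that the internal Vecs and HashMap still have to be updated under the same write-lock as the rename to preserve the "disk is never behind memory" invariant described above.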

Once we determine it is indeed writing to disk that's slow we can expand upon this.

Labels: enhancement (New feature or request), optimization (Something to make Lighthouse run more efficiently)