Description
A user in Discord recently posted the following log:
```
Apr 28 19:09:08.447 CRIT Beacon block processing error error: ValidatorPubkeyCacheLockTimeout, service: beacon
Apr 28 19:09:08.447 WARN BlockProcessingFailure outcome: ValidatorPubkeyCacheLockTimeout, msg: unexpected condition in processing block.
```
This user was not the first; I've personally seen quite a few users report this same error.
The `validator_pubkey_cache` is protected by a `TimeoutRwLock`. This means there can only be lock contention when something holds the write lock.
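To illustrate the failure mode, here is a minimal sketch of the timeout semantics using a polling `try_write` loop over `std::sync::RwLock`. This is not Lighthouse's actual `TimeoutRwLock` implementation; the `try_write_for` function here is a stand-in with (roughly) the same behaviour:

```rust
use std::sync::{RwLock, RwLockWriteGuard};
use std::time::{Duration, Instant};

/// Stand-in for `TimeoutRwLock::try_write_for`: poll `try_write` until the
/// lock is acquired or the deadline passes.
fn try_write_for<'a, T>(
    lock: &'a RwLock<T>,
    timeout: Duration,
) -> Option<RwLockWriteGuard<'a, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(guard) = lock.try_write() {
            return Some(guard);
        }
        if Instant::now() >= deadline {
            return None;
        }
        std::thread::yield_now();
    }
}

fn main() {
    let cache = RwLock::new(vec![1u64, 2, 3]);

    // Uncontended: the write lock is acquired immediately.
    assert!(try_write_for(&cache, Duration::from_millis(10)).is_some());

    // Contended: while a writer (e.g. a slow `import_new_pubkeys`) holds the
    // lock, a second timed write attempt returns `None` -- the analogue of a
    // `ValidatorPubkeyCacheLockTimeout` error.
    let _write_guard = cache.write().unwrap();
    assert!(try_write_for(&cache, Duration::from_millis(10)).is_none());
    println!("timeout behaviour demonstrated");
}
```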
I think (it's worth confirming) this is the only place we take a write-lock on the pubkey cache:
lighthouse/beacon_node/beacon_chain/src/beacon_chain.rs
Lines 1633 to 1641 in 3a24ca5
```rust
// If there are new validators in this block, update our pubkey cache.
//
// We perform this _before_ adding the block to fork choice because the pubkey cache is
// used by attestation processing which will only process an attestation if the block is
// known to fork choice. This ordering ensure that the pubkey cache is always up-to-date.
self.validator_pubkey_cache
    .try_write_for(VALIDATOR_PUBKEY_CACHE_LOCK_TIMEOUT)
    .ok_or(Error::ValidatorPubkeyCacheLockTimeout)?
    .import_new_pubkeys(&state)?;
```
I think (also worth confirming) that the bulk of the time holding the write-lock is spent inside this function:
lighthouse/beacon_node/beacon_chain/src/validator_pubkey_cache.rs
Lines 124 to 169 in 3a24ca5
```rust
/// Adds zero or more validators to `self`.
fn import<I>(&mut self, validator_keys: I) -> Result<(), BeaconChainError>
where
    I: Iterator<Item = PublicKeyBytes> + ExactSizeIterator,
{
    self.pubkey_bytes.reserve(validator_keys.len());
    self.pubkeys.reserve(validator_keys.len());
    self.indices.reserve(validator_keys.len());

    for pubkey in validator_keys {
        let i = self.pubkeys.len();

        if self.indices.contains_key(&pubkey) {
            return Err(BeaconChainError::DuplicateValidatorPublicKey);
        }

        // The item is written to disk _before_ it is written into
        // the local struct.
        //
        // This means that a pubkey cache read from disk will always be equivalent to or
        // _later than_ the cache that was running in the previous instance of Lighthouse.
        //
        // The motivation behind this ordering is that we do not want to have states that
        // reference a pubkey that is not in our cache. However, it's fine to have pubkeys
        // that are never referenced in a state.
        match &mut self.backing {
            PubkeyCacheBacking::File(persistence_file) => {
                persistence_file.append(i, &pubkey)?;
            }
            PubkeyCacheBacking::Database(store) => {
                store.put_item(&DatabasePubkey::key_for_index(i), &DatabasePubkey(pubkey))?;
            }
        }

        self.pubkeys.push(
            (&pubkey)
                .try_into()
                .map_err(BeaconChainError::InvalidValidatorPubkeyBytes)?,
        );
        self.pubkey_bytes.push(pubkey);
        self.indices.insert(pubkey, i);
    }

    Ok(())
}
```
So, it stands to reason that if we can optimize away some of the time spent in `ValidatorPubkeyCache::import`, then we can avoid `ValidatorPubkeyCacheLockTimeout` errors.
I think the first step should be to add metrics to `ValidatorPubkeyCache::import` and determine what's taking the longest.
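As a starting point, the import could be split into timed phases. The sketch below uses a tiny `timed` helper and plain `Vec`s as stand-ins for the persistence file and the in-memory structures; the phase split mirrors the shape of `import`, but none of these names are actual Lighthouse metrics:

```rust
use std::time::{Duration, Instant};

/// Time a closure and return its result alongside the elapsed duration.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() {
    // Stand-ins: `disk` for the persistence file, `pubkeys` for the
    // decompressed in-memory keys.
    let mut disk: Vec<u64> = Vec::new();
    let mut pubkeys: Vec<u64> = Vec::new();
    let keys = [1u64, 2, 3];

    // Phase 1: the per-key disk append (the suspected hot spot).
    let ((), disk_write) = timed(|| {
        for k in &keys {
            disk.push(*k);
        }
    });

    // Phase 2: the in-memory updates (`pubkeys`/`pubkey_bytes`/`indices`).
    let ((), mem_update) = timed(|| {
        for k in &keys {
            pubkeys.push(*k);
        }
    });

    println!("disk_write={:?} mem_update={:?}", disk_write, mem_update);
    assert_eq!(disk.len(), 3);
    assert_eq!(pubkeys.len(), 3);
}
```

In the real code these durations would be reported via Lighthouse's existing metrics machinery rather than printed, so they can be compared across many block imports.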
Potential Optimization
I'm guessing that the write-to-disk part is taking the longest.
It's really important that we retain the current behaviour where the on-disk "backing" is updated somewhat-atomically with the internal `Vec`s and `HashMap`.
One potential option is to have a "prepare" step where we can copy the backing to a new random file and add the new keys to it. Then, we have a "commit" phase where we replace the old backing with the new (only a single file copy operation) and update the internal structures.
The "prepare" step only needs a read-lock, whilst the "commit" still needs a write-lock (but is hopefully much faster).
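The scheme above can be sketched as follows, with a `Vec` standing in for the on-disk backing file and a `HashMap` for the index map. All names here (`Cache`, `prepare`, `commit`) are illustrative, not Lighthouse's:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

struct Cache {
    backing: Vec<u64>,            // stand-in for the on-disk persistence file
    indices: HashMap<u64, usize>, // in-memory pubkey -> index map
}

/// "Prepare": under a read-lock, copy the backing and append the new keys.
/// This is the presumed slow part, now done without blocking other readers.
fn prepare(lock: &RwLock<Cache>, new_keys: &[u64]) -> Vec<u64> {
    let cache = lock.read().unwrap();
    let mut new_backing = cache.backing.clone(); // "copy to a new random file"
    new_backing.extend_from_slice(new_keys);
    new_backing
}

/// "Commit": under a write-lock, swap in the new backing and update the
/// internal structures. Only cheap operations happen here.
fn commit(lock: &RwLock<Cache>, new_backing: Vec<u64>, new_keys: &[u64]) {
    let mut cache = lock.write().unwrap();
    let base = cache.backing.len();
    cache.backing = new_backing; // "replace the old backing with the new"
    for (offset, key) in new_keys.iter().enumerate() {
        cache.indices.insert(*key, base + offset);
    }
}

fn main() {
    let lock = RwLock::new(Cache {
        backing: vec![10, 20],
        indices: HashMap::new(),
    });
    let new_keys = [30, 40];

    let staged = prepare(&lock, &new_keys); // read-lock only
    commit(&lock, staged, &new_keys);       // brief write-lock

    let cache = lock.read().unwrap();
    assert_eq!(cache.backing, vec![10, 20, 30, 40]);
    assert_eq!(cache.indices[&30], 2);
    println!("prepare/commit ok");
}
```

One caveat this sketch glosses over: if another import could land between "prepare" and "commit", the commit must detect the stale base (or the two phases must be serialised by a separate import mutex) to keep the backing and the maps consistent.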
Once we determine that it is indeed the disk write that's slow, we can expand upon this.