Fix bug in freezer DB storage of randao_mixes #3011

@michaelsproul

Description

There's a bug lurking in the database code that can cause occasional database corruption. It doesn't happen consistently, but when it does it seems to result in a zero hash (0x00) appearing in the randao_mixes array.

The case I'm investigating presented as corruption at slot 135168 on Prater:

Feb 08 07:31:26.461 ERRO State reconstruction failed             error: HotColdDBError(BlockReplayBlockError(HeaderInvalid { reason: ParentBlockRootMismatch { state: 0x373eb699eae0110e474671cab72d5c6ca666d4a6f5a5a356f2af89039ad98382, block: 0xabf45deec98af2873a04d352ebbc54eac35d00c8157fea27b21f9adc2446233b } })), service: beacon

Oddly, the first corrupt state actually occurs much earlier. I found that the state at slot 12288 was corrupt using this (fish) script, which checksums every 2048th state so it can be compared against a known-good node:

for i in (seq 0 2048 135168)
    set checksum (curl -s -H "Accept: application/octet-stream" "http://localhost:5052/eth/v2/debug/beacon/states/$i" | sha256sum)
    echo "$i: $checksum"
end

Diffing the corrupt state at slot 12288 against the real state reveals a 0x00 value in the randao_mixes at index 320. This is interesting because that corresponds to epoch 320, i.e. 64 epochs prior to slot 12288 (epoch 384).
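
The slot and epoch arithmetic above can be checked with a minimal sketch. It assumes the mainnet preset values that Prater uses (32 slots per epoch, 65536 entries in `randao_mixes`). The helper names `epoch_of`, `randao_index` and `zero_mix_indices` are hypothetical and are not part of the Lighthouse code base; the zero-scan is just one way such a 0x00 entry could be located in a decoded state.

```rust
// ASSUMPTION: mainnet-preset constants (Prater uses the mainnet preset).
const SLOTS_PER_EPOCH: u64 = 32;
const EPOCHS_PER_HISTORICAL_VECTOR: u64 = 65_536;

/// Epoch that contains `slot`.
fn epoch_of(slot: u64) -> u64 {
    slot / SLOTS_PER_EPOCH
}

/// Position of `epoch`'s mix inside the circular `randao_mixes` vector.
fn randao_index(epoch: u64) -> usize {
    (epoch % EPOCHS_PER_HISTORICAL_VECTOR) as usize
}

/// Hypothetical helper: indices whose 32-byte mix is entirely zero.
fn zero_mix_indices(mixes: &[[u8; 32]]) -> Vec<usize> {
    mixes
        .iter()
        .enumerate()
        .filter(|(_, mix)| mix.iter().all(|b| *b == 0))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let state_epoch = epoch_of(12_288);
    println!("state epoch = {state_epoch}");
    println!("index of epoch 320 = {}", randao_index(320));
    println!("epochs behind the state = {}", state_epoch - 320);

    // Toy vector standing in for a decoded state: every entry is
    // non-zero except the one at index 320.
    let mut mixes = vec![[0xabu8; 32]; 400];
    mixes[320] = [0u8; 32];
    println!("zeroed indices = {:?}", zero_mix_indices(&mixes));
}
```

With these constants, slot 12288 falls in epoch 384, and index 320 lies 64 epochs behind it, matching the figures quoted above.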

I think the bug must be in store_updated_vector, which is responsible for writing the randao mixes in the flat format used by the database:

pub fn store_updated_vector<F: Field<E>, E: EthSpec, S: KeyValueStore<E>>(
    field: F,
    store: &S,
    state: &BeaconState<E>,
    spec: &ChainSpec,
    ops: &mut Vec<KeyValueStoreOp>,
) -> Result<(), Error> {
    let chunk_size = F::chunk_size();
    let (start_vindex, end_vindex) = F::start_and_end_vindex(state.slot(), spec);
    let start_cindex = start_vindex / chunk_size;
    let end_cindex = end_vindex / chunk_size;

    // Store the genesis value if we have access to it, and it hasn't been stored already.
    if F::slot_needs_genesis_value(state.slot(), spec) {
        let genesis_value = F::extract_genesis_value(state, spec)?;
        F::check_and_store_genesis_value(store, genesis_value, ops)?;
    }

    // Start by iterating backwards from the last chunk, storing new chunks in the database.
    // Stop once a chunk in the database matches what we were about to store, this indicates
    // that a previously stored state has already filled-in a portion of the indices covered.
    let full_range_checked = store_range(
        field,
        (start_cindex..=end_cindex).rev(),
        start_vindex,
        end_vindex,
        store,
        state,
        spec,
        ops,
    )?;

    // If the previous `store_range` did not check the entire range, it may be the case that the
    // state's vector includes elements at low vector indices that are not yet stored in the
    // database, so run another `store_range` to ensure these values are also stored.
    if !full_range_checked {
        store_range(
            field,
            start_cindex..end_cindex,
            start_vindex,
            end_vindex,
            store,
            state,
            spec,
            ops,
        )?;
    }

    Ok(())
}
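
To make the two passes in `store_updated_vector` concrete, here is a small sketch of the chunk-index arithmetic and the order in which each pass visits chunks. It is only an illustration of the iteration pattern and does not identify the bug. The chunk size of 128 and the vindex bounds are assumptions chosen for the example (the real values come from `F::chunk_size()` and `F::start_and_end_vindex`), and `chunk_index`, `backward_pass` and `forward_pass` are hypothetical names.

```rust
/// Chunk that holds vector index `vindex`.
fn chunk_index(vindex: usize, chunk_size: usize) -> usize {
    vindex / chunk_size
}

/// Chunk indices visited by the first pass: backwards, end-inclusive.
fn backward_pass(start_cindex: usize, end_cindex: usize) -> Vec<usize> {
    (start_cindex..=end_cindex).rev().collect()
}

/// Chunk indices visited by the second pass: forwards, end-exclusive.
fn forward_pass(start_cindex: usize, end_cindex: usize) -> Vec<usize> {
    (start_cindex..end_cindex).collect()
}

fn main() {
    // ASSUMPTION: 128 elements per chunk; the real value is F::chunk_size().
    let chunk_size = 128;
    // ASSUMPTION: illustrative bounds, not the output of start_and_end_vindex.
    let (start_vindex, end_vindex) = (0usize, 384usize);

    let start_cindex = chunk_index(start_vindex, chunk_size);
    let end_cindex = chunk_index(end_vindex, chunk_size);

    println!("first pass:  {:?}", backward_pass(start_cindex, end_cindex));
    println!("second pass: {:?}", forward_pass(start_cindex, end_cindex));
    println!("index 320 lives in chunk {}", chunk_index(320, chunk_size));
}
```

Note that the second pass never visits `end_cindex`: the backwards pass always begins with that chunk, so it has already been compared or written by the time the forwards pass runs.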

It's possible that we're somehow re-writing the old state at 12288, which inappropriately zeroes some entries and corrupts all subsequent states. I don't think the corruption can occur the first time state 12288 is written, otherwise it would have failed the block root check at that point or shortly after.

Will update this issue with more info soon.

Labels

bug, database