Skip to content

Conversation

tillrohrmann
Copy link
Contributor

@tillrohrmann tillrohrmann commented Apr 28, 2025

This commit introduces the ReadIsolation level for Storage transactions. This allows to choose between no or snapshot read isolation. The latter is useful when reading from outside of the partition processor where concurrent writes can happen (e.g. the invoker reading the journal, the metadata and the state of an invocation). The snapshot isolation level creates a Rocksdb snapshot which will affect compaction. It is, therefore, recommended to not keep snapshot transactions open for too long.

This fixes #3192.

@@ -415,6 +423,11 @@ impl PartitionStore {
)
});

let snapshot = match read_isolation {
ReadIsolation::None => None,
ReadIsolation::Snapshot => Some(self.rocksdb.inner().as_raw_db().snapshot()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't run any benchmarks on the negative effects of using snapshot reads in the invoker yet.

Comment on lines 42 to 54
pub enum ReadIsolation {
/// No read isolation. This allows the storage implementation to return dirty reads. Use this
/// level, if you know that you are the only writer.
None,
/// Reads are served from a consistent snapshot of the underlying storage implementation. This
/// ensures that multiple reads get a consistent view of the stored information. Use this level,
/// if you need to do multiple reads and tolerate concurrent write operations.
///
/// Note that using this level is more costly than [`ReadIsolation::None`].
Snapshot,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the documentation matches reality here. What do we mean by "dirty reads" here? The default is that the writebatch/tx's changes is not visible to other transactions unless it's committed.

Do we have other uses of the API that split a single logical operation across multiple write batches?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true for the concrete implementation of our Storage trait. I just didn't want to overfit to it (even though there is probably no other impl in the foreseeable future).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we call this IsolationLevel and maybe rename Snapshot to RepeatableReads?

Copy link
Contributor Author

@tillrohrmann tillrohrmann Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Will update it accordingly.

@tillrohrmann tillrohrmann force-pushed the issues/3192 branch 2 times, most recently from 54df4b9 to 27376da Compare April 28, 2025 20:58
@tillrohrmann
Copy link
Contributor Author

I've updated this PR and included a commit which shows how a transaction with repeatable read isolation can be used by the invoker to ensure the reading the journal, journal metadata and state is consistent.

@tillrohrmann tillrohrmann marked this pull request as ready for review April 28, 2025 21:58
Copy link
Contributor

@AhmedSoliman AhmedSoliman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good from my side. Only a small note on the test to reduce the amount of work it needs to do.

@@ -143,8 +144,7 @@ pub(super) struct InvocationTask<SR, JR, EE, DMR> {
retry_count_since_last_stored_entry: u32,

// Invoker tx/rx
state_reader: SR,
journal_reader: JR,
invocation_reader: IR,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

This commit introduces the ReadIsolation level for Storage transactions.
This allows to choose between no or snapshot read isolation. The latter
is useful when reading from outside of the partition processor where
concurrent writes can happen (e.g. the invoker reading the journal, the
metadata and the state of an invocation). The snapshot isolation level
creates a Rocksdb snapshot which will affect compaction. It is,
therefore, recommended to not keep snapshot transactions open for too long.

This fixes restatedev#3192.
…eads

In order to tolerate concurrent writes happening in the PP, the invoker is now
using the InvokerReaderTransaction when reading the journal, journal metadata
and the state. The transaction ensures that all reads see a consistent view of
the underlying storage until the transaction is closed.
@tillrohrmann tillrohrmann merged commit d1a411b into restatedev:main Apr 29, 2025
6 checks passed
@tillrohrmann tillrohrmann deleted the issues/3192 branch April 29, 2025 09:36
@github-actions github-actions bot locked and limited conversation to collaborators Apr 29, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support consistent partition store reads when accessing multiple keys/cfs
2 participants