-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
AS transaction isolation tweaks #12032
Conversation
"complete_appservice_txn", | ||
_complete_appservice_txn, | ||
isolation_level=IsolationLevel.READ_COMMITTED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realise one of the transactions in question was non-trivial! That complicates things a little since last_txn
can now change out from under us in between _get_last_txn
and the upsert. Previously, we'd get a SerializationFailure and retry until we get a consistent view throughout the transaction.
I think this is okay though since last_txn_id
is only used for a debug check, which will now be less reliable - if two complete_appservice_txn
transactions run simultaneously with the same txn_id
, neither of them may trip the debug check (whereas before, one of them would have to retry and trip the check). It's best to leave a comment noting that last_txn_id
can change out from under us before the upsert because we're in READ COMMITTED mode, or something to that effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative is to use a RETURNING
clause on the upsert to get the last_txn_id
. The downside is you'd need a different path for SQLite (which doesn't support RETURNING
).
If we then also replace the DELETE
with a DELETE WHERE txn_id <= ?
then I don't think this even needs to be a transaction, and we can do the two queries with db_autocommit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought - should txn_id
be generated using a sequence? The debug check and manual generation of TXN ID seems unncessary to me (this would potentially open up running multiple appservice workers with additional work I think?).
Currently I believe this is safe but only because there is only a single process-per-AS working on transactions (enforced by https://github.com/matrix-org/synapse/blob/develop/synapse/appservice/scheduler.py#L181).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason the debug check trips on matrix.org from time to time, perhaps around the time synapse workers are restarting(?). It's not immediately clear to me why it happens.
I can't see why we don't use a sequence for txn_id
. Something to do with sqlite support maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly historical reasons? The AS codepaths look old on blame to me! Not super familiar with the rest of synapse but sequences in use/defined in https://github.com/matrix-org/synapse/blob/develop/synapse/storage/util/sequence.py so potentially would be beneficial to switch to those. Need more context to make that call though, and probably beyond the scope of this PR!
This has now been fixed in #12209! |
Probably fixes #11620.
Let me know if you'd prefer this in two PRs!
Signed-off-by: Nick Mills-Barrett nick@beeper.com
Pull Request Checklist
(run the linters)