This repository was archived by the owner on Apr 26, 2024. It is now read-only.

Could not serialize access due to concurrent DELETE from presence_stream #15827


Description

@realtyem
2023-06-24 20:17:27.496 CDT [29164] ERROR:  could not serialize access due to concurrent delete
2023-06-24 20:17:27.496 CDT [29164] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31609735 AND user_id = ANY(ARRAY[<a list of 34 different users>])
2023-06-25 00:19:10.120 CDT [29528] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:19:10.120 CDT [29528] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624045 AND user_id = ANY(ARRAY['@user:someserver'])
2023-06-25 00:21:01.669 CDT [29511] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:01.669 CDT [29511] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624211 AND user_id = ANY(ARRAY['@otheruser:someotherserver'])
2023-06-25 00:21:01.669 CDT [29526] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:01.669 CDT [29526] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624212 AND user_id = ANY(ARRAY['@otheruser:someotherserver'])
2023-06-25 00:21:01.679 CDT [29526] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:01.679 CDT [29526] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624212 AND user_id = ANY(ARRAY['@otheruser:someotherserver'])
2023-06-25 00:21:02.085 CDT [29527] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:02.085 CDT [29527] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624214 AND user_id = ANY(ARRAY['@thirduser:yetanotherserver'])
2023-06-25 00:21:02.117 CDT [29498] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:02.117 CDT [29498] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624217 AND user_id = ANY(ARRAY['@thirduser:yetanotherserver'])
2023-06-25 00:21:02.117 CDT [29524] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:02.117 CDT [29524] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624218 AND user_id = ANY(ARRAY['@thirduser:yetanotherserver'])
2023-06-25 00:21:02.128 CDT [29524] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:02.128 CDT [29524] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624218 AND user_id = ANY(ARRAY['@thirduser:yetanotherserver'])

So this is annoying. While it doesn't always occur this regularly, it does seem to come in batches. I took a little while to investigate and found the following:

  1. The function this occurs in, PresenceStore.update_presence(), takes 35-47ms to complete when this happens, but only 5-25ms when it doesn't (on my two-user workerized homeserver). The extra time appears to come from rolling back the transaction before retrying. (These numbers should not be treated as a benchmark, since several factors can influence them; e.g. monolith Postgres would sometimes take 11-95ms without any contention. The point is that a rolled-back transaction can roughly double the time to complete, or more if it has to retry multiple times.)
  2. Inside the presence_handler, there are 3 places this PresenceStore function is called:
    1. _on_shutdown() - Fairly self-evident what this does.
    2. _persist_unpersisted_changes() - used like a drag net to make sure anything that still needs to be written to the database gets written, as otherwise it stacks up and affects shutdown times. Run as a looping call every 60 seconds.
    3. handle_presence_timeouts (which is actually _handle_timeouts()) -> _update_states() -> _persist_and_notify() - the main point of entry for presence state changes, responsible for sending changes over federation and persisting them. Run as a looping call every 5 seconds. (See the sketch after this list for how the two timers can line up.)
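
To make the overlap concrete, here is a minimal sketch of how the 60-second and 5-second loops can both end up inside PresenceStore.update_presence() at the same moment. This is not Synapse's actual code: the scheduling uses asyncio for illustration (Synapse uses Twisted looping calls), and pop_unpersisted_states() / compute_timed_out_states() are hypothetical stand-ins for the presence handler's internal bookkeeping.

```python
import asyncio


async def _persist_unpersisted_changes(store):
    # The 60-second "drag net": flush any user states that were never persisted.
    unpersisted = store.pop_unpersisted_states()  # hypothetical helper
    if unpersisted:
        await store.update_presence(unpersisted)  # ends in a DELETE + INSERT


async def _handle_timeouts(store):
    # The 5-second main path: compute timed-out states, then persist and notify.
    changed = store.compute_timed_out_states()  # hypothetical helper
    if changed:
        await store.update_presence(changed)  # ends in a DELETE + INSERT


async def _loop(period, fn, store):
    while True:
        await asyncio.sleep(period)
        await fn(store)


async def main(store):
    # Every 12th tick of the 5-second loop coincides with the 60-second loop,
    # so both coroutines can issue their DELETE against presence_stream at
    # (nearly) the same moment: the overlap described below.
    await asyncio.gather(
        _loop(60, _persist_unpersisted_changes, store),
        _loop(5, _handle_timeouts, store),
    )
```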

The main reason this seems to occur is that _persist_unpersisted_changes() and _persist_and_notify() sometimes run at the same time, due to overlapping timeouts. The process of persisting presence changes runs a DELETE before the INSERT, in order to clean up rows from previous stream IDs. When _persist_unpersisted_changes() runs at the same time as _persist_and_notify(), it carries a lower stream ID, and the two DELETEs end up targeting some of the same rows, which trips PostgreSQL's serialization check and aborts one of the transactions. The solution seems to be to handle this in a way that allows the transactions to serialize.
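For reference, the shape of that DELETE-before-INSERT transaction, reconstructed from the logged statements above, looks roughly like this. It is a minimal psycopg2-style sketch, not Synapse's actual update_presence implementation; persist_presence_txn and states are hypothetical names, and the INSERT column list is abbreviated for illustration.

```python
def persist_presence_txn(cur, stream_id, states):
    """One transaction: clear older rows for these users, then write new ones.

    `states` is a hypothetical list of (user_id, state) pairs; the real table
    has more columns than shown here.
    """
    user_ids = [user_id for user_id, _ in states]

    # Clean up rows from earlier stream positions for the affected users.
    # When the 60s drag-net transaction and the 5s timeout transaction both
    # reach this statement for overlapping user_ids, the one that arrives
    # second fails with "could not serialize access due to concurrent delete"
    # and has to roll back and retry.
    cur.execute(
        "DELETE FROM presence_stream WHERE stream_id < %s AND user_id = ANY(%s)",
        (stream_id, user_ids),
    )

    # Write the new presence rows (column list abbreviated for illustration).
    cur.executemany(
        "INSERT INTO presence_stream (stream_id, user_id, state) VALUES (%s, %s, %s)",
        [(stream_id, user_id, state) for user_id, state in states],
    )
```

This failure mode is standard PostgreSQL behaviour at REPEATABLE READ or SERIALIZABLE isolation: a DELETE that hits a row already deleted by a concurrently committed transaction aborts with a serialization error rather than silently skipping the row, which is exactly what the log excerpts above show.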


Labels: A-Database, A-Presence, O-Uncommon, S-Tolerable, T-Defect
