This repository was archived by the owner on Apr 26, 2024. It is now read-only.

User directory is performing state resolution, which results in unnecessary CPU usage #9797

@anoadragon453

Description

On April 11th, 2021 at ~12:00 UTC, matrix.org's user directory worker started using 100% CPU consistently, and it continued to do so until it was restarted on April 12th at 16:10 UTC.

It turns out that it was stuck doing state resolution for an IRC room with 123,000+ state events.

It's somewhat surprising that the user directory is doing state resolution at all, as it should just be listening for membership changes on the current_state_deltas_stream and updating the tables used for user directory search accordingly.

In the logs, we see the following repeated multiple times per second:

2021-04-12 00:00:44,506 - synapse.replication.tcp.handler - 496 - INFO - replication_command_handler@7f0b5b2e2268 - Handling 'POSITION events event_persister-2 1939721421 1939721422'
2021-04-12 00:00:44,506 - synapse.replication.tcp.handler - 549 - INFO - process-replication-data-48623630 - Caught up with stream 'events' to 1939721422
2021-04-12 00:00:44,507 - synapse.replication.tcp.handler - 496 - INFO - replication_command_handler@7f0b5b2e2268 - Handling 'POSITION events event_persister-2 1939721422 1939721423'
2021-04-12 00:00:44,507 - synapse.replication.tcp.handler - 549 - INFO - process-replication-data-48623632 - Caught up with stream 'events' to 1939721423
2021-04-12 00:00:44,610 - synapse.state - 576 - INFO - Measure[resolve_state_groups_for_events]@7f09dc222840 - Resolving state for !xxx:domain with groups [596595428, 596513551]
2021-04-12 00:00:44,714 - synapse.state.v1 - 84 - INFO - Measure[state._resolve_events]@7f09dc222d68 - Asking for 104/104 conflicted events
2021-04-12 00:00:44,715 - synapse.state.v1 - 118 - INFO - Measure[state._resolve_events]@7f09dc222d68 - Asking for 3/3 auth events

(Note that we are using Redis replication, even though that code lives in tcp/handler.py.)
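To quantify how often the worker enters state resolution, the log lines above can be tallied per second and per room. The following is a minimal sketch that assumes the log format is exactly as shown in the excerpt; the helper name `count_state_resolutions` and the regular expression are illustrative and not part of Synapse.

```python
import re
from collections import Counter

# Matches the "Resolving state for <room> with groups [...]" lines emitted by
# the synapse.state logger, assuming the format shown in the excerpt above.
RESOLVE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} - synapse\.state - .*"
    r"Resolving state for (?P<room>\S+) with groups \[(?P<groups>[^\]]*)\]"
)


def count_state_resolutions(lines):
    """Return a Counter mapping (timestamp to the second, room_id) -> count."""
    counts = Counter()
    for line in lines:
        match = RESOLVE_RE.search(line)
        if match:
            counts[(match.group("ts"), match.group("room"))] += 1
    return counts


# Two lines copied from the excerpt above; only the first is a resolution.
sample_lines = [
    "2021-04-12 00:00:44,610 - synapse.state - 576 - INFO - "
    "Measure[resolve_state_groups_for_events]@7f09dc222840 - "
    "Resolving state for !xxx:domain with groups [596595428, 596513551]",
    "2021-04-12 00:00:44,714 - synapse.state.v1 - 84 - INFO - "
    "Measure[state._resolve_events]@7f09dc222d68 - "
    "Asking for 104/104 conflicted events",
]

sample_counts = count_state_resolutions(sample_lines)
```

Running the same function over the full worker log (for example, `count_state_resolutions(open(path))`, where `path` is wherever the log is stored) would show whether a single room dominates the resolutions.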

So it seems that the user directory is listening to the events stream (I think), in addition to the current_state_deltas stream:

max_pos, deltas = await self.store.get_current_state_deltas(
    self.pos, room_max_stream_ordering
)

Ideally the user directory would just accept membership updates from other worker processes without needing to perform state resolution itself in the meantime.
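As a rough illustration of that idea, here is a minimal sketch of a consumer that applies already-resolved membership deltas to a search index and never touches state resolution. Everything in it is hypothetical: `StateDelta`, `DirectoryIndex`, and the assumption that each delta carries the membership value directly are simplifications for illustration, not Synapse's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class StateDelta:
    """Hypothetical shape of one row from a current-state delta stream."""

    stream_id: int
    room_id: str
    event_type: str
    state_key: str
    membership: Optional[str]  # "join", "leave", ...; None if not a membership event


class DirectoryIndex:
    """Toy in-memory stand-in for the user directory search tables."""

    def __init__(self) -> None:
        self.pos = 0  # last stream position that has been applied
        self.rooms_by_user: Dict[str, Set[str]] = {}

    def apply_deltas(self, deltas: Iterable[StateDelta]) -> None:
        """Apply deltas in stream order, skipping any already processed."""
        for delta in sorted(deltas, key=lambda d: d.stream_id):
            if delta.stream_id <= self.pos:
                continue
            # Only membership changes matter here; other state is ignored.
            if delta.event_type == "m.room.member":
                rooms = self.rooms_by_user.setdefault(delta.state_key, set())
                if delta.membership == "join":
                    rooms.add(delta.room_id)
                else:
                    rooms.discard(delta.room_id)
                    if not rooms:
                        del self.rooms_by_user[delta.state_key]
            self.pos = delta.stream_id
```

The point of the sketch is that the consumer only needs the outcome of state resolution (the delta), which another worker has already computed, so its cost scales with the number of deltas rather than with the size of a room's state.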

Metadata

Assignees

No one assigned

    Labels

    S-Major: Major functionality / product severely impaired, no satisfactory workaround.
    T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
