This repository was archived by the owner on Apr 26, 2024. It is now read-only.

Synapse workers Complement image results in flaky tests due to inconsistent worker process init #10065

@anoadragon453


The worker version of Synapse running in Complement (as described here) currently uses Supervisor as an init system to start all worker processes in the container. See the current config template we're using.

This works, and eventually all processes start up. However, Complement decides that a homeserver is ready for testing once it responds successfully to a GET /_matrix/client/versions call. A worker that has already started may answer this request while other workers are still starting up. This inconsistency can lead to test failures: Complement calls a different endpoint that should be handled by a worker that hasn't started yet, nginx returns a 502, and the test fails.

The result of this is flaky Complement tests - which nobody wants.

I believe the solution is to start groups of processes in the container through a priority system. The next group should only be started once the previous one has successfully responded to healthchecks (indicating that its processes are ready to receive connections):

  1. Redis
  2. the main Synapse process
    • thus all database migrations are handled before any workers start up.
  3. all worker Synapse processes
  4. nginx

(Note that caddy [just used for custom CA stuff] and Postgres are started before even Supervisor is.) nginx is the reverse proxy that actually routes Matrix requests to the appropriate Synapse process, so by starting it at the very end, Complement will not receive a successful response to /_matrix/client/versions until everything else has started.
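As a rough illustration of the ordering above, here is a minimal sketch of a staged startup loop. The names `Stage` and `start_in_order`, and the `start` / `is_ready` callables, are hypothetical and not part of Synapse, Complement or Supervisor; how each group is actually launched and probed is left open.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Stage:
    """One startup group (hypothetical): e.g. "redis", "main", "workers", "nginx"."""
    name: str
    start: Callable[[], None]     # assumed: launches every process in the group
    is_ready: Callable[[], bool]  # assumed: True once the group passes its healthcheck


def start_in_order(stages: Sequence[Stage], timeout: float = 60.0,
                   interval: float = 0.5) -> None:
    """Start each stage only after the previous one reports ready."""
    for stage in stages:
        stage.start()
        deadline = time.monotonic() + timeout
        # Block here until this group is healthy; later groups are not started yet.
        while not stage.is_ready():
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"stage {stage.name!r} was not ready after {timeout}s")
            time.sleep(interval)
```

With the stages listed as Redis, main process, workers, nginx, nothing is reachable through the reverse proxy until every earlier group has reported ready.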

Initially I had hoped to use systemd as an init system to replace Supervisor, but systemd apparently doesn't work in docker containers. Additionally, we need each process to output its logs to stdout, as otherwise Complement won't be able to display the homeserver logs after a test failure. systemd would make this a bit tricky, as it tries to capture logs. Currently this works by having Supervisor redirect all process logs to stdout, which the ENTRYPOINT of the docker container, configure_workers_and_start.py, simply relays.
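For reference, relaying a child process's logs to stdout without Supervisor could look roughly like the sketch below. This is not what configure_workers_and_start.py does today; `relay` and `run_with_relay` are hypothetical helpers, and the per-line name prefix is an assumption added to keep interleaved logs readable.

```python
import subprocess
import sys
import threading
from typing import IO, Sequence


def relay(prefix: str, stream: IO[str], out: IO[str]) -> None:
    """Copy each line from a child's output to `out`, tagged with the process name."""
    for line in stream:
        out.write(f"{prefix} | {line}")
        out.flush()


def run_with_relay(name: str, argv: Sequence[str], out: IO[str] = sys.stdout) -> int:
    """Run `argv`, forwarding its stdout and stderr to `out`; return the exit code."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same relayed stream
        text=True,
    )
    assert proc.stdout is not None
    reader = threading.Thread(target=relay, args=(name, proc.stdout, out))
    reader.start()
    code = proc.wait()
    reader.join()  # make sure the last lines are flushed before returning
    return code
```

If no prefixing is needed, leaving `stdout` and `stderr` unset in `subprocess.Popen` makes the child inherit the entrypoint's streams directly, which is simpler still.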

I don't believe we want to use synctl here, as the team has been trying to phase that out for a while now. We could simply do all of this via subprocess in configure_workers_and_start.py, but I'm hoping there's a better, less manual way. Any ideas? I'd also love to be proven wrong about whether Supervisor can actually do the following:

  • Wait for another process to start up before starting the next.
  • Healthchecks using HTTP.

@richvdh and @erikjohnston also mentioned that Synapse has a way to signal to processes that it's ready to receive connections (that may potentially be better than just polling the /health endpoint), which may be useful for this discussion. Edit: I've just had a look, and it looks like we use a systemd-specific method called sdnotify, which won't be useful here unfortunately.


Labels

  • A-Testing: Issues related to testing in complement, synapse, etc
  • A-Workers: Problems related to running Synapse in Worker Mode (or replication)
  • T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
