This repository was archived by the owner on Apr 26, 2024. It is now read-only.

workers stop working after elevated traffic #2738

@turt2live

Description

Nothing in the logs indicates a problem, but there's circumstantial evidence that when synapse receives higher than normal traffic, the federation_sender can stop working (no activity) and therefore stop federating with remote servers. The federation_sender logs don't show anything out of the ordinary; it just stops sending requests. The main synapse process complains about the events stream falling behind, but that doesn't seem to cause problems until about 12 minutes later.

This has happened about 10 times in the past to t2bot.io, and each time the rate of persisted events was elevated (double its normal rate) before the federation_sender stopped working. For t2bot.io "normal" is defined as 2-3Hz. Each time the federation_sender has stopped, events were being persisted at >6Hz (this latest time at ~6-10Hz).

Here's the timeline for the problem (in UTC):

  • 01:56:09 Synapse crosses the 6Hz persisted events line
  • 03:07:28 The main synapse process started complaining that the events stream was falling behind
  • 03:10:03 Synapse falls below the 6Hz persisted events line
  • 03:19:56 The federation_sender officially stopped working
  • 04:27:22 The entire stack was restarted, restoring federation
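
For reference, the gaps between the timeline entries above can be worked out with a small script. This is only a convenience sketch: the timestamps are copied from the list, the dictionary keys are labels made up for this example, and it assumes every entry falls on the same UTC day.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline; the key names are made up
# for this sketch and do not come from synapse.
TIMELINE = {
    "crossed_6hz": "01:56:09",
    "stream_warnings_start": "03:07:28",
    "dropped_below_6hz": "03:10:03",
    "sender_stopped": "03:19:56",
    "stack_restarted": "04:27:22",
}


def elapsed(start, end):
    """Return the time between two timeline entries as a timedelta."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(TIMELINE[end], fmt)
            - datetime.strptime(TIMELINE[start], fmt))


# Time spent above the 6Hz line.
print(elapsed("crossed_6hz", "dropped_below_6hz"))         # 1:13:54
# Delay between the first stream warning and the sender stopping.
print(elapsed("stream_warnings_start", "sender_stopped"))  # 0:12:28
# How long outbound federation was down before the restart.
print(elapsed("sender_stopped", "stack_restarted"))        # 1:07:26
```

So the sender stopped roughly 12.5 minutes after the first warning, and outbound federation was down for a little over an hour.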

During this time the only error spat out was (repeated every few seconds):

homeserver - 2017-12-17 03:13:04,014 - twisted - 131 - CRITICAL - - 
Traceback (most recent call last):
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/resource.py", line 164, in on_notifier_poke
    updates, current_token = yield stream.get_updates()
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/streams.py", line 169, in get_updates
    updates, current_token = yield self.get_updates_since(self.last_token)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/streams.py", line 200, in get_updates_since
    raise Exception("stream %s has fallen behined" % (self.NAME))
Exception: stream current_state_deltas has fallen behined
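
To illustrate why an error like this could keep repeating, here is a toy model of a stream with a "too far behind" guard. It is not synapse's code: the class name `SketchStream`, the `max_behind` parameter, and the token handling are all assumptions made for this sketch. Only the method names `get_updates` and `get_updates_since`, the `last_token` attribute, and the stream name come from the traceback. In this model the token only advances on a successful read, so once the backlog exceeds the limit every later poll fails the same way, even after traffic returns to normal.

```python
class SketchStream(object):
    """Toy stream with a backlog guard (hypothetical, not synapse's code)."""

    NAME = "current_state_deltas"

    def __init__(self, max_behind):
        self.max_behind = max_behind  # assumed limit on unread rows
        self.rows = []                # list of (token, payload)
        self.last_token = 0           # last token handed to consumers

    def append(self, payload):
        # Tokens simply count up from 1 in this sketch.
        self.rows.append((len(self.rows) + 1, payload))

    def get_updates_since(self, from_token):
        pending = [row for row in self.rows if row[0] > from_token]
        if len(pending) >= self.max_behind:
            raise Exception("stream %s has fallen behind" % (self.NAME,))
        current_token = self.rows[-1][0] if self.rows else from_token
        return pending, current_token

    def get_updates(self):
        updates, current_token = self.get_updates_since(self.last_token)
        # Only reached when the guard above did not raise.
        self.last_token = current_token
        return updates


stream = SketchStream(max_behind=3)
stream.append("a")
stream.append("b")
print(len(stream.get_updates()))  # 2

# A burst arrives before the next poll, exceeding the assumed limit.
for payload in "cdefg":
    stream.append(payload)

for _ in range(2):
    try:
        stream.get_updates()
    except Exception as exc:
        # Same error on every poll, because last_token never moves.
        print(exc, stream.last_token)
```

Whether the real stream behaves like this is an assumption; the sketch only shows one way a guard of this kind could get stuck.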

Further, during this time incoming federation was unaffected. Synapse was still processing events and passing them along to appservices. Only outbound federation was affected.

More in-depth logs are available upon request.

Version information

  • Homeserver: t2bot.io
  • Version: 0.26.0-rc1
  • Install method: pip
  • Platform: container, Ubuntu host

Metadata

Labels

  • A-Workers: Problems related to running Synapse in Worker Mode (or replication)
  • z-bug: (Deprecated Label)