Skip to content
This repository was archived by the owner on Apr 26, 2024. It is now read-only.
This repository was archived by the owner on Apr 26, 2024. It is now read-only.

run_background_updates doesn't give up on faulty background updates, opting instead to fill the disk with StopIteration exceptions #12780

@reivilibre

Description

@reivilibre

This part of run_background_updates:

while self.enabled:
try:
result = await self.do_next_background_update(sleep)
except Exception:
logger.exception("Error doing update")
else:
if result:
logger.info(
"No more background updates to do."
" Unscheduling background update task."
)
self._all_done = True
return None

will repeatedly run a faulty background update until the disk fills up from the log spam.

For a simple reproduction case, apply this:

diff --git a/synapse/storage/background_updates.py b/synapse/storage/background_updates.py
index 3cd4a78f0..0db89d52b 100644
--- a/synapse/storage/background_updates.py
+++ b/synapse/storage/background_updates.py
@@ -393,7 +393,7 @@ class BackgroundUpdater:
         # We have a background update to run, otherwise we would have returned
         # early.
         assert self._current_background_update is not None
-        update_info = self._background_update_handlers[self._current_background_update]
+        update_info = self._background_update_handlers[self._current_background_update + 'x']
 
         async with self._get_context_manager_for_update(
             sleep=sleep,

then try and run a trial test.
You'll get reams and reams of:

Traceback (most recent call last):
  File "/home/rei/work/synapse/synapse/storage/background_updates.py", line 280, in run_background_updates
    result = await self.do_next_background_update(sleep)
  File "/home/rei/work/synapse/synapse/storage/background_updates.py", line 396, in do_next_background_update
    update_info = self._background_update_handlers[self._current_background_update + 'x']
KeyError: 'add_rooms_room_version_columnx'
2022-05-18 11:25:06,177 - synapse.storage.background_updates - 282 - ERROR - run_bg_updates - Error doing update
Traceback (most recent call last):
  File "/home/rei/.cache/pypoetry/virtualenvs/matrix-synapse-0QZlT1cR-py3.10/lib/python3.10/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
StopIteration: [{'update_name': 'add_rooms_room_version_column', 'depends_on': None}, {'update_name': 'current_state_events_membership', 'depends_on': None}, {'update_name': 'delete_old_current_state_events', 'depends_on': 'current_state_events_membership'}, {'update_name': 'devices_last_seen', 'depends_on': None}, {'update_name': 'event_store_labels', 'depends_on': None}, {'update_name': 'insert_room_retention', 'depends_on': None}, {'update_name': 'populate_stats_process_users', 'depends_on': 'populate_stats_process_rooms'}, {'update_name': 'redactions_have_censored_ts_idx', 'depends_on': None}, {'update_name': 'redactions_received_ts', 'depends_on': None}, {'update_name': 'remove_tombstoned_rooms_from_directory', 'depends_on': None}, {'update_name': 'room_membership_forgotten_idx', 'depends_on': None}, {'update_name': 'state_groups_room_id_idx', 'depends_on': None}, {'update_name': 'users_set_deactivated_flag', 'depends_on': None}, {'update_name': 'remove_dup_outbound_pokes', 'depends_on': None}, {'update_name': 'populate_stats_process_rooms', 'depends_on': None}, {'update_name': 'users_have_local_media', 'depends_on': None}, {'update_name': 'e2e_cross_signing_keys_idx', 'depends_on': None}, {'update_name': 'user_external_ids_user_id_idx', 'depends_on': None}, {'update_name': 'rejected_events_metadata', 'depends_on': None}, {'update_name': 'chain_cover', 'depends_on': 'rejected_events_metadata'}, {'update_name': 'remove_deactivated_pushers', 'depends_on': None}, {'update_name': 'remove_stale_pushers', 'depends_on': None}, {'update_name': 'purged_chain_cover', 'depends_on': None}, {'update_name': 'populate_rooms_creator_column', 'depends_on': None}, {'update_name': 'remove_deleted_email_pushers', 'depends_on': None}, {'update_name': 'presence_stream_not_offline_index', 'depends_on': None}, {'update_name': 'remove_hidden_devices_from_device_inbox', 'depends_on': None}, {'update_name': 'local_group_updates_index', 'depends_on': None}, {'update_name': 'remove_deleted_devices_from_device_inbox', 'depends_on': None}, {'update_name': 'event_arbitrary_relations', 'depends_on': None}, {'update_name': 'remove_dead_devices_from_device_inbox', 'depends_on': None}, {'update_name': 'delete_account_data_for_deactivated_users', 'depends_on': None}, {'update_name': 'event_search_sqlite_delete_non_strings', 'depends_on': None}, {'update_name': 'state_group_edges_unique_idx', 'depends_on': None}]

During handling of the above exception, another exception occurred:

(repeated for a very long time..)

My PR #12687 triggers this but only in SyTest and only in CI, for whatever reason. The problem is that filling up the disk is a dreadful failure mode for CI, as that prevents you from getting access to the logs. (The best approach I've found is to run a script in parallel that will kill SyTest after a fixed time.)

It's not clear to me that spinning on an error is the correct behaviour. Shouldn't we just bubble it up? Perhaps after a fixed number of retries (I'm not sure why it retries, but for the sake of it I could be tempted to only bail if there have been e.g. 5 back-to-back failures.)?

I don't really know what the StopIteration is about; I suspect it's related to Twisted's _inlineCallbacks though.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions