-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
run_background_updates
doesn't give up on faulty background updates, opting instead to fill the disk with StopIteration
exceptions #12780
Description
This part of run_background_updates
:
synapse/synapse/storage/background_updates.py
Lines 278 to 290 in 2e6068b
while self.enabled: | |
try: | |
result = await self.do_next_background_update(sleep) | |
except Exception: | |
logger.exception("Error doing update") | |
else: | |
if result: | |
logger.info( | |
"No more background updates to do." | |
" Unscheduling background update task." | |
) | |
self._all_done = True | |
return None |
will repeatedly run a faulty background update until the disk fills up from the log spam.
For a simple reproduction case, apply this:
diff --git a/synapse/storage/background_updates.py b/synapse/storage/background_updates.py
index 3cd4a78f0..0db89d52b 100644
--- a/synapse/storage/background_updates.py
+++ b/synapse/storage/background_updates.py
@@ -393,7 +393,7 @@ class BackgroundUpdater:
# We have a background update to run, otherwise we would have returned
# early.
assert self._current_background_update is not None
- update_info = self._background_update_handlers[self._current_background_update]
+ update_info = self._background_update_handlers[self._current_background_update + 'x']
async with self._get_context_manager_for_update(
sleep=sleep,
then try and run a trial test.
You'll get reams and reams of:
Traceback (most recent call last):
File "/home/rei/work/synapse/synapse/storage/background_updates.py", line 280, in run_background_updates
result = await self.do_next_background_update(sleep)
File "/home/rei/work/synapse/synapse/storage/background_updates.py", line 396, in do_next_background_update
update_info = self._background_update_handlers[self._current_background_update + 'x']
KeyError: 'add_rooms_room_version_columnx'
2022-05-18 11:25:06,177 - synapse.storage.background_updates - 282 - ERROR - run_bg_updates - Error doing update
Traceback (most recent call last):
File "/home/rei/.cache/pypoetry/virtualenvs/matrix-synapse-0QZlT1cR-py3.10/lib/python3.10/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
result = current_context.run(gen.send, result)
StopIteration: [{'update_name': 'add_rooms_room_version_column', 'depends_on': None}, {'update_name': 'current_state_events_membership', 'depends_on': None}, {'update_name': 'delete_old_current_state_events', 'depends_on': 'current_state_events_membership'}, {'update_name': 'devices_last_seen', 'depends_on': None}, {'update_name': 'event_store_labels', 'depends_on': None}, {'update_name': 'insert_room_retention', 'depends_on': None}, {'update_name': 'populate_stats_process_users', 'depends_on': 'populate_stats_process_rooms'}, {'update_name': 'redactions_have_censored_ts_idx', 'depends_on': None}, {'update_name': 'redactions_received_ts', 'depends_on': None}, {'update_name': 'remove_tombstoned_rooms_from_directory', 'depends_on': None}, {'update_name': 'room_membership_forgotten_idx', 'depends_on': None}, {'update_name': 'state_groups_room_id_idx', 'depends_on': None}, {'update_name': 'users_set_deactivated_flag', 'depends_on': None}, {'update_name': 'remove_dup_outbound_pokes', 'depends_on': None}, {'update_name': 'populate_stats_process_rooms', 'depends_on': None}, {'update_name': 'users_have_local_media', 'depends_on': None}, {'update_name': 'e2e_cross_signing_keys_idx', 'depends_on': None}, {'update_name': 'user_external_ids_user_id_idx', 'depends_on': None}, {'update_name': 'rejected_events_metadata', 'depends_on': None}, {'update_name': 'chain_cover', 'depends_on': 'rejected_events_metadata'}, {'update_name': 'remove_deactivated_pushers', 'depends_on': None}, {'update_name': 'remove_stale_pushers', 'depends_on': None}, {'update_name': 'purged_chain_cover', 'depends_on': None}, {'update_name': 'populate_rooms_creator_column', 'depends_on': None}, {'update_name': 'remove_deleted_email_pushers', 'depends_on': None}, {'update_name': 'presence_stream_not_offline_index', 'depends_on': None}, {'update_name': 'remove_hidden_devices_from_device_inbox', 'depends_on': None}, {'update_name': 'local_group_updates_index', 'depends_on': None}, {'update_name': 'remove_deleted_devices_from_device_inbox', 'depends_on': None}, {'update_name': 'event_arbitrary_relations', 'depends_on': None}, {'update_name': 'remove_dead_devices_from_device_inbox', 'depends_on': None}, {'update_name': 'delete_account_data_for_deactivated_users', 'depends_on': None}, {'update_name': 'event_search_sqlite_delete_non_strings', 'depends_on': None}, {'update_name': 'state_group_edges_unique_idx', 'depends_on': None}]
During handling of the above exception, another exception occurred:
(repeated for a very long time..)
My PR #12687 triggers this but only in SyTest and only in CI, for whatever reason. The problem is that filling up the disk is a dreadful failure mode for CI, as that prevents you from getting access to the logs. (The best approach I've found is to run a script in parallel that will kill SyTest after a fixed time.)
It's not clear to me that spinning on an error is the correct behaviour. Shouldn't we just bubble it up? Perhaps after a fixed number of retries (I'm not sure why it retries, but for the sake of it I could be tempted to only bail if there have been e.g. 5 back-to-back failures.)?
I don't really know what the StopIteration
is about; I suspect it's related to Twisted's _inlineCallbacks
though.