Skip to content

Conversation

untitaker
Copy link
Member

This will cause the sentry SDK, if initialized, to pick up the crash and
send it to sentry.

signals/KeyboardInterrupt shouldn't be caught by except Exception:

This will cause the sentry SDK, if initialized, to pick up the crash and
send it to sentry.

signals/KeyboardInterrupt shouldn't be caught by `except Exception:`
@@ -132,7 +132,7 @@ def run(self) -> None:

self._shutdown()
except Exception as error:
logger.info("Caught %r, shutting down...", error)
logger.exception("Caught exception, shutting down...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not need to log this as an exception here as we are raising it again at line 145. What was the issue ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for some reason i don't fully understand, sys.excepthook is broken in the child process. i suspect that multiprocessing.Pool overwrites it to propagate the exception to the parent process.

for some other reason, the exception that makes the parent process shut down is also not reported.

now i can choose where i want to instrument: do I want to fix instrumentation in the parent process, where the exception stacktrace is broken and local variables are missing (because multiprocessing is pickling the exception to make it cross process boundaries), or in the child process where i risk losing errors if the consumer crash-loops in the parent process?

our answer for celery integration is that we instrument in the child process because there we get better contextual data.

I think we might want to instrument both processes just to be safe, but it would inherently mean duplicated error reports, unless we remove the raise in the child process

@untitaker
Copy link
Member Author

@fpacifici I investigated three parts wrt arroyo multiprocessing:

  • Why does a crash of the main process not report sentry errors. I still don't have an answer to that and I don't expect to find one. I definitely know that something is broken here but i cannot reproduce it in a clean dev environment.
  • What can we do to get the best stacktrace and contextual data when process_message crashes in the metrics string indexer. I think this PR is the answer and it should be merged.
  • Why don't metrics get reported in subprocesses. Fix is in fix(metrics): Initialize arroyo all the time, so it runs in every process [sns-1622] sentry#37950

I would suggest we investigate double-sending of errors when it becomes a practical concern. In my experience of writing the celery integration, it is tricky to proactively suppress double-reporting of errors without risking losing some insight into what is going on.

Copy link
Member

@nikhars nikhars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me. I agree that having the ability to catch the error has more value than duplicate exceptions being reported.

@untitaker untitaker merged commit 59743dc into main Aug 31, 2022
@untitaker untitaker deleted the fix/log-exception branch August 31, 2022 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants