-
-
Notifications
You must be signed in to change notification settings - Fork 654
Fix segmentation fault in libgap function call #40594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix segmentation fault in libgap function call #40594
Conversation
@orlitzky Please test, thanks. |
Before the last commit, the code should be free of setjmp/longjmp bug, but it may have object lifetime bug. In particular,
I'm not entirely clear how object lifetime in GAP works, but in any case, this version should be safer than the old version. |
Documentation preview for this PR (built with commit a0e12f5; changes) is ready! 🎉 |
This also solves the problem and I think the explanation is believable too. |
Though one of these just popped up randomly after enough iterations with
|
It seems a quite simple way. We have an improve file,Then add you extend test into the file as a doctest @orlitzky |
This one does not use the setjmp/longjmp mechanism at all, so it's likely a different bug. On the other hand, it may be that the previous |
I had the same thought and deleted that test. It's been passing now for a few hours running continuously. You should probably add one or more HERE BE DRAGONS comments, but otherwise this makes one less problem for me. Thank you! |
looking at it more, it may be safe to, instead of mixing After all, in GAP's own signal handler (in GAP, |
@user202729 - Cool !!! |
It is much simpler and useful for me, too. Can we merge this fastly |
sagemathgh-40594: Fix segmentation fault in libgap function call Fixes sagemath#37026. As explained in https://trofi.github.io/posts/312-the-sagemath- saga.html, the relevant code are the following. ```c static PyObject *__pyx_pf_4sage_4libs_3gap_7element_19GapElement_Functio n_2__call__(struct __pyx_obj_4sage_4libs_3gap_7element_GapElement_Function *__pyx_v_self, PyObject *__pyx_v_args) { PyObject *__pyx_t_6 = NULL; sig_GAP_Enter(); __pyx_t_6 = __Pyx_GetItemInt_List(__pyx_v_a, 2, long, 1, __Pyx_PyInt_From_long, 1, 0, 1); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 2528, __pyx_L14_error) __pyx_v_result = GAP_CallFunc3Args(__pyx_v_self->__pyx_base.value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_5)->value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_4)->value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_6)->value); __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0; GAP_Leave(); __Pyx_XDECREF(__pyx_t_6); } ``` where ```c #define __Pyx_XDECREF(r) do { if((r) == NULL); else {__Pyx_DECREF(r); }} while(0) ``` so that simplifies to ```c t_6 = 0; if (!setjmp()) { t_6 = 1; GAP_CallFunc3Args(); // longjmp inside here t_6 = 0; } if (t_6 != 0) access(t_6); ``` The "easy" fix is to declare `t_6` volatile, but we cannot do that since it's a Cython-generated temporary variable. A better way is to avoid mutating `t_6` between `setjmp()` and `longjmp()`. We look at the original code to see why `t_6` is being mutated in the first place. It's because we're accessing `a[i]`, where `a` is a Python list, so Cython needs to store the result into a Python-reference-counted temporary variable. By changing `a` from a Python list to a C array, the new code becomes ```c sig_GAP_Enter(); __pyx_t_12 = sig_on(); if (unlikely(__pyx_t_12 == ((int)0))) __PYX_ERR(0, 2514, __pyx_L9_error) switch (__pyx_v_n) { case 0: __pyx_v_result = GAP_CallFunc0Args(__pyx_v_self->__pyx_base.value); break; case 1: __pyx_v_result = GAP_CallFunc1Args(__pyx_v_self->__pyx_base.value, (__pyx_v_a[0])); break; case 2: __pyx_v_result = GAP_CallFunc2Args(__pyx_v_self->__pyx_base.value, (__pyx_v_a[0]), (__pyx_v_a[1])); break; case 3: __pyx_v_result = GAP_CallFunc3Args(__pyx_v_self->__pyx_base.value, (__pyx_v_a[0]), (__pyx_v_a[1]), (__pyx_v_a[2])); break; default: __pyx_v_result = GAP_CallFuncList(__pyx_v_self->__pyx_base.value, __pyx_v_arg_list); break; } sig_off(); } GAP_Leave(); goto __pyx_L10; } ``` Moral of the story: **do not touch any Python object within `sig_on()`...`sig_off()`!** The [cysignals documentation](https://cysignals.readthedocs.io/en/latest /interrupt.html) also warned about this: > The code inside `sig_on()` should be pure C or Cython code. If you call any Python code or manipulate any Python object (even something trivial like `x = []`), an interrupt can mess up Python’s internal state. When in doubt, try to use sig_check() instead. ------ This does not violate the GAP API either. ``` Code which uses the GAP API exposed by this header file should sandwich any such calls between uses of the GAP_Enter() and GAP_Leave() macro as follows: int ok = GAP_Enter(); if (ok) { ... // any number of calls to GAP APIs } GAP_Leave(); This is in particular crucial if your code keeps references to any GAP functions in local variables: Calling GAP_Enter() ensures that GAP is aware of such references, and will not garbage collect the referenced objects. Failing to use these macros properly can lead to crashes, or worse, silent memory corruption. You have been warned! ``` ### 📝 Checklist <!-- Put an `x` in all the boxes that apply. --> - [ ] The title is concise and informative. - [ ] The description explains in detail what this PR is about. - [ ] I have linked a relevant issue or discussion. - [ ] I have created tests covering the changes. - [ ] I have updated the documentation and checked the documentation preview. ### ⌛ Dependencies <!-- List all open PRs that this PR logically depends on. For example, --> <!-- - sagemath#12345: short description why this is a dependency --> <!-- - sagemath#34567: ... --> URL: sagemath#40594 Reported by: user202729 Reviewer(s):
With this PR applied to an (otherwise clean) 10.7 source, I am still getting segfaults, although it seems to be a different segfault. The way to trigger the segfaults is:
Here's the last part of the backtrace:
NB: before the PR, I was still getting this segfault mixed with the one in I won't have time to debug this, but perhaps with this information a similar fix to the one in this PR will solve it @user202729 ? |
Try #40613 then. |
sagemathgh-40613: Fix Ctrl-C segfaults in sage.libs.gap Replace cysignals' `sig_on()` and `sig_off()` with custom `gap_sig_on()` and `gap_sig_off()` wrappers that use GAP's own `SIGINT` handling. This fixes an uncommon, but ultimately reproducible segfault when Ctrl-C is used to interrupt a GAP computation. In concert with sagemath#40594 this has allowed `sage/libs/gap/element.pyx` to pass many thousands of test iterations. ### Fixes * sagemath#40598 ### Dependencies * sagemath#40594 URL: sagemath#40613 Reported by: Michael Orlitzky Reviewer(s): Dima Pasechnik, Michael Orlitzky, user202729
Fixes #37026.
As explained in https://trofi.github.io/posts/312-the-sagemath-saga.html, the relevant code are the following.
where
so that simplifies to
The "easy" fix is to declare
t_6
volatile, but we cannot do that since it's a Cython-generated temporary variable.A better way is to avoid mutating
t_6
betweensetjmp()
andlongjmp()
.We look at the original code to see why
t_6
is being mutated in the first place. It's because we're accessinga[i]
, wherea
is a Python list, so Cython needs to store the result into a Python-reference-counted temporary variable.By changing
a
from a Python list to a C array, the new code becomesMoral of the story: do not touch any Python object within
sig_on()
...sig_off()
!The cysignals documentation also warned about this:
This does not violate the GAP API either.
📝 Checklist
⌛ Dependencies