Skip to content

Conversation

user202729
Copy link
Contributor

@user202729 user202729 commented Aug 15, 2025

Fixes #37026.

As explained in https://trofi.github.io/posts/312-the-sagemath-saga.html, the relevant code are the following.

static PyObject *__pyx_pf_4sage_4libs_3gap_7element_19GapElement_Function_2__call__(struct __pyx_obj_4sage_4libs_3gap_7element_GapElement_Function *__pyx_v_self, PyObject *__pyx_v_args) {

  PyObject *__pyx_t_6 = NULL;

    sig_GAP_Enter();

      __pyx_t_6 = __Pyx_GetItemInt_List(__pyx_v_a, 2, long, 1, __Pyx_PyInt_From_long, 1, 0, 1); if (unlikely(!__pyx_t_6)) __PYX_ERR(0, 2528, __pyx_L14_error)
      __pyx_v_result = GAP_CallFunc3Args(__pyx_v_self->__pyx_base.value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_5)->value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_4)->value, ((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_6)->value);
      __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;

      GAP_Leave();
  __Pyx_XDECREF(__pyx_t_6);
}

where

  #define __Pyx_XDECREF(r)  do { if((r) == NULL); else {__Pyx_DECREF(r); }} while(0)

so that simplifies to

t_6 = 0;
if (!setjmp()) {
	t_6 = 1;
	GAP_CallFunc3Args(); // longjmp inside here
	t_6 = 0;
}
if (t_6 != 0) access(t_6);

The "easy" fix is to declare t_6 volatile, but we cannot do that since it's a Cython-generated temporary variable.

A better way is to avoid mutating t_6 between setjmp() and longjmp().

We look at the original code to see why t_6 is being mutated in the first place. It's because we're accessing
a[i], where a is a Python list, so Cython needs to store the result into a Python-reference-counted temporary variable.

By changing a from a Python list to a C array, the new code becomes

    sig_GAP_Enter();
    __pyx_t_12 = sig_on(); if (unlikely(__pyx_t_12 == ((int)0))) __PYX_ERR(0, 2514, __pyx_L9_error)
    switch (__pyx_v_n) {
      case 0:
      __pyx_v_result = GAP_CallFunc0Args(__pyx_v_self->__pyx_base.value);
      break;
      case 1:
      __pyx_v_result = GAP_CallFunc1Args(__pyx_v_self->__pyx_base.value, (__pyx_v_a[0]));
      break;
      case 2:
      __pyx_v_result = GAP_CallFunc2Args(__pyx_v_self->__pyx_base.value, (__pyx_v_a[0]), (__pyx_v_a[1]));
      break;
      case 3:
      __pyx_v_result = GAP_CallFunc3Args(__pyx_v_self->__pyx_base.value, (__pyx_v_a[0]), (__pyx_v_a[1]), (__pyx_v_a[2]));
      break;
      default:
      __pyx_v_result = GAP_CallFuncList(__pyx_v_self->__pyx_base.value, __pyx_v_arg_list);
      break;
    }
    sig_off();
  }
      GAP_Leave();
      goto __pyx_L10;
    }

Moral of the story: do not touch any Python object within sig_on()...sig_off()!

The cysignals documentation also warned about this:

The code inside sig_on() should be pure C or Cython code. If you call any Python code or manipulate any Python object (even something trivial like x = []), an interrupt can mess up Python’s internal state. When in doubt, try to use sig_check() instead.


This does not violate the GAP API either.

Code which uses the GAP API exposed by this header file should sandwich
any such calls between uses of the GAP_Enter() and GAP_Leave() macro as
follows:

int ok = GAP_Enter();
if (ok) {
    ... // any number of calls to GAP APIs
}
GAP_Leave();

This is in particular crucial if your code keeps references to any GAP
functions in local variables: Calling GAP_Enter() ensures that GAP is
aware of such references, and will not garbage collect the referenced
objects. Failing to use these macros properly can lead to crashes, or
worse, silent memory corruption. You have been warned!

📝 Checklist

  • The title is concise and informative.
  • The description explains in detail what this PR is about.
  • I have linked a relevant issue or discussion.
  • I have created tests covering the changes.
  • I have updated the documentation and checked the documentation preview.

⌛ Dependencies

@user202729
Copy link
Contributor Author

@orlitzky Please test, thanks.

@user202729 user202729 marked this pull request as draft August 15, 2025 19:43
@user202729 user202729 marked this pull request as ready for review August 15, 2025 19:45
@user202729
Copy link
Contributor Author

Before the last commit, the code should be free of setjmp/longjmp bug, but it may have object lifetime bug.

In particular, GapElement.__dealloc__ is called much before CallFunc<x>Args are called, which leads to dereference_obj gets called. The effect of this function is the following

cdef void reference_obj(Obj obj) noexcept:
    """
    Reference ``obj``
    """
    cdef ObjWrapper wrapped = wrap_obj(obj)
    global owned_objects_refcount
#    print("reference_obj called "+ crepr(obj) +"\n")
    if wrapped in owned_objects_refcount:
        owned_objects_refcount[wrapped] += 1
    else:
        owned_objects_refcount[wrapped] = 1


cdef void dereference_obj(Obj obj) noexcept:
    """
    Reference ``obj``
    """
    cdef ObjWrapper wrapped = wrap_obj(obj)
    global owned_objects_refcount
    refcount = owned_objects_refcount.pop(wrapped)
    if refcount > 1:
        owned_objects_refcount[wrapped] = refcount - 1


cdef void gasman_callback() noexcept with gil:
    """
    Callback before each GAP garbage collection
    """
    global owned_objects_refcount
    for obj in owned_objects_refcount:
        GAP_MarkBag((<ObjWrapper>obj).value)

I'm not entirely clear how object lifetime in GAP works, but in any case, this version should be safer than the old version.

Copy link

Documentation preview for this PR (built with commit a0e12f5; changes) is ready! 🎉
This preview will update shortly after each push to this PR.

@orlitzky
Copy link
Contributor

This also solves the problem and I think the explanation is believable too.

@orlitzky
Copy link
Contributor

Though one of these just popped up randomly after enough iterations with sage -t. I shouldn't be surprised:

sage: G0 = libgap.SymplecticGroup(4,2) ## line 1321 ##
sage: P = G0.IsomorphismFpGroup().Range() ## line 1322 ##

**********************************************************************
Traceback (most recent call last):
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 2631, in __call__
    doctests, extras = self._run(runner, options, results)
                       ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 2679, in _run
    result = runner.run(test)
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 925, in run
    return self._run(test, compileflags, out)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 733, in _run
    self.compile_and_execute(example, compiler, test.globs)
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 1157, in compile_and_execute
    exec(compiled, globs)
    ~~~~^^^^^^^^^^^^^^^^^
  File "<doctest sage.libs.gap.element.GapElement.sage[23]>", line 1, in <module>
  File "sage/libs/gap/element.pyx", line 2646, in sage.libs.gap.element.GapElement_MethodProxy.__call__
  File "sage/libs/gap/element.pyx", line 2526, in sage.libs.gap.element.GapElement_Function.__call__
cysignals.signals.SignalError: Segmentation fault

@cxzhong
Copy link
Contributor

cxzhong commented Aug 15, 2025

Though one of these just popped up randomly after enough iterations with sage -t. I shouldn't be surprised:

sage: G0 = libgap.SymplecticGroup(4,2) ## line 1321 ##
sage: P = G0.IsomorphismFpGroup().Range() ## line 1322 ##

**********************************************************************
Traceback (most recent call last):
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 2631, in __call__
    doctests, extras = self._run(runner, options, results)
                       ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 2679, in _run
    result = runner.run(test)
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 925, in run
    return self._run(test, compileflags, out)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 733, in _run
    self.compile_and_execute(example, compiler, test.globs)
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 1157, in compile_and_execute
    exec(compiled, globs)
    ~~~~^^^^^^^^^^^^^^^^^
  File "<doctest sage.libs.gap.element.GapElement.sage[23]>", line 1, in <module>
  File "sage/libs/gap/element.pyx", line 2646, in sage.libs.gap.element.GapElement_MethodProxy.__call__
  File "sage/libs/gap/element.pyx", line 2526, in sage.libs.gap.element.GapElement_Function.__call__
cysignals.signals.SignalError: Segmentation fault

It seems a quite simple way. We have an improve file,Then add you extend test into the file as a doctest @orlitzky

@user202729
Copy link
Contributor Author

Though one of these just popped up randomly after enough iterations with sage -t. I shouldn't be surprised:

sage: G0 = libgap.SymplecticGroup(4,2) ## line 1321 ##
sage: P = G0.IsomorphismFpGroup().Range() ## line 1322 ##

**********************************************************************
Traceback (most recent call last):
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 2631, in __call__
    doctests, extras = self._run(runner, options, results)
                       ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 2679, in _run
    result = runner.run(test)
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 925, in run
    return self._run(test, compileflags, out)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 733, in _run
    self.compile_and_execute(example, compiler, test.globs)
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mjo/.local/lib/python3.13/site-packages/sage/doctest/forker.py", line 1157, in compile_and_execute
    exec(compiled, globs)
    ~~~~^^^^^^^^^^^^^^^^^
  File "<doctest sage.libs.gap.element.GapElement.sage[23]>", line 1, in <module>
  File "sage/libs/gap/element.pyx", line 2646, in sage.libs.gap.element.GapElement_MethodProxy.__call__
  File "sage/libs/gap/element.pyx", line 2526, in sage.libs.gap.element.GapElement_Function.__call__
cysignals.signals.SignalError: Segmentation fault

This one does not use the setjmp/longjmp mechanism at all, so it's likely a different bug.

On the other hand, it may be that the previous ensure_interruptible_after corrupts some memory in GAP, but this is an different (harder to fix) issue. To debug it, I would need the element.pyx.c file, element.cpython-311-x86_64-linux-gnu.so file, and a gdb backtrace. (to get the back trace, you may find #39079 useful.)

@user202729 user202729 requested a review from dimpase August 16, 2025 02:48
@orlitzky
Copy link
Contributor

On the other hand, it may be that the previous ensure_interruptible_after corrupts some memory in GAP, but this is an different (harder to fix) issue. To debug it, I would need the element.pyx.c file, element.cpython-311-x86_64-linux-gnu.so file, and a gdb backtrace. (to get the back trace, you may find #39079 useful.)

I had the same thought and deleted that test. It's been passing now for a few hours running continuously. You should probably add one or more HERE BE DRAGONS comments, but otherwise this makes one less problem for me. Thank you!

@user202729
Copy link
Contributor Author

user202729 commented Aug 16, 2025

looking at it more, it may be safe to, instead of mixing sig_on... sig_off with GAP_Enter and GAP_Leave, just use InterruptExecStat instead. But that's the topic for another pull request.

After all, in GAP's own signal handler syAnswerIntr, it uses precisely this mechanism.

(in GAP, Stat means statements, not statistics.)

@dimpase
Copy link
Member

dimpase commented Aug 16, 2025

@user202729 - Cool !!!

@cxzhong
Copy link
Contributor

cxzhong commented Aug 16, 2025

It is much simpler and useful for me, too. Can we merge this fastly

@user202729 user202729 added the p: CI Fix merged before running CI tests label Aug 19, 2025
vbraun pushed a commit to vbraun/sage that referenced this pull request Aug 21, 2025
sagemathgh-40594: Fix segmentation fault in libgap function call
    
Fixes sagemath#37026.

As explained in https://trofi.github.io/posts/312-the-sagemath-
saga.html, the relevant code are the following.

```c
static PyObject *__pyx_pf_4sage_4libs_3gap_7element_19GapElement_Functio
n_2__call__(struct
__pyx_obj_4sage_4libs_3gap_7element_GapElement_Function *__pyx_v_self,
PyObject *__pyx_v_args) {

  PyObject *__pyx_t_6 = NULL;

    sig_GAP_Enter();

      __pyx_t_6 = __Pyx_GetItemInt_List(__pyx_v_a, 2, long, 1,
__Pyx_PyInt_From_long, 1, 0, 1); if (unlikely(!__pyx_t_6)) __PYX_ERR(0,
2528, __pyx_L14_error)
      __pyx_v_result = GAP_CallFunc3Args(__pyx_v_self->__pyx_base.value,
((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement
*)__pyx_t_5)->value, ((struct
__pyx_obj_4sage_4libs_3gap_7element_GapElement *)__pyx_t_4)->value,
((struct __pyx_obj_4sage_4libs_3gap_7element_GapElement
*)__pyx_t_6)->value);
      __Pyx_DECREF(__pyx_t_6); __pyx_t_6 = 0;

      GAP_Leave();
  __Pyx_XDECREF(__pyx_t_6);
}
```

where

```c
  #define __Pyx_XDECREF(r)  do { if((r) == NULL); else {__Pyx_DECREF(r);
}} while(0)
```

so that simplifies to

```c
t_6 = 0;
if (!setjmp()) {
        t_6 = 1;
        GAP_CallFunc3Args(); // longjmp inside here
        t_6 = 0;
}
if (t_6 != 0) access(t_6);
```

The "easy" fix is to declare `t_6` volatile, but we cannot do that since
it's a Cython-generated temporary variable.

A better way is to avoid mutating `t_6` between `setjmp()` and
`longjmp()`.

We look at the original code to see why `t_6` is being mutated in the
first place. It's because we're accessing
`a[i]`, where `a` is a Python list, so Cython needs to store the result
into a Python-reference-counted temporary variable.

By changing `a` from a Python list to a C array, the new code becomes

```c
    sig_GAP_Enter();
    __pyx_t_12 = sig_on(); if (unlikely(__pyx_t_12 == ((int)0)))
__PYX_ERR(0, 2514, __pyx_L9_error)
    switch (__pyx_v_n) {
      case 0:
      __pyx_v_result =
GAP_CallFunc0Args(__pyx_v_self->__pyx_base.value);
      break;
      case 1:
      __pyx_v_result = GAP_CallFunc1Args(__pyx_v_self->__pyx_base.value,
(__pyx_v_a[0]));
      break;
      case 2:
      __pyx_v_result = GAP_CallFunc2Args(__pyx_v_self->__pyx_base.value,
(__pyx_v_a[0]), (__pyx_v_a[1]));
      break;
      case 3:
      __pyx_v_result = GAP_CallFunc3Args(__pyx_v_self->__pyx_base.value,
(__pyx_v_a[0]), (__pyx_v_a[1]), (__pyx_v_a[2]));
      break;
      default:
      __pyx_v_result = GAP_CallFuncList(__pyx_v_self->__pyx_base.value,
__pyx_v_arg_list);
      break;
    }
    sig_off();
  }
      GAP_Leave();
      goto __pyx_L10;
    }
```

Moral of the story: **do not touch any Python object within
`sig_on()`...`sig_off()`!**

The [cysignals documentation](https://cysignals.readthedocs.io/en/latest
/interrupt.html) also warned about this:

> The code inside `sig_on()` should be pure C or Cython code. If you
call any Python code or manipulate any Python object (even something
trivial like `x = []`), an interrupt can mess up Python’s internal
state. When in doubt, try to use sig_check() instead.

------

This does not violate the GAP API either.

```
Code which uses the GAP API exposed by this header file should sandwich
any such calls between uses of the GAP_Enter() and GAP_Leave() macro as
follows:

int ok = GAP_Enter();
if (ok) {
    ... // any number of calls to GAP APIs
}
GAP_Leave();

This is in particular crucial if your code keeps references to any GAP
functions in local variables: Calling GAP_Enter() ensures that GAP is
aware of such references, and will not garbage collect the referenced
objects. Failing to use these macros properly can lead to crashes, or
worse, silent memory corruption. You have been warned!
```

### 📝 Checklist

<!-- Put an `x` in all the boxes that apply. -->

- [ ] The title is concise and informative.
- [ ] The description explains in detail what this PR is about.
- [ ] I have linked a relevant issue or discussion.
- [ ] I have created tests covering the changes.
- [ ] I have updated the documentation and checked the documentation
preview.

### ⌛ Dependencies

<!-- List all open PRs that this PR logically depends on. For example,
-->
<!-- - sagemath#12345: short description why this is a dependency -->
<!-- - sagemath#34567: ... -->
    
URL: sagemath#40594
Reported by: user202729
Reviewer(s):
@tornaria
Copy link
Contributor

With this PR applied to an (otherwise clean) 10.7 source, I am still getting segfaults, although it seems to be a different segfault.

The way to trigger the segfaults is:

$ sage -tp 16 /usr/lib/python3.13/site-packages/sage/libs/gap/element.pyx{,,,}{,,,}{,,,}
[...]
----------------------------------------------------------------------
sage -t --warn-long 5.0 --random-seed=62234612783095899964699722888145863316 /usr/lib/python3.13/site-packages/sage/libs/gap/element.pyx  # SignalError in doctesting framework
sage -t --warn-long 5.0 --random-seed=62234612783095899964699722888145863316 /usr/lib/python3.13/site-packages/sage/libs/gap/element.pyx  # SignalError in doctesting framework
----------------------------------------------------------------------
[...]

Here's the last part of the backtrace:

  [...]
  File "sage/structure/sage_object.pyx", line 756, in sage.structure.sage_object.SageObject._libgap_ (build/cythonized/sage/structure/sage_object.c:8808)
    return libgap.eval(self)
  File "sage/libs/gap/libgap.pyx", line 399, in sage.libs.gap.libgap.Gap.eval (build/cythonized/sage/libs/gap/libgap.c:6956)
    elem = make_any_gap_element(self, gap_eval(gap_command))
  File "sage/libs/gap/util.pyx", line 358, in sage.libs.gap.util.gap_eval (build/cythonized/sage/libs/gap/util.c:7895)
    sig_on()
cysignals.signals.SignalError: Segmentation fault

NB: before the PR, I was still getting this segfault mixed with the one in sage.libs.gap.element.GapElement_Function.__call__. After the PR, it seems I only get the segfault in sage.libs.gap.util.gap_eval.

I won't have time to debug this, but perhaps with this information a similar fix to the one in this PR will solve it @user202729 ?

@user202729
Copy link
Contributor Author

Try #40613 then.

vbraun pushed a commit to vbraun/sage that referenced this pull request Aug 25, 2025
sagemathgh-40613: Fix Ctrl-C segfaults in sage.libs.gap
    
Replace cysignals' `sig_on()` and `sig_off()` with custom `gap_sig_on()`
and `gap_sig_off()` wrappers that use GAP's own `SIGINT` handling. This
fixes an uncommon, but ultimately reproducible segfault when Ctrl-C is
used to interrupt a GAP computation.

In concert with sagemath#40594 this has
allowed `sage/libs/gap/element.pyx` to pass many thousands of test
iterations.

### Fixes

* sagemath#40598

### Dependencies

* sagemath#40594
    
URL: sagemath#40613
Reported by: Michael Orlitzky
Reviewer(s): Dima Pasechnik, Michael Orlitzky, user202729
@vbraun vbraun merged commit 91b1b76 into sagemath:develop Aug 27, 2025
24 of 25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p: CI Fix merged before running CI tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segfault testing src/sage/libs/gap/element.pyx on python 3.12
6 participants