
Conversation


@soehms soehms commented May 24, 2023

Purpose

The goal of this PR is to provide a stable PyPI package for the Khoca code. The underlying branch from November 2021 already resulted in a prerelease PyPI package that can be installed by

pip install khoca

This is a first step towards integrating Khoca into the SageMath khovanov_homology method.

Open tasks

Unfortunately, this version has memory management issues that need to be fixed before a stable release can be announced. In addition, the current PyPI package is installed from source, which takes more than 10 minutes. This should be upgraded to a manylinux binary distribution.

Memory management issues

Since I'm not really familiar with C++ templates, I need help with the memory issues. As far as I understand the problem, it is related to the use of global variables. Specifically, there must be a gap in the ComplexStack class destructor. As a naive attempt to fix it, I added this line to the destructor, but to no avail. The bug shows up in the following lines, because the Frobenius algebra does not persist when the instantiation of InteractiveCalculator is repeated:

>>> from khoca import InteractiveCalculator
>>> InteractiveCalculator(0, (0, 0), 0)
Khovanov homology calculator for Frobenius algebra: Z[X] / (1*X^2).
>>> InteractiveCalculator(0, (0, 0), 0)
Khovanov homology calculator for None
>>> InteractiveCalculator(0, (0, 0), 0)
Khovanov homology calculator for None

Another issue occurs when using InteractiveCalculator in Sage (i.e. after sage -pip install khoca). This appears to be an interference with Pari functionality, but I'm not sure if it has the same origin:

sage: b5_2 = BraidGroup(3)((1, 1, 1, 2, -1, 2))
sage: K5_2 = Link(b5_2)
sage: from khoca import InteractiveCalculator
sage: KH = InteractiveCalculator()
sage: KH(b5_2.Tietze())
[[[-5, 12, 0, 1],
  [-4, 10, 0, 1],
......
  [0, 3, 0, 0],
  [-1, 5, 0, 0],
  [0, 5, 0, 0]]]
sage:
sage: K5_2.khovanov_homology()
Segmentation fault

The segmentation fault does not occur if you call the two methods in reverse order:

sage: b5_2 = BraidGroup(3)((1, 1, 1, 2, -1, 2))
sage: K5_2 = Link(b5_2)
sage: from khoca import InteractiveCalculator
sage: KH = InteractiveCalculator()
sage: K5_2.khovanov_homology()
{1: {-1: 0, 0: Z, 1: 0, 2: 0},
 3: {-1: 0, 0: Z, 1: Z, 2: 0, 3: 0},
 5: {-1: 0, 0: 0, 1: 0, 2: Z x C2, 3: 0, 4: 0, 5: 0},
 7: {0: 0, 1: 0, 2: Z, 3: C2, 4: 0, 5: 0},
 9: {1: 0, 2: 0, 3: Z, 4: Z, 5: 0},
 11: {2: 0, 3: 0, 4: 0, 5: C2},
 13: {5: Z}}
sage: KH(b5_2.Tietze())
[[[-5, 12, 0, 1],
  [-4, 10, 0, 1],
...
  [0, 3, 0, 0],
  [-1, 5, 0, 0],
  [0, 5, 0, 0]]]
sage:

The interference shows up in the inline function zeromatcopy of the Pari interface, specifically here:

cdef GEN _new_GEN_from_fmpz_mat_t_rotate90(fmpz_mat_t B):
    r"""
    Create a new PARI ``t_MAT`` with ``nr`` rows and ``nc`` columns
    from a ``fmpz_mat_t`` and rotate the matrix 90 degrees
    counterclockwise.  So the resulting matrix will have ``nc`` rows
    and ``nr`` columns.

    For internal use only; this directly uses the PARI stack.
    One should call ``sig_on()`` before and ``sig_off()`` after.
    """
    cdef GEN x
    cdef GEN A = zeromatcopy(fmpz_mat_ncols(B), fmpz_mat_nrows(B))
    cdef Py_ssize_t i, j
    for i in range(fmpz_mat_nrows(B)):
        for j in range(fmpz_mat_ncols(B)):
            x = _new_GEN_from_fmpz_t(fmpz_mat_entry(B, i, fmpz_mat_ncols(B) - j - 1))
            set_gcoeff(A, j+1, i+1, x)  # A[j+1, i+1] = x (using 1-based indexing)
    return A
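The index arithmetic in this function is easy to get wrong, so here is a small pure-Python model of the same mapping (`A[j][i] = B[i][nc - 1 - j]` in 0-based terms), with Python lists standing in for the PARI and FLINT matrix types:

```python
def rotate90_ccw(B):
    """Rotate an nr x nc matrix 90 degrees counterclockwise, mirroring the
    index mapping of _new_GEN_from_fmpz_mat_t_rotate90: the result has
    nc rows and nr columns, with A[j][i] = B[i][nc - 1 - j] (0-based)."""
    nr, nc = len(B), len(B[0])
    A = [[0] * nr for _ in range(nc)]
    for i in range(nr):
        for j in range(nc):
            A[j][i] = B[i][nc - 1 - j]
    return A

B = [[1, 2, 3],
     [4, 5, 6]]           # 2 x 3
print(rotate90_ccw(B))    # [[3, 6], [2, 5], [1, 4]]  (3 x 2)
```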

@soehms soehms force-pushed the pypi branch 12 times, most recently from bd0fb31 to bc56529 Compare March 8, 2025 22:01
@soehms soehms force-pushed the pypi branch 10 times, most recently from 233d851 to d5fc92d Compare March 30, 2025 20:32
@soehms soehms force-pushed the pypi branch 2 times, most recently from 0823478 to 2b01541 Compare April 2, 2025 13:25
@soehms soehms force-pushed the pypi branch 5 times, most recently from f5be7c9 to 22ce636 Compare May 7, 2025 06:32

soehms commented May 10, 2025

As of May 2025

I've made some progress over the past three months:

  1. Binary wheels are now available on PyPI. They cover the major Linux distributions and macOS. Windows support is still in progress.

  2. The two memory issues mentioned in the opening comment are now resolved. The first was addressed in 5f48583 by adding static vector initialization. The second (the segmentation fault) was fixed in fd671c5 by saving the previously used Pari stack before the calculation and restoring it afterwards. Note that the second issue only affected the source distribution.

  3. I have already opened PR Integration of a new optional package for Khovanov homology sagemath/sage#40081 for the Sage integration.

  4. Unfortunately, an incompatibility with Sage persists, which occurs sporadically. Most of the time, all doctests under src/sage/knots for PR #40081 pass, but sometimes the following happens:

      File "/home/sebastian/devel/sage/src/sage/knots/link.py", line 1235, in _khovanov_homology_cached
        homologies = ChainComplex(complexes).homology()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/sebastian/devel/sage/src/sage/homology/chain_complex.py", line 1259, in homology
        answer[deg] = self._homology_in_degree(deg, base_ring, verbose, generators, algorithm)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/sebastian/devel/sage/src/sage/homology/chain_complex.py", line 1328, in _homology_in_degree
        all_divs = d_in.elementary_divisors(algorithm)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "sage/matrix/matrix_integer_dense.pyx", line 2346, in sage.matrix.matrix_integer_dense.Matrix_integer_dense.elementary_divisors
        d = self.__pari__().matsnf(0).sage()
      File "cypari2/gen.pyx", line 1967, in cypari2.gen.Gen.sage
        return gen_to_sage(self, locals)
      File "sage/libs/pari/convert_sage.pyx", line 55, in sage.libs.pari.convert_sage.gen_to_sage
        cpdef gen_to_sage(Gen z, locals=None):
      File "sage/libs/pari/convert_sage.pyx", line 316, in sage.libs.pari.convert_sage.gen_to_sage
        return [gen_to_sage(x, locals) for x in z.python_list()]
      File "cypari2/gen.pyx", line 1934, in cypari2.gen.Gen.python_list
        return [self[n] for n in range(glength(self.g))]
      File "cypari2/gen.pyx", line 1404, in cypari2.gen.Gen.__getitem__
        val = self.new_ref(gel(self.fixGEN(), i+1))
      File "cypari2/gen.pyx", line 209, in cypari2.gen.Gen.fixGEN
        move_gens_to_heap(self.sp())
      File "cypari2/stack.pyx", line 123, in cypari2.stack.move_gens_to_heap
        sig_on()
    SystemError: calling remove_from_pari_stack() inside sig_on()

Unlike the now-fixed segfault issue, this problem also occurs in binary distributions. Therefore, it's likely not due to the shared Pari stack. Any help in fixing it is welcome!
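The save-and-restore approach mentioned in point 2 above (backing up the previously used Pari stack before a calculation and restoring it afterwards) follows a generic pattern that can be sketched in Python. The names `pari_state` and `pari_stack_guard` are hypothetical stand-ins, not the actual Khoca code:

```python
from contextlib import contextmanager

# Hypothetical stand-in for the shared Pari stack state.
pari_state = {"stack_top": 0x1000, "depth": 0}

@contextmanager
def pari_stack_guard(state):
    """Back up shared state on entry and restore it on exit, so a
    calculation cannot leave another library's view of it corrupted."""
    backup = dict(state)
    try:
        yield state
    finally:
        state.clear()
        state.update(backup)

with pari_stack_guard(pari_state) as st:
    st["stack_top"] = 0x2000   # the calculation clobbers the shared state...
    st["depth"] = 7

print(pari_state)              # ...but the original state is restored on exit
```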

When building binary Windows wheels natively, the workflow fails at the link step:

2025-05-07T22:12:15.9322944Z   C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.43.34808\bin\HostX86\x64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\msys64\mingw64\lib /LIBPATH:C:\msys64\mingw64\bin /LIBPATH:Pari42\bin /LIBPATH:C:\msys64\tmp\cibw-run-epap9e06\cp36-win_amd64\build\venv\libs /LIBPATH:C:\Users\runneradmin\AppData\Local\pypa\cibuildwheel\Cache\nuget-cpython\python.3.6.8\tools\libs /LIBPATH:C:\Users\runneradmin\AppData\Local\pypa\cibuildwheel\Cache\nuget-cpython\python.3.6.8\tools /LIBPATH:C:\msys64\tmp\cibw-run-epap9e06\cp36-win_amd64\build\venv\PCbuild\amd64 "/LIBPATH:C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.43.34808\ATLMFC\lib\x64" "/LIBPATH:C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.43.34808\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\lib\um\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.26100.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\\lib\10.0.26100.0\\um\x64" /EXPORT:PyInit_pui build\temp.win-amd64-3.6\Release\src\compilerFlagsInfo.obj build\temp.win-amd64-3.6\Release\src\shared.obj build\temp.win-amd64-3.6\Release\src\krasner\krasner.obj build\temp.win-amd64-3.6\Release\src\planar_algebra\coefficient_rings.obj build\temp.win-amd64-3.6\Release\src\planar_algebra\planar_algebra.obj build\temp.win-amd64-3.6\Release\src\planar_algebra\smith.obj build\temp.win-amd64-3.6\Release\src\planar_algebra\sparsemat.obj build\temp.win-amd64-3.6\Release\src\python_interface\pui.obj build\temp.win-amd64-3.6\Release\src\python_interface\pythonInterface.obj /OUT:build\lib.win-amd64-3.6\khoca\bin\pui.cp36-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.6\Release\src\pui.cp36-win_amd64.lib C:\msys64\mingw64\lib\libgmp.dll.a C:\msys64\mingw64\lib\libgmpxx.dll.a Pari42\bin\libpari.dll.a
2025-05-07T22:12:15.9323401Z      Creating library build\temp.win-amd64-3.6\Release\src\pui.cp36-win_amd64.lib and object build\temp.win-amd64-3.6\Release\src\pui.cp36-win_amd64.exp
2025-05-07T22:12:15.9324319Z   coefficient_rings.obj : error LNK2001: unresolved external symbol "__declspec(dllimport) class std::basic_ostream<char,struct std::char_traits<char> > & __cdecl operator<<(class std::basic_ostream<char,struct std::char_traits<char> > &,struct __mpz_struct const *)" (__imp_??6@YAAEAV?$basic_ostream@DU?$char_traits@D@std@@@std@@AEAV01@PEBU__mpz_struct@@@Z)
2025-05-07T22:12:15.9325211Z   coefficient_rings.obj : error LNK2001: unresolved external symbol "__declspec(dllimport) class std::basic_ostream<char,struct std::char_traits<char> > & __cdecl operator<<(class std::basic_ostream<char,struct std::char_traits<char> > &,struct __mpq_struct const *)" (__imp_??6@YAAEAV?$basic_ostream@DU?$char_traits@D@std@@@std@@AEAV01@PEBU__mpq_struct@@@Z)
2025-05-07T22:12:15.9325356Z   smith.obj : error LNK2001: unresolved external symbol win32ctrlc
2025-05-07T22:12:15.9325583Z   smith.obj : error LNK2001: unresolved external symbol "__int64 const CATCH_ALL" (?CATCH_ALL@@3_JB)
2025-05-07T22:12:15.9325789Z   smith.obj : error LNK2001: unresolved external symbol "unsigned __int64 avma" (?avma@@3_KA)
2025-05-07T22:12:15.9325987Z   smith.obj : error LNK2001: unresolved external symbol "__int64 * gen_1" (?gen_1@@3PEA_JEA)
2025-05-07T22:12:15.9326330Z   smith.obj : error LNK2001: unresolved external symbol "struct _SETJMP_FLOAT128 (* iferr_env)[16]" (?iferr_env@@3PEAY0BA@U_SETJMP_FLOAT128@@EA)
2025-05-07T22:12:15.9326633Z   smith.obj : error LNK2001: unresolved external symbol "struct pari_mainstack * pari_mainstack" (?pari_mainstack@@3PEAU0@EA)
2025-05-07T22:12:15.9326831Z   smith.obj : error LNK2001: unresolved external symbol "__int64 * gen_0" (?gen_0@@3PEA_JEA)
2025-05-07T22:12:15.9327064Z   build\lib.win-amd64-3.6\khoca\bin\pui.cp36-win_amd64.pyd : fatal error LNK1120: 9 unresolved externals
2025-05-07T22:12:15.9327447Z   error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Enterprise\\VC\\Tools\\MSVC\\14.43.34808\\bin\\HostX86\\x64\\link.exe' failed with exit status 1120

I think the missing symbols must be contained in Pari42/bin/libpari.dll.a, which was installed here:

2025-05-07T22:09:09.9411314Z ../config/install -m 644 libpari.dll.a "/d/a/khoca/khoca/Pari42/bin"/libpari.dll.a

and which is specified in the linker options. So I don't understand the problem here. Any help with this is also welcome!

@soehms soehms marked this pull request as ready for review May 10, 2025 12:19

ptrrsn commented Jul 11, 2025

How can I reproduce the SystemError in step 4? Do I only need to:

  1. Apply Integration of a new optional package for Khovanov homology sagemath/sage#40081
  2. Build sage
  3. Run doctest under src/sage/knots multiple times?

Also, in what platform did you see the SystemError? Linux?


soehms commented Jul 11, 2025

How can I reproduce the SystemError in step 4? Do I only need to:

  1. Apply Integration of a new optional package for Khovanov homology sagemath/sage#40081
  2. Build sage
  3. Run doctest under src/sage/knots multiple times?

Unless you set --enable-khoca on ./configure, you have to install the optional package after step 2 (./sage -pip install khoca).

Also, in what platform did you see the SystemError? Linux?

Yes! I saw it under Linux Mint 21.1 on a ThinkPad with an i5 and under Ubuntu 20.04 running under WSL2 on a ThinkPad with an i7.


ptrrsn commented Jul 15, 2025

Sorry for the late reply. As can be seen in the stack trace, the SystemError was triggered because of the following statement in remove_from_pari_stack in cypari2/stack.pyx:

if sig_on_count and not block_sigint

where block_sigint is a global counter that is incremented every time sig_block is called and decremented every time sig_unblock is called, anywhere in the process, not just in cypari2. Even though remove_from_pari_stack is preceded by sig_block() (in move_gens_to_heap), block_sigint becomes zero because block_sigint was already, incorrectly, -1 before move_gens_to_heap was called.

When I checked where block_sigint reached -1 (which can happen anywhere, since block_sigint is global), it led to an unrelated place outside cypari2, which seems to be the actual root cause:

/home/risan/sage/sage/local/var/lib/sage/venv-python3.12/lib/python3.12/site-packages/cysignals/signals.cpython-312-x86_64-linux-gnu.so(+0x9bb4)[0x723e2c459bb4]
/home/risan/sage/sage/local/var/lib/sage/venv-python3.12/lib/python3.12/site-packages/cysignals/signals.cpython-312-x86_64-linux-gnu.so(+0x9bb4)[0x723e2c459bb4]
/home/risan/sage/sage/src/sage/ext/memory.cpython-312-x86_64-linux-gnu.so(+0x4292)[0x723e25642292]
/lib/x86_64-linux-gnu/libgmp.so.10(__gmpz_realloc+0x44)[0x723e26457134]
/lib/x86_64-linux-gnu/libgmp.so.10(__gmpz_mul+0x3c3)[0x723e2645e773]
/home/risan/sage/sage/local/var/lib/sage/venv-python3.12/lib/python3.12/site-packages/khoca/bin/pui.cpython-312-x86_64-linux-gnu.so(_ZNK11KrasnerCoboI8MIntegerLi8EE7composeERKS1_RSt6vectorIS1_SaIS1_EERK13KrasnerTangleSA_SA_+0x96d)[0x723dd23b65bd]
/home/risan/sage/sage/local/var/lib/sage/venv-python3.12/lib/python3.12/site-packages/khoca/bin/pui.cpython-312-x86_64-linux-gnu.so(_ZN7LCCobosI11KrasnerCoboI8MIntegerLi8EEE7composeERKS3_RK13KrasnerTangleS8_S8_+0xb9)[0x723dd281f439]
/home/risan/sage/sage/local/var/lib/sage/venv-python3.12/lib/python3.12/site-packages/khoca/bin/pui.cpython-312-x86_64-linux-gnu.so(_ZN10MatLCCobosI11KrasnerCoboI8MIntegerLi8EEE11gaussThreadERKSt6vectorI7LCCobosIS2_ESaIS6_EESA_S4_ImSaImEESC_bRKS6_S1_RK10VecTanglesI13KrasnerTangleESJ_RKSG_SL_iiRS4_IPS6_SaISM_EERSt5mutexRSt18condition_variable+0x30b)[0x723dd281f7eb]
/home/risan/sage/sage/local/var/lib/sage/venv-python3.12/lib/python3.12/site-packages/khoca/bin/pui.cpython-312-x86_64-linux-gnu.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJM10MatLCCobosI11KrasnerCoboI8MIntegerLi8EEEFvRKSt6vectorI7LCCobosIS6_ESaISA_EESE_S8_ImSaImEESG_bRKSA_S5_RK10VecTanglesI13KrasnerTangleESN_RKSK_SP_iiRS8_IPSA_SaISQ_EERSt5mutexRSt18condition_variableEPS7_St17reference_wrapperISC_ES12_S11_ISG_ES13_bS11_ISA_ES5_S11_ISM_ES15_S11_ISO_ES16_lmS11_ISS_ES11_ISU_ES11_ISW_EEEEEE6_M_runEv+0x23d)[0x723dd25e28bd]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4)[0x723e212ecdb4]
/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x723e2c89caa4]
/home/risan/sage/sage/src/sage/ext/memory.cpython-312-x86_64-linux-gnu.so(+0x446a)[0x723e2564246a]
/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c)[0x723e2c929c3c]

As you can see, there seem to be race conditions when multiple threads involving at least one gaussThread are triggered from khoca. These threads access GMP integers at the same time, but the following line from src/sage/ext/memory.pyx shows that the GMP allocation functions are replaced with Sage's functions, which are protected by cysignals. They cannot run concurrently, otherwise they mess up the global counters above.

mp_set_memory_functions(sage_sig_malloc, sage_sig_realloc, sage_sig_free)

So it seems that it is unsafe to run things concurrently in Sage that involve GMP integers, even if they operate on different integers, because they might mess up global counters like block_sigint above.

I am not an expert in this area, but it seems to be an unfortunate side effect of the cysignals design. A non-ideal workaround here is to prevent all of your GMP accesses from running concurrently.

On the Sage side, perhaps we should protect the sage_sig_* methods with some global mutex? @dimpase @jdemeyer is this race condition familiar to you and/or actually expected?
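The race described above can be modeled in a few lines: a block/unblock counter shared by threads must be updated atomically (or under a mutex), otherwise increments and decrements from different threads can interleave and leave it unbalanced. This is only an illustration of the invariant, not cysignals' actual implementation:

```python
import threading

counter = 0                   # stands in for the global block_sigint
lock = threading.Lock()

def block():
    global counter
    with lock:                # the mutex plays the role of an atomic update
        counter += 1

def unblock():
    global counter
    with lock:
        counter -= 1

def worker(n):
    for _ in range(n):
        block()               # e.g. sig_block() inside an allocation hook
        unblock()             # e.g. the matching sig_unblock()

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                # 0: every block() was matched by an unblock()
```

With a plain non-atomic counter the same workload can end with an unbalanced, even negative, value, which is exactly the block_sigint == -1 state that cypari2 later trips over.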


ptrrsn commented Jul 15, 2025

I am not at all an expert on cysignals, so my opinion here might not make sense at all, but I am wondering whether it is even remotely reasonable to consider adding some cysignals methods that are more "local" (using local counters instead of global ones) on the cysignals side (I am raising this thought because I added @dimpase @jdemeyer here)...

It just seems scary that a really different part of sage can affect others due to such global state.


soehms commented Jul 15, 2025

Sorry for the late reply.

No problem, there's no rush. Thank you for investigating the issue!

As you can see, there seem to be race conditions when multiple threads involving at least one gaussThread are triggered from khoca. These threads access GMP integers at the same time but the following lines from src/sage/ext/memory.pyx shows that these GMP objects allocation functions are replaced with sage's functions that are protected by cysignals - and they cannot run concurrently otherwise they will mess up the global counters above.

I probably haven't figured out everything you've discovered so far: does this mean that the gaussThread method is messing up memory uncontrollably? It has access to GMP integers (how does this relate?), but not explicitly to block_sigint. I've also observed the error when running the doctests in single-threaded mode (which, of course, doesn't mean that every test involving Khoca is single-threaded, too) and when running tests in an IPython session. I don't understand how Cypari2 and Khoca threads can run concurrently in these tests.


ptrrsn commented Jul 16, 2025

Now that I think about it again, it is also possible that the problem happened before khoca; I will do a deeper test to see whether it actually starts in khoca in this case. But there is also the possibility that the state was corrupted inside khoca, which is what I describe in more detail below.

As you correctly mentioned, gaussThread has access to GMP integers. But even though gaussThread has no explicit access to block_sigint, the GMP integers are modifying the block_sigint through this:
(1) https://github.com/sagemath/sage/blob/931cc5e8dc03580c9fecbd0535544a725e9baeb7/src/sage/ext/memory.pyx#L94

In the code above, GMP integer allocation is configured to go through the sage_sig_* functions. As can be seen in the stack trace I posted above, gaussThread reaches __gmpz_mul (multiplication of GMP integers), which leads to __gmpz_realloc. Due to (1), __gmpz_realloc leads to sage_sig_realloc:
https://github.com/sagemath/sage/blob/931cc5e8dc03580c9fecbd0535544a725e9baeb7/src/sage/ext/memory.pyx#L72
which calls sig_realloc from cysignals:
https://github.com/sagemath/cysignals/blob/fbfc5592a5750729ff9fe6f5e19cfee3f1fb8b2a/src/cysignals/memory.pxd#L43
which calls sig_block and sig_unblock that modifies the global block_sigint:
https://github.com/sagemath/cysignals/blob/fbfc5592a5750729ff9fe6f5e19cfee3f1fb8b2a/src/cysignals/macros.h#L246

And the problem is not necessarily that gaussThread runs concurrently with cypari2. From my understanding, there are threads running concurrently in khoca that access GMP integers. Perhaps they access different GMP integers, but some of those accesses indirectly modify the same global block_sigint, causing a race condition.

And for cypari2: note that this problem occurred not necessarily because cypari2 was running concurrently with khoca threads. It just happened that when the cypari2 code was reached (which might be long after those concurrent threads ran), the block_sigint variable was already a mess (i.e., it had reached -1, which should not be possible, since a counter must always be nonnegative), and cypari2 caught and complained about it (giving the error you saw). But the root cause seems to lie outside cypari2, as explained above.

And note that (1) gives a different behavior compared with running khoca outside Sage. So it is a Sage-specific problem.

But as I mentioned at the beginning, it is also entirely possible that the race condition actually happened before khoca, and that the block_sigint value was already a mess (maybe one less than it should be) when the khoca threads were reached. I will investigate this more deeply.


soehms commented Jul 16, 2025

@ptrrsn, thank you for the very clear explanation. I now fully understand your arguments and agree with your conclusion that these Cysignals globals like block_sigint should at least be made multithreading-safe.

For now, I'm thinking about protecting Khoca from the swapped GMP memory allocation functions. I've already included code to protect Cypari2's Pari stack from the Pari stack used by Khoca (see, for example, pari_backup and pari_rollback). This is only required if Khoca was built from source in the Sage environment. Otherwise, Khoca uses the Pari stack from its own shared library, which amazingly also includes mp_set_memory_functions (but was obviously not selected by the linker in this case):

sebastian@SebaThinkPad:~/devel/sage$ grep -r mp_set_memory_function local/var/lib/sage/venv-python3.12.5/lib/python3.12/site-packages/khoca.libs/
grep: local/var/lib/sage/venv-python3.12.5/lib/python3.12/site-packages/khoca.libs/libpari-gmp-0b366a25.so.2.17.1: binary file matches
grep: local/var/lib/sage/venv-python3.12.5/lib/python3.12/site-packages/khoca.libs/libgmp-afec2dd4.so.10.2.0: binary file matches

Do you think it makes sense to use a similar construction for the GMP memory allocation functions, i.e., writing the function pointers to a backup when entering Khoca and restoring them when leaving?
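The construction asked about here (swapping the GMP allocation function pointers on entry and restoring them on exit) would be the same backup/restore pattern, applied to the hooks installed via mp_set_memory_functions. A schematic Python sketch, with a dict of hook names standing in for the real C function pointers (all names hypothetical):

```python
# Hypothetical table of allocation hooks, standing in for the function
# pointers that GMP's mp_set_memory_functions installs.
hooks = {"malloc": "sage_sig_malloc",
         "realloc": "sage_sig_realloc",
         "free": "sage_sig_free"}

class swapped_alloc_hooks:
    """On entering Khoca, install its own hooks; on leaving, restore
    whatever hooks (e.g. Sage's) were installed before."""
    def __init__(self, table, own):
        self.table, self.own = table, own

    def __enter__(self):
        self.saved = dict(self.table)   # back up the current pointers
        self.table.update(self.own)
        return self.table

    def __exit__(self, *exc):
        self.table.clear()
        self.table.update(self.saved)   # restore them on exit
        return False

with swapped_alloc_hooks(hooks, {"malloc": "khoca_malloc",
                                 "realloc": "khoca_realloc",
                                 "free": "khoca_free"}):
    assert hooks["malloc"] == "khoca_malloc"  # Khoca's hooks active here

print(hooks["malloc"])   # sage_sig_malloc: the previous hooks are back
```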


ptrrsn commented Jul 17, 2025

Hmm, that's interesting. But I am still a bit hesitant to make an actual suggestion before finding the exact reason for the block_sigint=-1 in gaussThread, as my explanation above still has some element of guesses. I will look at this deeper and come back when I have something more accurate.


ptrrsn commented Jul 20, 2025

Continuing the story: it appears that the root cause really was the race condition during the gaussThread execution, as I elaborated above. Unfortunately, cysigs.block_sigint is not correctly configured in Sage (or at least in my installation). Its type, cy_atomic_int, ends up configured as a non-atomic type in the Sage installation:
https://github.com/sagemath/cysignals/blob/fbfc5592a5750729ff9fe6f5e19cfee3f1fb8b2a/src/cysignals/struct_signals.h#L51

I will make a PR to fix that in cysignals so you don't need to change your code. That will also fix potential race conditions in other parts of sage.


ptrrsn commented Jul 20, 2025

Just to double-check: can you post the content of sage/local/var/lib/sage/venv-python3.x/lib/python3.x/site-packages/cysignals/cysignals_config.h from your Sage installation (change 3.x to your Python 3 version)? The CYSIGNALS_* macros there are probably all zeros.


ptrrsn commented Jul 20, 2025

I had a PR in cysignals that will fix this problem: sagemath/cysignals#230


soehms commented Jul 21, 2025

I had a PR in cysignals that will fix this problem: sagemath/cysignals#230

Great! Looks like very good work! So the issue was caused by a regression that occurred eight months ago when switching cysignals to Meson, right?

Just to double check, can you post the content of: sage/local/var/lib/sage/venv-python3.x/lib/python3.x/site-packages/cysignals/cysignals_config.h in your sage source code (change 3.x to the python 3 version)?

For the sake of completeness, I confirm your assumption:

sebastian@SebaThinkPad:~/devel/sage$ cat local/var/lib/sage/venv-python3.12.5/lib/python3.12/site-packages/cysignals/cysignals_config.h
/*
 * Autogenerated by the Meson build system.
 * Do not edit, your changes will be lost.
 */

#pragma once

#define CYSIGNALS_CXX_ATOMIC 0

#define CYSIGNALS_CXX_ATOMIC_WITH_OPENMP 0

#define CYSIGNALS_C_ATOMIC 0

#define CYSIGNALS_C_ATOMIC_WITH_OPENMP 0

#define CYSIGNALS_STD_ATOMIC 0

#define CYSIGNALS_STD_ATOMIC_WITH_OPENMP 0

#define CYSIGNALS_USE_SIGSETJMP 1

#define ENABLE_DEBUG_CYSIGNALS 0

#define HAVE_BACKTRACE 1
...


ptrrsn commented Jul 21, 2025

Thank you for providing the header content. And I agree that the problem seems to have been introduced by a regression from the Meson migration.


soehms commented Jul 22, 2025

@LLewark, I suggest merging this PR so that the PyPI package aligns with the code in your repository's master branch. According to recent comments above, there is no longer a serious issue here. Furthermore, there are successful CI runs that ensure there are no regressions regarding existing functionality, i.e., local builds from source via the Makefile (see the Run Tests CI runs) and the creation of Docker images on DockerHub (see the Publish Docker CI runs).

Furthermore, I think you should also merge PR #2, as it fixes a mathematical error.

vbraun pushed a commit to vbraun/sage that referenced this pull request Aug 25, 2025
sagemathgh-40081: Integration of a new optional package for Khovanov homology
    

Sage has a native implementation of Khovanov homology. Unfortunately,
it's not very performant. For knots and links with more than six
crossings, the calculation takes more than a second on a ThinkPad with
an i7 (i7-1165G7 @ 2.80GHz) processor (running Ubuntu under WSL2). For a
knot with nine crossings, it takes more than a minute, and for 10
crossings, the calculation breaks (after about an hour) with a `Pari
stack overflow`:

```
sage: K7 = Knots().from_table(7,1)
....: K8 = Knots().from_table(8,1)
....: K9 = Knots().from_table(9,1)
....: K10 = Knots().from_table(10,1)
sage: %time K7.khovanov_homology()
CPU times: user 679 ms, sys: 0 ns, total: 679 ms
Wall time: 679 ms
{-21: {-7: Z},
 -19: {-7: 0, -6: C2},
 -17: {-7: 0, -6: Z, -5: Z},
 -15: {-7: 0, -6: 0, -5: 0, -4: C2},
 -13: {-7: 0, -6: 0, -5: 0, -4: Z, -3: Z},
 -11: {-7: 0, -6: 0, -5: 0, -4: 0, -3: 0, -2: C2},
 -9: {-7: 0, -6: 0, -5: 0, -4: 0, -3: 0, -2: Z, -1: 0, 0: 0},
 -7: {-7: 0, -6: 0, -5: 0, -4: 0, -3: 0, -2: 0, -1: 0, 0: Z},
 -5: {0: Z}}
sage: %time K8.khovanov_homology()
CPU times: user 22.9 s, sys: 74.1 ms, total: 23 s
Wall time: 23 s
{-13: {-6: Z},
 -11: {-6: 0, -5: C2, -4: 0, -3: 0},
 -9: {-6: 0, -5: Z, -4: Z, -3: 0, -2: 0, -1: 0},
 -7: {-6: 0, -5: 0, -4: 0, -3: Z x C2, -2: 0, -1: 0, 0: 0, 1: 0},
 -5: {-6: 0, -5: 0, -4: 0, -3: Z, -2: Z x C2, -1: 0, 0: 0, 1: 0, 2: 0},
 -3: {-6: 0, -5: 0, -4: 0, -3: 0, -2: Z, -1: Z x C2, 0: 0, 1: 0, 2: 0,
3: 0},
 -1: {-4: 0, -3: 0, -2: 0, -1: Z, 0: Z x C2, 1: 0, 2: 0, 3: 0, 4: 0},
 1: {-3: 0, -2: 0, -1: 0, 0: Z x Z, 1: Z, 2: 0, 3: 0, 4: 0},
 3: {-1: 0, 0: 0, 1: 0, 2: C2, 3: 0, 4: 0},
 5: {1: 0, 2: Z, 3: 0, 4: 0}}
sage: %time K9.khovanov_homology()
CPU times: user 1min 23s, sys: 260 ms, total: 1min 23s
Wall time: 1min 23s
{-27: {-9: Z},
 -25: {-9: 0, -8: C2},
 -23: {-9: 0, -8: Z, -7: Z},
 -21: {-9: 0, -8: 0, -7: 0, -6: C2},
 -19: {-9: 0, -8: 0, -7: 0, -6: Z, -5: Z},
 -17: {-9: 0, -8: 0, -7: 0, -6: 0, -5: 0, -4: C2},
 -15: {-9: 0, -8: 0, -7: 0, -6: 0, -5: 0, -4: Z, -3: Z},
 -13: {-9: 0, -8: 0, -7: 0, -6: 0, -5: 0, -4: 0, -3: 0, -2: C2},
 -11: {-9: 0, -8: 0, -7: 0, -6: 0, -5: 0, -4: 0, -3: 0, -2: Z, -1: 0, 0:
0},
 -9: {-9: 0, -8: 0, -7: 0, -6: 0, -5: 0, -4: 0, -3: 0, -2: 0, -1: 0, 0:
Z},
 -7: {0: Z}}
sage: %time K10.khovanov_homology()
Traceback (most recent call last):
...
PariError: the PARI stack overflows (current size: 1073741824; maximum
size: 1073741824)
You can use pari.allocatemem() to change the stack size and try again
```

If we do the same with the optional package `Khoca` implemented in this
PR, we get the following: The knot with 9 crossings takes less than half
a second and the example with 10 crossings terminates after less than
two seconds:

```
sage: %time K9.khovanov_homology(implementation='Khoca')
CPU times: user 268 ms, sys: 44.7 ms, total: 313 ms
Wall time: 293 ms
{-27: {-9: Z},
 -25: {-9: 0, -8: C2},
 -23: {-9: 0, -8: Z, -7: Z},
 -21: {-8: 0, -7: 0, -6: C2},
 -19: {-7: 0, -6: Z, -5: Z},
 -17: {-6: 0, -5: 0, -4: C2},
 -15: {-5: 0, -4: Z, -3: Z},
 -13: {-4: 0, -3: 0, -2: C2},
 -11: {-3: 0, -2: Z},
 -9: {0: Z},
 -7: {0: Z}}
sage: %time K10.khovanov_homology(implementation='Khoca')
CPU times: user 1.55 s, sys: 123 ms, total: 1.68 s
Wall time: 1.65 s
{-17: {-8: Z},
 -15: {-8: 0, -7: C2},
 -13: {-8: 0, -7: Z, -6: Z},
 -11: {-7: 0, -6: 0, -5: Z x C2},
 -9: {-6: 0, -5: Z, -4: Z x C2},
 -7: {-5: 0, -4: Z, -3: Z x C2},
 -5: {-4: 0, -3: Z, -2: Z x C2},
 -3: {-3: 0, -2: Z, -1: Z x C2},
 -1: {-2: 0, -1: Z, 0: Z x C2},
 1: {-1: 0, 0: Z x Z, 1: Z},
 3: {0: 0, 1: 0, 2: C2},
 5: {1: 0, 2: Z}}
```

Even knots with 13 crossings only need a few seconds:

```
sage: K13 = KnotInfo.K13a_1.link()
sage: %time K13.khovanov_homology(implementation='Khoca')
CPU times: user 2.28 s, sys: 578 ms, total: 2.86 s
Wall time: 2.6 s
{-17: {-8: Z},
 -15: {-8: 0, -7: Z x Z x Z x C2},
 -13: {-8: 0, -7: Z, -6: Z^7 x C2 x C2 x C2},
 -11: {-7: 0, -6: Z x Z x Z, -5: Z^13 x C2^7},
 -9: {-6: 0, -5: Z^7, -4: Z^19 x C2^13},
 -7: {-5: 0, -4: Z^13, -3: Z^24 x C2^19},
 -5: {-4: 0, -3: Z^19, -2: Z^27 x C2^24},
 -3: {-3: 0, -2: Z^24, -1: Z^25 x C2^27},
 -1: {-2: 0, -1: Z^27, 0: Z^22 x C2^25},
 1: {-1: 0, 0: Z^26, 1: Z^16 x C2^21},
 3: {0: 0, 1: Z^21, 2: Z^9 x C2^16},
 5: {1: 0, 2: Z^16, 3: Z x Z x Z x Z x C2^9},
 7: {2: 0, 3: Z^9, 4: Z x C2 x C2 x C2 x C2},
 9: {3: 0, 4: Z x Z x Z x Z, 5: C2},
 11: {4: 0, 5: Z}}
```

The software behind the optional package is
[Khoca](http://lewark.de/lukas/khoca.html). It was written in C++ by
[Lukas Lewark](https://people.math.ethz.ch/~llewark/index.php). I
packaged it on [PyPI](https://pypi.org/project/khoca/) for integration
into Sage (see LLewark/khoca#3).

Currently, there is still an issue with the PyPI package regarding
integration with Sage, as both use the Pari library (see
LLewark/khoca#3 (comment)). This
occurs sporadically and triggers a SystemError raised from
`cypari2/stack.pyx`. The error disappears when the last action is
re-executed. I'm leaving this PR as a draft until the issue is resolved.



### 📝 Checklist


- [x] The title is concise and informative.
- [x] The description explains in detail what this PR is about.
- [x] I have linked a relevant issue or discussion.
- [x] I have created tests covering the changes.
- [x] I have updated the documentation and checked the documentation
preview.

### ⌛ Dependencies

    
URL: sagemath#40081
Reported by: Sebastian Oehms
Reviewer(s): Sebastian Oehms, Travis Scrimshaw