Conversation

oscarbenjamin
Collaborator

References to other Issues or PRs

Makes use of fraction-free routines as discussed in gh-21834.

This is based on the benchmark measurements in gh-25410.

This is built on top of the PR gh-25395 (which I had hoped could be merged before opening this PR). The sparse version of rref_den added there is always faster than the dense implementation added in gh-25346 and should be the main workhorse for most RREF operations.

Previous related PRs are gh-25346, gh-25369, gh-25367 and gh-25336.

Brief description of what is fixed or changed

This PR changes Matrix.rref so that for a matrix of only integers or rational numbers it will use DomainMatrix.rref or DomainMatrix.rref_den, which should be entirely equivalent but a lot faster in all cases.

The DomainMatrix.rref method itself was also made faster for the ZZ and QQ domains by using empirically determined rules for deciding which algorithm to use. For ZZ there are two extremes in which different algorithms are preferred: large dense matrices with large integers should use fraction-free Gauss-Jordan (GJFF) elimination over ZZ whereas very sparse matrices with small integers should use Gauss-Jordan with division (GJDIV) over QQ. For QQ the primary consideration is the nature of the denominators: small denominators should just be cleared to use GJFF over ZZ whereas very large denominators should use GJDIV over QQ. In between there are some cases where GJFF over QQ is best.
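As a minimal sketch of how the two entry points relate (assuming the (rref, den, pivots) return convention of rref_den from gh-25395):

    from sympy import Matrix, QQ

    M = Matrix([[1, 2, 3], [4, 5, 6]])
    dM = M.to_DM()  # internal DomainMatrix representation, here over ZZ

    # Gauss-Jordan with division: result over the field QQ.
    rref_qq, pivots = dM.convert_to(QQ).rref()

    # Fraction-free Gauss-Jordan: numerator over ZZ plus a common denominator.
    rref_zz, den, pivots = dM.rref_den()

    # The two results agree after dividing out the denominator.
    assert rref_qq.to_Matrix() == rref_zz.to_Matrix() / rref_zz.domain.to_sympy(den)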

RREF is the main component of a number of matrix operations like Matrix.inv, Matrix.solve and Matrix.nullspace. In turn nullspace is used by eigenvects and jordan_normal_form, but eigenvects was already made a lot faster by having it use DomainMatrix directly (it would benefit from using the new fraction-free routines but I haven't made any change to that here). Also linsolve uses DomainMatrix.rref directly and should be changed to use rref_den instead, but I haven't changed that here. Finally, rref is used as part of the risch and heurisch integration routines, although these already use DomainMatrix more directly: they benefit from the improvements to DomainMatrix.rref but not from the fact that Matrix.rref now uses DomainMatrix.rref.

Here are some timings to illustrate the effects of this. With master the time to invert a matrix of integers (values from 1 to 100) is:

In [1]: for n in [1,2,5,10,15,16,17,18,19,20,30,40,50,100,200]:
   ...:     M = randMatrix(n)
   ...:     print()
   ...:     print(f'Time to invert a {n} by {n} matrix:')
   ...:     %time ok = M.inv()
   ...: 

Time to invert a 1 by 1 matrix:
CPU times: user 1.36 ms, sys: 0 ns, total: 1.36 ms
Wall time: 1.36 ms

Time to invert a 2 by 2 matrix:
CPU times: user 2.48 ms, sys: 0 ns, total: 2.48 ms
Wall time: 2.48 ms

Time to invert a 5 by 5 matrix:
CPU times: user 17.7 ms, sys: 0 ns, total: 17.7 ms
Wall time: 17 ms

Time to invert a 10 by 10 matrix:
CPU times: user 84.9 ms, sys: 3.7 ms, total: 88.6 ms
Wall time: 88 ms

Time to invert a 15 by 15 matrix:
CPU times: user 3.15 s, sys: 19.9 ms, total: 3.17 s
Wall time: 3.17 s

Time to invert a 16 by 16 matrix:
CPU times: user 7.41 s, sys: 39.9 ms, total: 7.45 s
Wall time: 7.45 s

Time to invert a 17 by 17 matrix:
CPU times: user 30.6 s, sys: 52 ms, total: 30.7 s
Wall time: 30.7 s

Time to invert a 18 by 18 matrix:
CPU times: user 1min 42s, sys: 108 ms, total: 1min 43s
Wall time: 1min 43s

Time to invert a 19 by 19 matrix:
CPU times: user 4min 32s, sys: 320 ms, total: 4min 32s
Wall time: 4min 32s

Time to invert a 20 by 20 matrix:
^C---------------------------------------------------------------------------
KeyboardInterrupt

We can see that as we go from 16x16 to 17x17 and so on, each time n increases by 1 the runtime increases by a factor of about 3, meaning that the runtime scales exponentially like 3^n. We can guess that the time to invert a 20x20 matrix would be something like 15 minutes, but I haven't waited that long to test it.

With this PR the equivalent timings are (note that I have gmpy2 installed, which affects this):

In [1]: for n in [1,2,5,10,15,16,17,18,19,20,30,40,50,100,200]:
   ...:     M = randMatrix(n)
   ...:     print()
   ...:     print(f'Time to invert a {n} by {n} matrix:')
   ...:     %time ok = M.inv()
   ...: 

Time to invert a 1 by 1 matrix:
CPU times: user 1.28 ms, sys: 235 µs, total: 1.51 ms
Wall time: 1.53 ms

Time to invert a 2 by 2 matrix:
CPU times: user 958 µs, sys: 177 µs, total: 1.14 ms
Wall time: 1.15 ms

Time to invert a 5 by 5 matrix:
CPU times: user 2.77 ms, sys: 2.74 ms, total: 5.51 ms
Wall time: 5.45 ms

Time to invert a 10 by 10 matrix:
CPU times: user 15.8 ms, sys: 0 ns, total: 15.8 ms
Wall time: 15.7 ms

Time to invert a 15 by 15 matrix:
CPU times: user 21.9 ms, sys: 0 ns, total: 21.9 ms
Wall time: 21.6 ms

Time to invert a 16 by 16 matrix:
CPU times: user 23 ms, sys: 217 µs, total: 23.2 ms
Wall time: 22.8 ms

Time to invert a 17 by 17 matrix:
CPU times: user 24.7 ms, sys: 16 µs, total: 24.7 ms
Wall time: 24.7 ms

Time to invert a 18 by 18 matrix:
CPU times: user 27.2 ms, sys: 0 ns, total: 27.2 ms
Wall time: 27.2 ms

Time to invert a 19 by 19 matrix:
CPU times: user 30.6 ms, sys: 0 ns, total: 30.6 ms
Wall time: 30.6 ms

Time to invert a 20 by 20 matrix:
CPU times: user 24.7 ms, sys: 0 ns, total: 24.7 ms
Wall time: 24.8 ms

Time to invert a 30 by 30 matrix:
CPU times: user 57.9 ms, sys: 0 ns, total: 57.9 ms
Wall time: 57.9 ms

Time to invert a 40 by 40 matrix:
CPU times: user 122 ms, sys: 0 ns, total: 122 ms
Wall time: 122 ms

Time to invert a 50 by 50 matrix:
CPU times: user 216 ms, sys: 182 µs, total: 216 ms
Wall time: 216 ms

Time to invert a 100 by 100 matrix:
CPU times: user 1.67 s, sys: 12.8 ms, total: 1.68 s
Wall time: 1.68 s

Time to invert a 200 by 200 matrix:
CPU times: user 17.6 s, sys: 32.8 ms, total: 17.6 s
Wall time: 17.6 s

So this is 2x faster for a 2x2 matrix, scaling up to about 10000x faster for a 19x19 matrix, and probably a much bigger ratio at 200x200, but no one would wait long enough to see timings for 200x200 using the existing Matrix.rref routine. These timings seem to scale somewhere in between n^3 and n^4. I believe that asymptotically, for extremely large matrices of extremely large integers, the algorithm here should be O(n^5) if asymptotically optimal integer multiplication is used.

Note that CPython does not have asymptotically optimal integer multiplication whereas gmpy2 does. I think that the reasons the existing Matrix.rref method is so slow as n scales up are that:

  • It uses division-free rather than fraction-free Gauss-Jordan. This approach has exponential runtime in n even if asymptotically optimal multiplication is used.
  • It uses Rational rather than MPQ. The pure Python version of MPQ has faster arithmetic than Rational, and when gmpy2 is installed its mpq is much faster still for small rationals and asymptotically much faster for large rationals (see the sketch below).
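As an illustration, here is a rough micro-benchmark of the two rational types (MPQ here is sympy.external.gmpy.MPQ, which is gmpy2's mpq when gmpy2 is installed and PythonMPQ otherwise):

    from timeit import timeit

    from sympy import Rational
    from sympy.external.gmpy import MPQ  # gmpy2.mpq if available, else PythonMPQ

    r1, r2 = Rational(3, 7), Rational(5, 11)
    q1, q2 = MPQ(3, 7), MPQ(5, 11)

    # Times a mix of multiplication and addition on small rationals.
    print(timeit(lambda: r1*r2 + r1, number=100_000))
    print(timeit(lambda: q1*q2 + q1, number=100_000))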

Another approach that could be used is to change Rational (and Integer) to use MPQ and MPZ so that they can use gmpy2 when it is installed and can also take advantage of PythonMPQ's faster arithmetic even when gmpy2 is not installed. This would speed up many things but RREF would still be slow because of the division-free approach that is used. Another possibility is that Matrix.rref could check to see if the matrix is all integers and then use fraction-free RREF, but that is exactly what DomainMatrix does for us: identify an appropriate domain and then use the best algorithms for that domain.

Other comments

The approach here is to speed up rref for Matrix by checking to see if the matrix is all rational numbers or all integers. This is usually a very quick operation since the internal _rep attribute of a Matrix is already a DomainMatrix and is usually already over ZZ or QQ if the matrix is all rationals. Since matrices are mutable, though, it is possible that the internal _rep is over EXRAW even though its entries are all rational, so we check all entries in that case. This should still be a small cost compared to the time taken to compute the actual RREF, which is a heavy operation.
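A hypothetical sketch of that check (the names here are illustrative, not the actual implementation):

    def _all_rational(M):
        # M._rep is the internal DomainMatrix; its domain is usually
        # already ZZ or QQ when every entry is rational.
        dom = M._rep.domain
        if dom.is_ZZ or dom.is_QQ:
            return True
        # After mutation the rep can be over EXRAW even when every entry
        # is rational, so fall back to checking the entries one by one.
        return all(e.is_Rational for e in M)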

This approach could be extended to other simple domains like RR, CC, ZZ_I, QQ_I and QQ<a>. In the case of QQ<a> we need to be careful that finding a primitive element can be slower than the rref itself, so there should be some limits.

Other more interesting domains are polynomial rings and function fields like QQ[x,y] or QQ(x,y). For these kinds of domains rref is fundamentally not the right operation though: rref_den with clearing of denominators should be used instead, so the way to speed these up is to change methods like inv so that they do not attempt to use rref, rather than speeding up rref itself.
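For example, a small sketch of rref_den over a polynomial ring (again assuming the (rref, den, pivots) return convention; den absorbs what would otherwise become rational-function denominators):

    from sympy import Matrix
    from sympy.abc import x

    M = Matrix([[x, 1], [1, x]])
    dM = M.to_DM()  # DomainMatrix over ZZ[x]

    num, den, pivots = dM.rref_den()
    # num/den is the RREF; den is a polynomial (x**2 - 1 up to sign),
    # so no field of fractions ever needs to be constructed.
    print(num.to_Matrix(), den)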

This also sets the scene for making use of python_flint to speed up matrix operations: once we know we are performing an operation like rref over QQ then the corresponding flint routines could be used, which are a lot faster for dense matrices. Here are comparable timings with python_flint:

In [1]: import flint

In [2]: to_FM = lambda M: flint.fmpq_mat([[int(e) for e in row] for row in M.tolist()])

In [3]: for n in [1,2,5,10,15,16,17,18,19,20,30,40,50,100,200]:
   ...:     M = randMatrix(n)
   ...:     print()
   ...:     print(f'Time to invert a {n} by {n} matrix:')
   ...:     fM = to_FM(M)
   ...:     %time ok = fM.inv()
   ...: 

Time to invert a 1 by 1 matrix:
CPU times: user 27 µs, sys: 1 µs, total: 28 µs
Wall time: 31.9 µs

Time to invert a 2 by 2 matrix:
CPU times: user 962 µs, sys: 51 µs, total: 1.01 ms
Wall time: 4.7 ms

Time to invert a 5 by 5 matrix:
CPU times: user 0 ns, sys: 779 µs, total: 779 µs
Wall time: 2.85 ms

Time to invert a 10 by 10 matrix:
CPU times: user 177 µs, sys: 9 µs, total: 186 µs
Wall time: 212 µs

Time to invert a 15 by 15 matrix:
CPU times: user 1.12 ms, sys: 59 µs, total: 1.17 ms
Wall time: 1.86 ms

Time to invert a 16 by 16 matrix:
CPU times: user 2.38 ms, sys: 127 µs, total: 2.51 ms
Wall time: 15.6 ms

Time to invert a 17 by 17 matrix:
CPU times: user 1.55 ms, sys: 52 µs, total: 1.6 ms
Wall time: 1.61 ms

Time to invert a 18 by 18 matrix:
CPU times: user 1.83 ms, sys: 0 ns, total: 1.83 ms
Wall time: 1.87 ms

Time to invert a 19 by 19 matrix:
CPU times: user 1.42 ms, sys: 0 ns, total: 1.42 ms
Wall time: 1.43 ms

Time to invert a 20 by 20 matrix:
CPU times: user 2.02 ms, sys: 0 ns, total: 2.02 ms
Wall time: 2.04 ms

Time to invert a 30 by 30 matrix:
CPU times: user 6.09 ms, sys: 3.97 ms, total: 10.1 ms
Wall time: 10.1 ms

Time to invert a 40 by 40 matrix:
CPU times: user 16.8 ms, sys: 0 ns, total: 16.8 ms
Wall time: 16.8 ms

Time to invert a 50 by 50 matrix:
CPU times: user 36.3 ms, sys: 0 ns, total: 36.3 ms
Wall time: 36.4 ms

Time to invert a 100 by 100 matrix:
CPU times: user 284 ms, sys: 3.87 ms, total: 288 ms
Wall time: 288 ms

Time to invert a 200 by 200 matrix:
CPU times: user 3.29 s, sys: 23.7 ms, total: 3.31 s
Wall time: 3.31 s

Broadly this is about 5x faster than DomainMatrix, although I think the difference widens as you go up to extremely large matrices because flint uses asymptotically optimal matrix multiplication and for large matrices delegates tasks like inverse to matrix multiplication. What these timings show, though, is that the pure Python implementation of DomainMatrix is not far from the performance of a highly optimised C implementation.

Here are comparable timings using symengine which has a C++ implementation of Matrix. A symengine Matrix is like a sympy Matrix in that it allows arbitrary expressions in the matrix and uses its equivalent of Expr rather than choosing particular domains like python_flint or DomainMatrix do. This means that it is not as fast when there is a well-defined domain as in this case, but otherwise I am not sure what algorithm it uses for inv or rref:

In [1]: import symengine

In [2]: to_SE = lambda M: symengine.Matrix([[int(e) for e in row] for row in M.tolist()])

In [3]: for n in [1,2,5,10,15,16,17,18,19,20,30,40,50,100,200]:
   ...:     M = randMatrix(n)
   ...:     print()
   ...:     print(f'Time to invert a {n} by {n} matrix:')
   ...:     sM = to_SE(M)
   ...:     %time ok = sM.inv()
   ...: 

Time to invert a 1 by 1 matrix:
CPU times: user 41 µs, sys: 5 µs, total: 46 µs
Wall time: 50.5 µs

Time to invert a 2 by 2 matrix:
CPU times: user 44 µs, sys: 5 µs, total: 49 µs
Wall time: 53.6 µs

Time to invert a 5 by 5 matrix:
CPU times: user 212 µs, sys: 0 ns, total: 212 µs
Wall time: 216 µs

Time to invert a 10 by 10 matrix:
CPU times: user 4.29 ms, sys: 0 ns, total: 4.29 ms
Wall time: 4.31 ms

Time to invert a 15 by 15 matrix:
CPU times: user 13.6 ms, sys: 162 µs, total: 13.8 ms
Wall time: 13.4 ms

Time to invert a 16 by 16 matrix:
CPU times: user 18.6 ms, sys: 0 ns, total: 18.6 ms
Wall time: 18.5 ms

Time to invert a 17 by 17 matrix:
CPU times: user 19.8 ms, sys: 0 ns, total: 19.8 ms
Wall time: 19.8 ms

Time to invert a 18 by 18 matrix:
CPU times: user 23.7 ms, sys: 0 ns, total: 23.7 ms
Wall time: 23.8 ms

Time to invert a 19 by 19 matrix:
CPU times: user 26.8 ms, sys: 0 ns, total: 26.8 ms
Wall time: 26.9 ms

Time to invert a 20 by 20 matrix:
CPU times: user 32.9 ms, sys: 0 ns, total: 32.9 ms
Wall time: 32.9 ms

Time to invert a 30 by 30 matrix:
CPU times: user 131 ms, sys: 0 ns, total: 131 ms
Wall time: 131 ms

Time to invert a 40 by 40 matrix:
CPU times: user 386 ms, sys: 0 ns, total: 386 ms
Wall time: 386 ms

Time to invert a 50 by 50 matrix:
CPU times: user 866 ms, sys: 2.72 ms, total: 869 ms
Wall time: 870 ms

Time to invert a 100 by 100 matrix:
CPU times: user 12.5 s, sys: 2.05 ms, total: 12.5 s
Wall time: 12.5 s

Time to invert a 200 by 200 matrix:
CPU times: user 3min 24s, sys: 95.2 ms, total: 3min 24s
Wall time: 3min 24s

So SymEngine is faster for very small matrices but becomes slower than DomainMatrix as we scale up to very large matrices. My guess is that these differences are explained by the fact that at small sizes pure Python overhead slows down DomainMatrix, but at larger sizes the effect of an asymptotically better algorithm (fraction-free Gauss-Jordan) means that DomainMatrix is faster.

Release Notes

If gh-25395 is not merged before this then the release notes from there should be included here if this is merged.

  • matrices
    • The DomainMatrix.rref and rref_den methods have been optimised for the ZZ and QQ domains. The new methods for ZZ select between using division or fraction-free Gauss-Jordan elimination depending on sparseness, bit counts, and matrix dimensions. For QQ the denominators are checked to consider whether to clear denominators and use ZZ or otherwise use division over QQ.
    • Matrix.rref now uses DomainMatrix.rref when the matrix consists only of integers or rational numbers. This makes the rref operation a lot faster and speeds up many other matrix operations that use rref such as computing a matrix inverse, nullspace or solving a matrix equation. The exact timing difference depends on a number of factors and is larger for larger matrices. As an example, computing the inverse of a dense 19x19 matrix of integer values between 0 and 100 is about 10000x faster. The speed difference is bigger if gmpy2 is installed since that speeds up elementary integer and rational arithmetic and will now be used by Matrix.rref when possible.

@sympy-bot

sympy-bot commented Jul 29, 2023

Hi, I am the SymPy bot (v170). I'm here to help you write a release notes entry. Please read the guide on how to write release notes.

Your release notes are in good order.

Here is what the release notes will look like:

  • matrices
    • The DomainMatrix.rref and rref_den methods have been optimised for the ZZ and QQ domains. The new methods for ZZ select between using division or fraction-free Gauss-Jordan elimination depending on sparseness, bit counts, and matrix dimensions. For QQ the denominators are checked to consider whether to clear denominators and use ZZ or otherwise use division over QQ. (#25443 by @oscarbenjamin)

    • Matrix.rref now uses DomainMatrix.rref when the matrix consists only of integers or rational numbers. This makes the rref operation a lot faster and speeds up many other matrix operations that use rref such as computing a matrix inverse, nullspace or solving a matrix equation. The exact timing difference depends on a number of factors and is larger for larger matrices. As an example, computing the inverse of a dense 19x19 matrix of integer values between 0 and 100 is about 10000x faster. The speed difference is bigger if gmpy2 is installed since that speeds up elementary integer and rational arithmetic and will now be used by Matrix.rref when possible. (#25443 by @oscarbenjamin)

This will be added to https://github.com/sympy/sympy/wiki/Release-Notes-for-1.13.


Update

The release notes on the wiki have been updated.

@sympy-bot

sympy-bot commented Jul 29, 2023

🟠

Hi, I am the SymPy bot (v170). I've noticed that some of your commits add or delete files. Since this is sometimes done unintentionally, I wanted to alert you about it.

This is an experimental feature of SymPy Bot. If you have any feedback on it, please comment at sympy/sympy-bot#75.

The following commits add new files:

  • 24b6f37:
    • sympy/polys/matrices/rref.py

If these files were added/deleted on purpose, you can ignore this message.

@oscarbenjamin
Collaborator Author

This is built on top of the PR gh-25395 (which I had hoped could be merged before opening this PR). The sparse version of rref_den added there is always faster than the dense implementation added in gh-25346 and should be the main workhorse for most RREF operations.

Note that the diff here would be a lot smaller if gh-25395 were merged. I wanted this to be two separate PRs: one adds the algorithm for sparse fraction-free elimination to DomainMatrix and then this PR adds the code for selecting the fastest algorithm and for making Matrix.rref use DomainMatrix.rref.

@oscarbenjamin
Collaborator Author

There is some complex logic in rref.py to choose between different algorithms depending on the sparseness, shape and bit-counts of the matrices. This is discussed more in gh-25410 but I will just summarise the key points here with some plots from the timing script at:
https://github.com/oscarbenjamin/sympy_bench/blob/main/run_timings.py

For matrices over ZZ the fastest algorithm/implementation is always either the sparse implementation of fraction-free elimination over ZZ (sdm_rref_den) or the sparse implementation of elimination with division over QQ (sdm_irref). For ZZ we only need to choose between these two implementations, which are asymptotically optimal in different regimes. In the extreme of large dense matrices with large integers the fraction-free sdm_rref_den is asymptotically fastest. In the extreme of very sparse matrices with small integers plain elimination with division (sdm_irref) is asymptotically faster. In between we need to choose between these and I have used empirically determined thresholds.
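For instance, a sketch of how the selection is exercised (assuming rref over ZZ as enabled by this PR; both calls go through the same DomainMatrix.rref entry point):

    from sympy import randMatrix

    # Dense 30x30 with entries 1..100 vs a very sparse 500x500 (~1% nonzero).
    dense = randMatrix(30, min=1, max=100).to_DM()
    sparse = randMatrix(500, percent=1).to_DM()

    # The hybrid logic picks fraction-free elimination over ZZ for the
    # dense matrix and elimination with division over QQ for the sparse one.
    _, dense_pivots = dense.rref()
    _, sparse_pivots = sparse.rref()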

The plots below show timings for Matrix with this PR and compare four different configurations:

  1. DMs_QQ means use elimination with division over QQ (sdm_irref).
  2. DMs_ZZ2 means use fraction-free elimination over ZZ (sdm_rref_den).
  3. DMs_ZZ_hybrid shows the hybrid approach that attempts to choose the fastest method.
  4. Matrix shows timings for Matrix with this PR which effectively makes Matrix use the DMs_ZZ_hybrid approach.

This first plot shows that fraction-free elimination is fastest for dense matrices. Here we have dense matrices of 100-bit integers with varying size:

./run_timings.py --matrices=ZZnn_d --n=1,2,3,4,5,6,7,8,9,10,12,15,17,20,25,30 --repeat=100 --plot --timeout=1 --bits=100 --exclude=DMd_QQ,DMd_ZZ2

The output is:

      | Matrix   DMs_QQ   DMs_ZZ2  DMs_ZZ_hybrid
------------------------------------------------
  1x1 | 160.1 us 128.9 us 162.6 us 177.3 us     
  2x2 | 212.0 us 117.9 us 175.9 us 197.8 us     
  3x3 | 279.5 us 141.8 us 213.6 us 231.8 us     
  4x4 | 268.3 us 302.6 us 276.6 us 317.4 us     
  5x5 | 423.8 us 434.8 us 320.5 us 339.1 us     
  6x6 | 412.1 us 598.4 us 434.7 us 382.5 us     
  7x7 | 443.2 us   1.1 ms 462.8 us 504.5 us     
  8x8 | 670.8 us   1.5 ms 591.9 us 686.7 us     
  9x9 | 781.3 us   2.3 ms 749.7 us 783.4 us     
10x10 |   1.1 ms   3.5 ms   1.1 ms   1.2 ms     
12x12 |   1.5 ms   6.7 ms   1.2 ms   1.3 ms     
15x15 |   2.4 ms  14.9 ms   2.3 ms   3.0 ms     
17x17 |   3.5 ms  26.4 ms   3.4 ms   3.2 ms     
20x20 |   6.4 ms  49.7 ms   6.9 ms   5.9 ms     
25x25 |  13.8 ms 105.9 ms  13.7 ms  14.5 ms     
30x30 |  28.8 ms 221.8 ms  29.0 ms  29.8 ms  

[plot: rref timings]
Asymptotically for large matrices DMs_QQ is slower, and we can see that the hybrid approach correctly selects the fraction-free method so its timings follow DMs_ZZ2 instead. For very small matrices here DMs_QQ is faster so perhaps it should be chosen, but it would not be faster at very large bit sizes, and at very small bit sizes even just checking the bit size might slow things down by the sort of difference that you can see here (because the total time is so small). A future improvement could be to carefully optimise exactly how to handle matrices with e.g. 1, 2 or 3 rows because a dedicated routine for those cases would probably be best.

The next plot shows the other extreme of very sparse matrices having approximately one nonzero entry per row and with small bit sizes:

./run_timings.py --matrices=ZZnn_s --n=10,20,50,100,200,500,1000,2000,5000,10000 --repeat=100 --plot --timeout=1 --bits=10 --exclude=DMd_QQ,DMd_ZZ2

            | Matrix   DMs_QQ   DMs_ZZ2  DMs_ZZ_hybrid
------------------------------------------------------
      10x10 | 440.8 us 149.0 us 261.8 us 444.6 us     
      20x20 | 637.2 us 286.3 us 414.3 us 581.9 us     
      50x50 |   1.1 ms 461.6 us 971.7 us   1.3 ms     
    100x100 |   2.2 ms   1.0 ms   3.2 ms   3.1 ms     
    200x200 |   4.8 ms   1.7 ms   8.3 ms   4.0 ms     
    500x500 |  13.0 ms   5.9 ms  82.9 ms  10.0 ms     
  1000x1000 |  21.9 ms  10.6 ms 749.1 ms  19.6 ms     
  2000x2000 |  40.7 ms  22.0 ms >1.0 s    42.0 ms     
  5000x5000 |  92.5 ms  53.9 ms >1.0 s    90.1 ms     
10000x10000 | 178.5 ms  96.3 ms >1.0 s   178.7 ms 

[plot: rref timings]
In this limit DMs_QQ is fastest and the hybrid approach chooses to follow it, giving asymptotically the best performance. The hybrid method is a bit slower than DMs_QQ itself because it clears denominators to return a result over ZZ. (Thinking about it as I write this: Matrix.rref always wants an end result over QQ, so the best thing would be to have an option for DomainMatrix.rref over ZZ to return a result over QQ without bothering to clear denominators. I will make this change for a 2x speedup.)
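For reference, recovering the conventional RREF over QQ from a fraction-free result is just a division by the common denominator (a sketch, assuming rref_den's (rref, den, pivots) return):

    from sympy import Matrix

    dM = Matrix([[1, 2], [3, 4]]).to_DM()
    num, den, pivots = dM.rref_den()
    # Divide through at the Expr level to get the usual RREF over QQ.
    rref_qq = num.to_Matrix() / num.domain.to_sympy(den)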

So we see that there are extremes of dense vs sparse where we need to choose between the two DomainMatrix RREF algorithms. In between there is a crossover where one becomes faster than the other and I have tried to measure that crossover empirically. Note that the most important thing is to get the extremes right because getting the crossover exactly right is impossible in general. The critical parameter is the "density" of the matrix, which is defined as the average number of nonzero entries per row (counting only rows that are not completely zero). The best density threshold depends on the size, shape and the bit count of the integers. The formula used in the code is:

    wideness = max(1, 2/3*ncols/nrows_nz)

    max_density = (5 + 10000/(nrows_nz*bits**2)) * wideness

    if density < max_density:
        return 'rref_QQ'
    else:
        return 'rref_den_ZZ'

Here 10000 is just a number I found empirically, as is the rest of the formula, but I have tested this with different combinations of bit counts, shapes, sizes etc. The next plot shows what happens as we vary the density for sparse 100x100 matrices of 10-bit integers:

./run_timings.py --matrices=ZZd_d --n=1,2,3,4,5,6,7,8,9,10 --size=100 --repeat=10 --plot --timeout=1 --bits=10 --exclude=DMd_QQ,DMd_ZZ2

     | Matrix   DMs_QQ   DMs_ZZ2  DMs_ZZ_hybrid
-----------------------------------------------
 d=1 |   1.6 ms   1.3 ms   2.8 ms   1.6 ms     
 d=2 |   3.8 ms   1.8 ms  10.0 ms   3.3 ms     
 d=3 |   9.4 ms   6.3 ms  44.6 ms   8.2 ms     
 d=4 |  70.5 ms  67.5 ms  98.8 ms  68.0 ms     
 d=5 | 244.9 ms 246.1 ms 134.9 ms 248.6 ms     
 d=6 | 198.3 ms 428.4 ms 169.8 ms 201.1 ms     
 d=7 | 191.9 ms 607.5 ms 191.7 ms 192.2 ms     
 d=8 | 211.0 ms 765.5 ms 211.4 ms 213.0 ms     
 d=9 | 220.2 ms >1.0 s   219.9 ms 222.4 ms     
d=10 | 225.4 ms >1.0 s   224.0 ms 228.8 ms

[plot: rref timings]
We can see that at low densities DMs_QQ is fastest and at high densities DMs_ZZ2 is fastest. The empirical formula tries to select between these to find the crossover point so that the performance of Matrix should equal the better of the two algorithms at all densities. You can see that the threshold is not perfect at d=5 (I'm not sure that it can be) but it does capture the right asymptotics in the extremes of low vs high density. Also at low densities Matrix is a bit slower than DMs_QQ, which is again due to clearing denominators (again, I'll fix that).
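As an aside, the density measure used above can be computed directly from the sparse representation; a minimal sketch (illustrative, not the code in rref.py):

    from sympy import randMatrix

    def density(dM):
        # The SDM rep maps row index -> {column index: nonzero element}.
        sdm = dM.to_sparse().rep
        nonzero_rows = [row for row in sdm.values() if row]
        if not nonzero_rows:
            return 0.0
        return sum(len(row) for row in nonzero_rows) / len(nonzero_rows)

    print(density(randMatrix(100, percent=5).to_DM()))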

@oscarbenjamin
Collaborator Author

A very odd test case has failed:

def test_issue_23718():
    f = 1/(b*cos(x) + a*sin(x))
    Fpos = (-log(-a/b + tan(x/2) - sqrt(a**2 + b**2)/b)/sqrt(a**2 + b**2)
            + log(-a/b + tan(x/2) + sqrt(a**2 + b**2)/b)/sqrt(a**2 + b**2))
    F = Piecewise(
        # XXX: The zoo case here is for a=b=0 so it should just be zoo or maybe
        # it doesn't really need to be included at all given that the original
        # integrand is really undefined in that case anyway.
        (zoo*(-log(tan(x/2) - 1) + log(tan(x/2) + 1)), Eq(a, 0) & Eq(b, 0)),
        (log(tan(x/2))/a, Eq(b, 0)),
        (-I/(-I*b*sin(x) + b*cos(x)), Eq(a, -I*b)),
        (I/(I*b*sin(x) + b*cos(x)), Eq(a, I*b)),
        (Fpos, True),
    )
    assert integrate(f, x) == F

I'm not sure that the test is completely valid, but at least the intention was that this PR should not change anything here because the RREF over ZZ or QQ is well defined and unique, so the difference in algorithm should not lead to any change in behaviour. I will need to investigate what has caused that.

The result is computed by heurisch so somehow this change has changed the output of heurisch. I think that I have seen this case before and that it is the reason that heurisch still uses the dense implementation ddm_irref instead of the sparse implementation. That was because of domains other than ZZ or QQ though: I think the problem was to do with a nonsensical domain like QQ[a, sqrt(a**2)]. Actually that domain is not constructed and the EX domain is used, but the expressions involved contain both a and sqrt(a**2), which cannot be simplified to a canonical form.

Oh, I see the problem. This is causing the sparse implementation to be used instead of the dense implementation for DomainMatrix.rref when the domain is not ZZ or QQ. Theoretically those should give equivalent results, but when we have both a and sqrt(a**2) in the coefficients it is just not possible to give sensible results.

A quick fix is that heurisch could use a use_dense=True flag for nontrivial domains: this would achieve the intention of this PR, which is that its effect should only be on speed and it should not otherwise change the output of anything. I will do that now.

A better fix would be to prevent heurisch from creating the expression sqrt(a**2) in the first place. It comes from calling solve and then substituting somewhere, possibly here:

slns0 += solve([d], dict=True, exclude=(x,))

(Calling solve on an arbitrary expression without specifying what to solve for is not really very well defined and should be avoided in core algorithms.)

@oscarbenjamin
Collaborator Author

Looking into it, the integration problem is actually caused by this matrix:

a, b = symbols('a, b')

M = Matrix([
[0,                 0,                 0,                  0,                  0,                  0,                  0,                 0,               b**3,                  0,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0,                  0,                  0,                  0,                 0,          -4*a*b**2,                  0,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0,             6*b**3,                  0,                  0,                 0,           4*a**2*b,                  0,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0,         -24*a*b**2,                  0,                  0,                 0,          -4*a*b**2,                  0,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0, 24*a**2*b - 6*b**3,                  0,                  0,                 0,  8*a**2*b - 2*b**3,                  0,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0,                  0,                  0,                  0,                 0,           4*a*b**2,                  0,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0, 24*a**2*b - 6*b**3,                  0,                  0,                 0,           4*a**2*b,                  0,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0,          24*a*b**2,                  0,                  0,                 0,           4*a*b**2,                  0,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0,             6*b**3,                  0,                  0,                 0,               b**3,                  0,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0,                  0,                  0,                  0,                 0,                  0,             2*b**3,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0,                  0,                  0,                  0,              b**3,                  0,          -8*a*b**2,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0,                  0,                  0,                  0,         -4*a*b**2,             4*b**3,           8*a**2*b,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,             4*b**3,                  0,                  0,                  0,          4*a**2*b,         -16*a*b**2,          -8*a*b**2,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,         -16*a*b**2,                  0,                  0,                  0,         -4*a*b**2, 16*a**2*b - 4*b**3, 16*a**2*b - 4*b**3,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0, 16*a**2*b - 4*b**3,                  0,                  0,                  0, 8*a**2*b - 2*b**3,                  0,           8*a*b**2,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0,                  0,                  0,                  0,          4*a*b**2, 16*a**2*b - 4*b**3,           8*a**2*b,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0, 16*a**2*b - 4*b**3,                  0,                  0,                  0,          4*a**2*b,          16*a*b**2,           8*a*b**2,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,          16*a*b**2,                  0,                  0,                  0,          4*a*b**2,             4*b**3,             2*b**3,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,             4*b**3,                  0,                  0,                  0,              b**3,                  0,                  0,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0,                  0,                  0,             3*b**3,                 0,                  0,                  0,                                         0,                                         0,                 0,               0,       0],
[0,                 0,                 0,                  0,                  0,             2*b**3,         -12*a*b**2,                 0,                  0,                  0,                                         0,                                         0,                 0,               0,       0],
[0,                 0,              b**3,                  0,                  0,          -8*a*b**2,          12*a**2*b,                 0,                  0,             2*b**3,                                         0,                                         0,                 0,               0,       0],
[0,                 0,         -4*a*b**2,                  0,                  0,           8*a**2*b,         -12*a*b**2,            2*b**3,                  0,          -8*a*b**2,                                      b**3,                                      b**3,            2*b**3,               0,       0],
[0,            2*b**3,          4*a**2*b,                  0,                  0,          -8*a*b**2, 24*a**2*b - 6*b**3,         -8*a*b**2,                  0,  8*a**2*b - 2*b**3,        -3*a*b**2 + b**2*sqrt(a**2 + b**2),        -3*a*b**2 - b**2*sqrt(a**2 + b**2),         -8*a*b**2,            b**3, -2*b**2],
[0,         -8*a*b**2,         -4*a*b**2,                  0,                  0, 16*a**2*b - 4*b**3,          12*a*b**2, 8*a**2*b - 2*b**3,                  0,                  0, 2*a**2*b - 2*a*b*sqrt(a**2 + b**2) + b**3, 2*a**2*b + 2*a*b*sqrt(a**2 + b**2) + b**3, 8*a**2*b - 2*b**3,       -4*a*b**2,   4*a*b],
[0, 8*a**2*b - 2*b**3, 8*a**2*b - 2*b**3,                  0,                  0,           8*a*b**2,          12*a**2*b,                 0,                  0,  8*a**2*b - 2*b**3,        -5*a*b**2 + b**2*sqrt(a**2 + b**2),        -5*a*b**2 - b**2*sqrt(a**2 + b**2),                 0, 4*a**2*b - b**3, -2*b**2],
[0,                 0,          4*a*b**2,                  0,                  0,           8*a**2*b,          12*a*b**2, 8*a**2*b - 2*b**3,                  0,           8*a*b**2, 4*a**2*b - 4*a*b*sqrt(a**2 + b**2) - b**3, 4*a**2*b + 4*a*b*sqrt(a**2 + b**2) - b**3, 8*a**2*b - 2*b**3,               0,   8*a*b],
[0, 8*a**2*b - 2*b**3,          4*a**2*b,                  0,                  0,           8*a*b**2,             3*b**3,          8*a*b**2,                  0,             2*b**3,          -a*b**2 - b**2*sqrt(a**2 + b**2),          -a*b**2 + b**2*sqrt(a**2 + b**2),          8*a*b**2, 4*a**2*b - b**3,  2*b**2],
[0,          8*a*b**2,          4*a*b**2,                  0,                  0,             2*b**3,                  0,            2*b**3,                  0,                  0, 2*a**2*b - 2*a*b*sqrt(a**2 + b**2) - b**3, 2*a**2*b + 2*a*b*sqrt(a**2 + b**2) - b**3,            2*b**3,        4*a*b**2,   4*a*b],
[0,            2*b**3,              b**3,                  0,                  0,                  0,                  0,                 0,                  0,                  0,           a*b**2 - b**2*sqrt(a**2 + b**2),           a*b**2 + b**2*sqrt(a**2 + b**2),                 0,            b**3,  2*b**2]])

M1 = M.to_DM(EX).rref_gj_div()[0].to_Matrix()
M2 = M.to_DM(EX).to_dense().rref_gj_div()[0].to_Matrix()

print(M1 == M2)

The two algorithms process the arithmetic in a different order and end up with different end results because the domain is EX and canonical results are not guaranteed. Actually the two results are equivalent but the result from the sparse implementation is much more complicated. This is somehow because cancel (which is used by EX for simplification and pseudo-canonicalisation) has not been able to achieve the same simplification in both cases. The RREF generated here has this sort of shape:

In [19]: M1.print_nonzero()
[ X           X ]
[  X            ]
[   X           ]
[    X          ]
[     X         ]
[      X        ]
[       X       ]
[        X      ]
[         X     ]
[          X   X]
[           X  X]
[            X  ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]
[               ]

The issue is the two nonzero expressions in the final column. From the dense implementation we have:

In [21]: M2[9,-1]
Out[21]: 
    -1      
────────────
   _________
  ╱  2    2 
╲╱  a  + b  

In [22]: M2[10,-1]
Out[22]: 
     1      
────────────
   _________
  ╱  2    2 
╲╱  a  + b  

However the sparse implementation ends up with:

In [23]: print(M1[9,-1])
16*a**5/(16*a**6 - 16*a**5*sqrt(a**2 + b**2) + 28*a**4*b**2 - 20*a**3*b**2*sqrt(a**2 + b**2) + 13*a**2*b**4 - 5*a*b**4*sqrt(a**2 + b**2) + b**6) - 16*a**4*sqrt(a**2 + b**2)/(16*a**6 - 16*a**5*sqrt(a**2 + b**2) + 28*a**4*b**2 - 20*a**3*b**2*sqrt(a**2 + b**2) + 13*a**2*b**4 - 5*a*b**4*sqrt(a**2 + b**2) + b**6) + 20*a**3*b**2/(16*a**6 - 16*a**5*sqrt(a**2 + b**2) + 28*a**4*b**2 - 20*a**3*b**2*sqrt(a**2 + b**2) + 13*a**2*b**4 - 5*a*b**4*sqrt(a**2 + b**2) + b**6) - 12*a**2*b**2*sqrt(a**2 + b**2)/(16*a**6 - 16*a**5*sqrt(a**2 + b**2) + 28*a**4*b**2 - 20*a**3*b**2*sqrt(a**2 + b**2) + 13*a**2*b**4 - 5*a*b**4*sqrt(a**2 + b**2) + b**6) + 5*a*b**4/(16*a**6 - 16*a**5*sqrt(a**2 + b**2) + 28*a**4*b**2 - 20*a**3*b**2*sqrt(a**2 + b**2) + 13*a**2*b**4 - 5*a*b**4*sqrt(a**2 + b**2) + b**6) - b**4*sqrt(a**2 + b**2)/(16*a**6 - 16*a**5*sqrt(a**2 + b**2) + 28*a**4*b**2 - 20*a**3*b**2*sqrt(a**2 + b**2) + 13*a**2*b**4 - 5*a*b**4*sqrt(a**2 + b**2) + b**6)

In [24]: print(M1[10,-1])
-4*a**3/(4*a**4 - 4*a**3*sqrt(a**2 + b**2) + 5*a**2*b**2 - 3*a*b**2*sqrt(a**2 + b**2) + b**4) + 4*a**2*sqrt(a**2 + b**2)/(4*a**4 - 4*a**3*sqrt(a**2 + b**2) + 5*a**2*b**2 - 3*a*b**2*sqrt(a**2 + b**2) + b**4) - 3*a*b**2/(4*a**4 - 4*a**3*sqrt(a**2 + b**2) + 5*a**2*b**2 - 3*a*b**2*sqrt(a**2 + b**2) + b**4) + b**2*sqrt(a**2 + b**2)/(4*a**4 - 4*a**3*sqrt(a**2 + b**2) + 5*a**2*b**2 - 3*a*b**2*sqrt(a**2 + b**2) + b**4)

These two expressions do simplify a little under cancel but radsimp can simplify them all the way:

In [25]: cancel(M1[9,-1])
Out[25]: 
                       _________                          _________                  _________     
          5       4   ╱  2    2        3  2       2  2   ╱  2    2         4    4   ╱  2    2      
      16⋅a  - 16⋅a ⋅╲╱  a  + b   + 20⋅a ⋅b  - 12⋅a ⋅b ⋅╲╱  a  + b   + 5⋅a⋅b  - b ⋅╲╱  a  + b       
───────────────────────────────────────────────────────────────────────────────────────────────────
                 _________                          _________                        _________     
    6       5   ╱  2    2        4  2       3  2   ╱  2    2        2  4        4   ╱  2    2     6
16⋅a  - 16⋅a ⋅╲╱  a  + b   + 28⋅a ⋅b  - 20⋅a ⋅b ⋅╲╱  a  + b   + 13⋅a ⋅b  - 5⋅a⋅b ⋅╲╱  a  + b   + b 

In [26]: cancel(M1[10,-1])
Out[26]: 
                     _________                  _________    
         3      2   ╱  2    2         2    2   ╱  2    2     
    - 4⋅a  + 4⋅a ⋅╲╱  a  + b   - 3⋅a⋅b  + b ⋅╲╱  a  + b      
─────────────────────────────────────────────────────────────
               _________                       _________     
   4      3   ╱  2    2       2  2        2   ╱  2    2     4
4⋅a  - 4⋅a ⋅╲╱  a  + b   + 5⋅a ⋅b  - 3⋅a⋅b ⋅╲╱  a  + b   + b 

In [27]: radsimp(M1[9,-1])
Out[27]: 
    -1      
────────────
   _________
  ╱  2    2 
╲╱  a  + b  

In [28]: radsimp(M1[10,-1])
Out[28]: 
     1      
────────────
   _________
  ╱  2    2 
╲╱  a  + b 

The complicated form that they come out in results from EX simplifying with cancel(expr).expand(): it is the expand that makes the result much more complicated, but the root problem is that cancel cannot reduce these expressions to a simplest canonical form. Perhaps we should just say that if the domain is EX (or EXRAW) then the dense implementation should be used so that at least the output is deterministic regardless of whether the initial matrix is sparse or dense. We can't get fully canonical results from the fraction-free solvers without cancellation though, because the denominators can have extraneous factors.
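
The root cause can be seen in isolation with a small example (my own illustration, not code from this PR): cancel treats the radical as an opaque generator and cannot rationalise the denominator, whereas radsimp can:

from sympy import symbols, sqrt, cancel, radsimp

a, b = symbols('a b')
e = 1/(a - sqrt(a**2 + b**2))

print(cancel(e))   # essentially unchanged: sqrt(a**2 + b**2) is just another generator
print(radsimp(e))  # rationalised: -(a + sqrt(a**2 + b**2))/b**2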

The ideal solution here would be to have a domain that can represent these expressions. It can be done with FiniteMonogenicExtension. I have been working on a QuadRing domain that can represent any number of nested quadratic field extensions using sparse polynomials:

In [75]: K = QuadRing(QQ.frac_field(a, b), [c], [c**2 - (a**2 + b**2)])

In [76]: dM = M.subs(sqrt(a**2+b**2), c).to_DM(K)

In [77]: print(repr(dM.rref()[0].to_Matrix()))
Matrix([
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1/2,                0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,   0, -c/(a**2 + b**2)],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,   0,  c/(a**2 + b**2)],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   0,                0]])

That uses a radsimp-like algorithm to handle division over a field or exquo over a ring.
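
The key identity for a single quadratic extension is just conjugate multiplication; here is a quick check in plain SymPy (my own illustration — the QuadRing code itself works with sparse polynomials rather than Expr):

from sympy import symbols, sqrt, simplify

a, b, x, y = symbols('a b x y')
c = sqrt(a**2 + b**2)

# Division by x + y*c is multiplication by the conjugate x - y*c,
# using c**2 = a**2 + b**2 to clear c from the denominator:
print(simplify(1/(x + y*c) - (x - y*c)/(x**2 - y**2*(a**2 + b**2))))  # 0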

In any case this example is a distraction from the main purpose of this PR, which is ZZ and QQ. I will restore the previous behaviour so that heurisch at least uses the dense implementation for EX.

@github-actions
Copy link

github-actions bot commented Jul 29, 2023

Benchmark results from GitHub Actions

Lower numbers are good, higher numbers are bad. A ratio less than 1
means a speed up and greater than 1 means a slowdown. Green lines
beginning with + are slowdowns (the PR is slower than master or
master is slower than the previous release). Red lines beginning
with - are speedups.

Significantly changed benchmark results (PR vs master)

       before           after         ratio
     [221d7730]       [53007970]
-         162±1μs       90.2±0.2μs     0.56  solve.TimeMatrixOperations.time_rref(3, 0)
-         311±1μs        109±0.2μs     0.35  solve.TimeMatrixOperations.time_rref(4, 0)
-     31.4±0.06ms      13.5±0.02ms     0.43  solve.TimeSolveLinSys189x49.time_solve_lin_sys

Significantly changed benchmark results (master vs previous release)

       before           after         ratio
     [8059df73]       [221d7730]
     <sympy-1.12^0>                 
-      89.6±0.9ms       56.7±0.6ms     0.63  integrate.TimeIntegrationRisch02.time_doit(10)
-        85.3±2ms       55.4±0.3ms     0.65  integrate.TimeIntegrationRisch02.time_doit_risch(10)
-        2.09±0ms          655±1μs     0.31  polys.TimePREM_LinearDenseQuadraticGCD.time_op(3, 'sparse')
-     10.3±0.02ms      1.95±0.01ms     0.19  polys.TimePREM_LinearDenseQuadraticGCD.time_op(5, 'sparse')
-       367±0.6μs       82.5±0.3μs     0.22  polys.TimePREM_QuadraticNonMonicGCD.time_op(1, 'sparse')
-        4.82±0ms        363±0.4μs     0.08  polys.TimePREM_QuadraticNonMonicGCD.time_op(3, 'sparse')
-     10.6±0.02ms         1.09±0ms     0.10  polys.TimePREM_QuadraticNonMonicGCD.time_op(5, 'sparse')
-     6.23±0.01ms      3.96±0.01ms     0.64  polys.TimeSUBRESULTANTS_LinearDenseQuadraticGCD.time_op(2, 'sparse')
-     27.1±0.03ms      12.0±0.02ms     0.44  polys.TimeSUBRESULTANTS_LinearDenseQuadraticGCD.time_op(3, 'sparse')
-     6.93±0.02ms         1.18±0ms     0.17  polys.TimeSUBRESULTANTS_QuadraticNonMonicGCD.time_op(1, 'sparse')
-     16.1±0.02ms      9.26±0.01ms     0.57  polys.TimeSUBRESULTANTS_QuadraticNonMonicGCD.time_op(2, 'sparse')
-      211±0.04ms      70.8±0.07ms     0.34  polys.TimeSUBRESULTANTS_QuadraticNonMonicGCD.time_op(3, 'sparse')
-     6.37±0.01ms          534±1μs     0.08  polys.TimeSUBRESULTANTS_SparseGCDHighDegree.time_op(3, 'sparse')
-     27.7±0.05ms          851±3μs     0.03  polys.TimeSUBRESULTANTS_SparseGCDHighDegree.time_op(5, 'sparse')
-         607±1μs        196±0.6μs     0.32  polys.TimeSUBRESULTANTS_SparseNonMonicQuadratic.time_op(1, 'sparse')
-     6.49±0.01ms          200±1μs     0.03  polys.TimeSUBRESULTANTS_SparseNonMonicQuadratic.time_op(3, 'sparse')
-     17.0±0.04ms        205±0.7μs     0.01  polys.TimeSUBRESULTANTS_SparseNonMonicQuadratic.time_op(5, 'sparse')

Full benchmark results can be found as artifacts in GitHub Actions
(click on checks at the top of the PR).

@oscarbenjamin
Copy link
Collaborator Author

In any case this example is a distraction from the main purpose of this PR, which is ZZ and QQ. I will restore the previous behaviour so that heurisch at least uses the dense implementation for EX.

I have restored the previous behaviour. Now a dense DomainMatrix will not use the sparse RREF algorithm if the domain is EX, which fixes the failing test. Ultimately a better fix for heurisch is needed: either cancel should be improved to handle things like square roots, or heurisch should apply some simplification, or a domain should be added that can handle things like sqrt(a**2 + b**2) in combination with a and b.

Ultimately a domain that can handle this is the best solution. There will of course always be some expressions that need to use the EX domain though, and it is unavoidable that outputs cannot be fully canonical in that case. The reason I don't want to just change the test is that the output comes out much less nice, and I am just not sure whether that is an inherent feature of the sparse algorithm (when used with EX) or just a random case where one method happens to simplify nicely and the other does not.

For now I have restored the previous behaviour, although this PR does change heurisch to use the sparse implementation when the domain is not EX. There are certainly many cases where this will be a big speed improvement: I previously measured big improvements but ended up reverting the change because of the test_issue_23718 failure referred to above. For any domain that has a canonical representation, using the sparse implementation should change only the speed and not e.g. alter the form of the expressions generated as output.

@oscarbenjamin
Copy link
Collaborator Author

The hybrid method is a bit slower than DMs_QQ itself because it clears denominators to return a result over ZZ. (Thinking about it as I write this: Matrix.rref always wants an end result over QQ, so the best thing would be to have an option for DomainMatrix.rref over ZZ to return a result over QQ and not bother clearing denominators. I will make this change for a 2x speedup.)
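
The change is essentially that the fraction-free result over ZZ needs only one final elementwise division to become the RREF over QQ. A rough sketch of the idea (not the internal code):

from sympy import Matrix, ZZ

dM = Matrix([[2, 4], [6, 9]]).to_DM()           # DomainMatrix over ZZ
Mnum, den, pivots = dM.rref_den()               # fraction-free RREF over ZZ
M_rref = Mnum.to_Matrix() / ZZ.to_sympy(den)    # a single division gives the RREF over QQ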

I have done this now and just pushed the change. New timing plots show that Matrix more closely matches the timings of DMs_QQ e.g. to redo the two plots shown above:

./run_timings.py --matrices=ZZnn_s --n=10,20,50,100,200,500,1000,2000,5000,10000 --repeat=100 --plot --timeout=1 --bits=10 --exclude=DMd_QQ,DMd_ZZ2

[plot: rref_timings]

./run_timings.py --matrices=ZZd_d --n=1,2,3,4,5,6,7,8,9,10 --size=100 --repeat=10 --plot --timeout=1 --bits=10 --exclude=DMd_QQ,DMd_ZZ2

[plot: rref_timings]

The point to note is just how closely the blue and red lines match the orange line (when it is lower than the green line).

return M.from_rep(M_rref_sdm), den, pivots


def _dm_rref_den_ZZ(Mz, keep_domain=True):
Copy link
Member

Could this, and each of the other functions that start by selecting a method, accept a method kwarg, for users who want to choose the method themselves instead of accepting the heuristic choice?

Copy link
Collaborator Author

I guess they could. I might need to think of better names for the methods though rather than rref_QQ. In future the same kinds of methods would be considered for different domains like QQ(X) etc.

Copy link
Collaborator Author

Perhaps for rref_den it should be:

  1. GJ: Gauss-Jordan with division. Always returns a result in the associated field.
  2. FFGJ: Fraction-free Gauss-Jordan. Always returns a result in the same domain.
  3. CD: If a field, clear denominators and convert to the ring. Always returns a result in the associated ring.

Those methods make sense for rref_den. For rref we don't have the option to return a result in a ring because generally the output needs to be in a field. Perhaps then it could be the same method names but with a slightly different meaning:

  1. GJ: Convert to the field and use Gauss-Jordan with division. Always returns a result in the associated field.
  2. FFGJ: Use fraction-free Gauss-Jordan, then convert to the field and divide at the end. Always returns a result in the associated field.
  3. CD: If a field, clear denominators and convert to the ring, use FFGJ, then convert back to the field and divide at the end. Always returns a result in the associated field.

Maybe there are better names for these algorithms/approaches...

Copy link
Collaborator Author

I've added a method parameter e.g.:

In [2]: M = Matrix([[1,1,1],[1,2,3]])

In [3]: M.to_DM().rref(method='GJ')
Out[3]: (DomainMatrix({0: {0: 1, 2: -1}, 1: {1: 1, 2: 2}}, (2, 3), QQ), (0, 1))

Possible methods are: auto, GJ, FF, CD, GJ_dense, FF_dense, CD_dense.

The parameter is for both rref and rref_den.
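
For rref_den the same method names select the corresponding fraction-free variants, with the denominator returned explicitly. Continuing the session above (my own sketch; the assertion reflects the intended invariant rather than output I have pasted):

Mnum, den, pivots = M.to_DM().rref_den(method='FF')
# Mnum is over ZZ; dividing by den recovers the RREF over QQ:
assert Mnum.to_Matrix() / ZZ.to_sympy(den) == M.to_DM().rref()[0].to_Matrix()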

Copy link
Member

@sylee957 sylee957 Jul 30, 2023

accept a method kwarg, for users who want to choose the method themselves instead of accepting the heuristic choice?

My opinion is against that approach.
There is little demand for choosing methods like this, and little value in educating users to pick between them, unless the results are visibly very different.
I'd just use the simple functions 99% of the time.

It's also better to expose more lower-level handles and let users program with them than to implement many methods.

I've seen that people looking for 'deep' features are actually looking for things like direct numpy C wrappers, rather than fiddling with 'methods'.

Copy link
Collaborator Author

I've added the method parameter to DomainMatrix.rref but not to Matrix.rref. There is a tension here between having DomainMatrix methods implement specific algorithms and having them choose algorithms on behalf of the user. Until this PR every DomainMatrix method simply executed a specific algorithm. In the end though we will want to have, somewhere, something like an RREF method that can select the best algorithm when we know that different choices are better in different situations.

I can see though that method= arguments are awkward later because:

  1. Most users don't use them.
  2. A high-level operation like solve that calls e.g. Matrix.rref indirectly will not provide users any way to control the method anyway.
  3. The method arguments make it awkward to change the internal details in future because they expose the internals too much.

Many Matrix methods have arguments like iszerofunc or simplifyfunc. This makes it awkward to e.g. change the method so that it uses DomainMatrix even if there is a situation where DomainMatrix is definitely better than any combination of existing options.

I think that the difference between these methods is not quite the same as for the existing Matrix methods that have a method= argument, because here the differences are really about what fundamental arithmetic operations a domain supports. Also, we already need the code that selects a method and then dispatches based on it, regardless of whether we want to expose the method argument to users.

@oscarbenjamin
Copy link
Collaborator Author

In the case of QQ the situation is a little more complicated because, as well as size, shape, bit count and density, we also need to consider the bit counts of the denominators separately from the bit counts of the numerators. Also in this case we have more than two options that can be fastest in different cases:

  1. DMs_QQ: Use plain Gauss-Jordan with division over QQ. This is best for sparse matrices or for matrices with very large denominators.
  2. DMs_ZZc: Clear denominators and use fraction-free elimination over ZZ. This is fastest for dense matrices with large numerators and small denominators.
  3. DMs_QQff: Use fraction-free elimination over QQ. This is fastest for some cases in between the other two extremes.

Again the implementation here uses empirically defined cutoffs to choose between these different algorithms. Here is a case in which Gauss-Jordan over QQ is fastest (sparse matrices with 100 bit numerators and denominators):

./run_timings.py --matrices=QQnn_s --n=1,2,5,10,15,20,50,100,120,150,200,500,1000,2000 --repeat=10 --plot --timeout=10 --bits=100 --dbits=100 --exclude=DMd_QQ,DMd_ZZc,DMd_QQc,DMd_QQff,DMs_QQc

[plot: rref_timings]

Here is a situation where clearing denominators is fastest (dense matrices with 100 bit numerators and 2 bit denominators):

./run_timings.py --matrices=QQnn_d --n=1,2,5,10,15,20,50,100,120,150,200,500,1000,2000 --repeat=10 --plot --timeout=10 --bits=100 --dbits=2 --exclude=DMd_QQ,DMd_ZZc,DMd_QQc,DMd_QQff,DMs_QQc

[plot: rref_timings]

In this situation DMs_ZZc is only slightly faster than DMs_QQff. This is a very common case though: users often have matrices of rational numbers with small denominators so we want to have the best performance in that case if possible.

Here is a situation in which DMs_QQff is fastest and we can see that asymptotically DMs_QQff and DMs_ZZc diverge. This is when we have dense matrices with 1000 bit numerators and 20 bit denominators:

./run_timings.py --operation=rref --n=1,2,3,4,5,6,7,8,9,10,12,14,16,20,25,30,40,50 --plot --timeout=10 --matrices=QQnn_d --bits=1000 --dbits=20 --repeats=3 --exclude=DMd_QQ,DMd_ZZc,DMd_QQc,DMd_QQff,DMs_QQc

[plot: rref_timings]

Note that in this case the hybrid algorithm gets it wrong at 25x25 and starts using DMs_QQ even though DMs_QQff is still fastest. I am not sure exactly how to improve that but at least DMs_QQ and DMs_QQff appear to be asymptotically similar. In this case DMs_ZZc looks asymptotically slower and so it is important that we are not choosing that even if we don't necessarily make the best choice between the other two.

This plot shows the transition between the three different methods as the bit count of denominators is increased. These are dense 10x10 matrices with 1000 bit numerators and increasing denominator bit count:

./run_timings.py --operation=rref --n=1,2,5,10,20,50,100,200,500,1000,2000,5000 --plot --timeout=10 --matrices=QQd_b --size=10 --bits=10000 --repeats=3 --exclude=DMd_QQ,DMd_ZZc,DMd_QQc,DMd_QQff,DMs_QQc

[plot: rref_timings]

There is a regime towards the left where we have very small denominators and clearing denominators is used. In this region clearing denominators (DMs_ZZc) is just about fastest but not much faster than DMs_QQff. Then in the middle DMs_QQff is fastest and is used instead. Towards the right DMs_QQ becomes fastest and the hybrid method crosses over at approximately the right point. It is not completely clear from this plot, but if I were to extend it then I think it would become clear that DMs_QQ is asymptotically best as the denominator bit count increases, so the timing difference between the methods would keep growing if the plot were continued.

The threshold rules for QQ are basically:

  1. If the "density" is less than 5 then use DMs_QQ.
  2. If the lcm of the denominators has a bit count greater than 5 times the numerator bit count then use DMs_QQ.
  3. Otherwise if the lcm is small (< 50 bits) then clear denominators.
  4. Otherwise use DMs_QQff in between those extremes.
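
In code the rules amount to something like this (a hedged sketch: the function name and exact constants are illustrative, not the private implementation):

def choose_method_QQ(density, numer_bits, denom_lcm_bits):
    """Illustrative version of the QQ threshold rules listed above."""
    if density < 5:
        return 'GJ'    # sparse: Gauss-Jordan with division over QQ
    if denom_lcm_bits > 5 * numer_bits:
        return 'GJ'    # huge denominators: also GJ over QQ
    if denom_lcm_bits < 50:
        return 'CD'    # small denominators: clear them and work over ZZ
    return 'FF'        # in between: fraction-free Gauss-Jordan over QQ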

It is possible that these thresholds can be improved but I am aiming really just for robust separation of the extremes in which different asymptotic complexity occurs. The exact crossover points can easily change e.g. if any of the underlying algorithms is improved (or worsened!).

@skieffer
Copy link
Member

I think the docs would greatly benefit from a tutorial page that teaches about the different RREF methods, and the properties of matrices that may make one method better than another, and finally teaches the user how to try the different methods manually, for special cases where the heuristics may not make the best choice. This would go along with allowing the user to pass a method name as a keyword arg.

You've done a lot of research, and it would be great to codify what you've learned, in condensed and simplified form. While the heuristics do a great job of stitching together a probably-optimal default method, the code you've written has a lot of power that's not yet exposed to users, if they can't try out the various methods on demand.

@oscarbenjamin
Copy link
Collaborator Author

I think the docs would greatly benefit from a tutorial page that teaches about the different RREF methods

Maybe, since this adds an rref.py module to the docs, the module-level docstring could be used to explain some of this, or perhaps it just needs a separate page in the docs. It might also be best to move the different implementations ddm_irref, ddm_irref_den, sdm_irref and sdm_rref_den all into this module so that all the relevant code can be seen together (but I would prefer not to move them right now in this PR).

@oscarbenjamin
Copy link
Collaborator Author

oscarbenjamin commented Jul 30, 2023

I think the docs would greatly benefit from a tutorial page that teaches about the different RREF methods

I do think that this would be good but I would like to get this PR in now and add that later. The tutorial page that I am imagining would discuss a lot more than just RREF for ZZ and QQ which is what this PR is about. I already want to move on from this PR because I have other changes planned...

In the meantime I can answer any questions though if there is some part that does not immediately make sense.

@oscarbenjamin
Copy link
Collaborator Author

The code quality tests have failed with:

Run flake8 sympy
sympy/parsing/latex/_parse_latex_antlr.py:31:5: F811 redefinition of unused 'MathErrorListener' from line 13
Error: Process completed with exit code 1.

I guess this is some update to flake8 or something but I don't think it's related to any changes here.

@skieffer
Copy link
Member

I do think that this would be good but I would like to get this PR in now and add that later. The tutorial page that I am imagining would discuss a lot more than just RREF for ZZ and QQ which is what this PR is about.

That sounds fine.

@sylee957
Copy link
Member

I tried pyodide test 3 multiple times, but the tests are timing out.

@oscarbenjamin
Copy link
Collaborator Author

I tried pyodide test 3 multiple times, but the tests are timing out.

If you mean that you want it to pass before merging this then that is not necessary. That job is not "required" for merging. If you are worried that anything is failing there then you can see that the tests in the job have passed:

== 2950 passed, 75 skipped, 9302 deselected, 12 xfailed in 496.97s (0:08:16) ===
pytest finished with exit code 0
`pyodide.runPython` completed successfully
Error: The operation was canceled.

Somehow it times out at shutdown. There were problems with this when migrating to use pytest (I remember @brocksam wrestling with it for a while). At the time the problem seemed to get fixed but now something flakey in pyodide has brought it back. I'm certain that it is a pyodide bug but I'm not sure that we have any meaningful information that we could share with the pyodide folks to help them fix it (perhaps it is at least worth opening an issue).

@brocksam
Copy link
Contributor

brocksam commented Aug 1, 2023

Somehow it times out at shutdown. There were problems with this when migrating to use pytest (I remember @brocksam wrestling with it for a while). At the time the problem seemed to get fixed but now something flakey in pyodide has brought it back. I'm certain that it is a pyodide bug but I'm not sure that we have any meaningful information that we could share with the pyodide folks to help them fix it (perhaps it is at least worth opening an issue).

Nothing constructive to add here other than I corroborate everything @oscarbenjamin is saying. I recently spent time looking at this again but wasn't able to make any progress. It is a peculiar issue for sure. #25398 is open for it.

@oscarbenjamin
Copy link
Collaborator Author

I am a bit unclear about what the status of this PR is now. It seems like a very clear improvement that should just be merged. It makes some things massively faster and does not cause any problems like giving undesirable output etc.

Various suggestions have been made here and in gh-25452 about changes to the code. I'll try to address those here:

It was suggested that it is not good to move the code from methods on DomainMatrix to functions in the rref.py module. I disagree with this very directly. The rref module added here has 500 lines of code for choosing what algorithm to use and will grow larger as more domains are considered. It is far too big to just be all in methods on DomainMatrix and it is better to have a separate module where someone can see the details of particular algorithms in one place. Ideally the actual algorithms themselves (ddm_irref etc) would be moved into that module as well but I haven't done that here because I am trying to keep the changes as minimal as possible and not make it any harder to review. If the algorithms were moved in then the module would be 1000 lines. If more domains were considered it could easily grow further.

It was also suggested that a method= argument is a bad idea. I did not initially have that but added it at @skieffer's suggestion. Having explored this for a bit I am convinced that the argument is needed: it is impossible to predict the best algorithm perfectly, and while it is easy enough for me to stitch together the 2 or 3 function calls that make up each choice, it is not straightforward for most other people to figure out exactly how to do that. While there is something nice about having each method imply a particular algorithm, at the same time there needs to be a method/function somewhere that can select the best algorithm, so that it can be used by code that just wants to compute RREF and does not know what the best algorithm is. So we do need methods that can auto-select the algorithm as well as making it easy for users to choose one.

Adding the method= argument did complicate the code. I have simplified that a bit but it is not possible to make it much simpler: fundamentally there is just a certain amount of conditional logic that has to be there to make all of these pieces work, and you can either put it together in large functions or break it up into small functions, but the complexity won't disappear either way. I have tried to reduce repetition in the module because that was implicitly suggested. That does simplify things a little but not by much.

It was also suggested that the denominator and non-denominator codepaths could be combined by treating the denominator as 1 in the non-denominator case. That just doesn't work: I tried it and it was a mess. The problem is that the conditional logic for what you need to do in terms of choosing domains, converting, dividing etc is just very different in the two cases, and setting den = 1 does not change that. There are now two functions dm_rref and dm_rref_den; the logic in them might look similar but it really is not. The code is all different and the invariants (which I have now included in comments) are completely different. Merging the codepaths makes it very difficult to see what invariants are expected to hold at any point and to keep track of what is going on with switching between domains.

Other PRs that also bring substantial improvements but require more careful consideration will depend on this one, e.g. gh-25452. With this PR I tried to make it so that it is just an obvious, uncontroversial improvement that we can just merge.

The test failures are upstream bugs and are unrelated to this PR: they show up on all PRs. Any PR can still be merged regardless of those failures though so we don't need to worry about that.

If anyone would like to see changes here then can they be very explicit about what those changes would be? Otherwise can we just merge this now?

@sylee957
Copy link
Member

sylee957 commented Aug 1, 2023

It was suggested that it is not good to move the code from methods on DomainMatrix to functions in the rref.py module. I disagree with this very directly.

The only thing I specifically meant was the public functions inside the rref.py module (dm_rref, dm_rref_den). I'm still unclear why you have to provide the same thing as a function as well as a method.
I would expect a single place where users can access them, and if you copy-paste things like that, maintaining the documentation becomes a problem.

Comment on lines 1996 to 2026
def rref_gj_div(self):
"""Compute RREF using Gauss-Jordan elimination with division.

The domain must be a field.

Returns
=======

(DomainMatrix, list)
reduced-row echelon form and list of pivots for the DomainMatrix

See Also
========

convert_to, lu
rref
Compute RREF choosing the most efficient algorithm (possibly uses
this algorithm).
rref_den
Compute RREF with denominator using the most efficient algorithm
(possibly uses this algorithm).
sympy.polys.matrices.dense.ddm_irref
The underlying algorithm in the dense case.
sympy.polys.matrices.sdm.sdm_irref
The underlying algorithm in the sparse case.
"""
if not self.domain.is_Field:
raise DMNotAField('Not a field')
rref_ddm, pivots = self.rep.rref()
return self.from_rep(rref_ddm), tuple(pivots)

def rref_den_gj_ff(self):
Copy link
Member

@sylee957 sylee957 Aug 1, 2023

Also, I would prefer not to add these, especially since they are controlled by the method= option. I'm not sure if you left them in intentionally, but it looks like we shouldn't add them if the rref code is not actually using them.

@sylee957
Copy link
Member

sylee957 commented Aug 1, 2023

I'd also want to respond to some comments here
#25452 (comment)

I don't think that the dense keyword literally makes the code any shorter (you still have to write dense in GJ_dense or to_dense). Another argument is that it encourages bad practice for users, because they get an option to use something that performs worse and implicitly builds a throwaway dense/sparse matrix.

It can help slightly for some experiments, but I think it is better for users to know the matrix types and cast them appropriately in practice.

@skieffer
Copy link
Member

skieffer commented Aug 2, 2023

If anyone would like to see changes here then can they be very explicit about what those changes would be? Otherwise can we just merge this now?

LGTM

@oscarbenjamin
Copy link
Collaborator Author

The only thing I specifically meant was the public functions inside the rref.py module (dm_rref, dm_rref_den). I'm still unclear why you have to provide the same thing as a function as well as a method.

Oh, Okay. I'll rename them as private then and perhaps remove/simplify the docstrings.

@oscarbenjamin
Copy link
Collaborator Author

I don't think that the dense keyword literally makes the code any shorter (you still have to write dense in GJ_dense or to_dense). Another argument is that it encourages bad practice for users, because they get an option to use something that performs worse and implicitly builds a throwaway dense/sparse matrix.

It can help slightly for some experiments, but I think it is better for users to know the matrix types and cast them appropriately in practice.

Choosing an algorithm is not the same as choosing a type though. The functions here will always return the same type, e.g. dense if the input was dense, but the best algorithm might not be the dense one even if the input was dense. There can also be reasons for choosing an algorithm even when it is inefficient.

There are two basic ways of providing an API for algorithms like this:

  1. Each type and method name corresponds to a particular algorithm.
  2. Each method returns the requested object in the expected type but uses whatever algorithm is best internally.

So far DomainMatrix mostly follows approach 1. This PR changes the rref and rref_den methods to use approach 2: choose the best algorithm. That includes changing the type of the matrix from dense to sparse if it is probably faster to do so (which it is right now but that might change in future). This is needed because deciding automatically the best algorithm is nontrivial and e.g. the rest of the SymPy codebase needs to use RREF and wants it to be as fast as possible without having to duplicate all of the logic here for picking an algorithm.
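
For example, under approach 2 the representation of the result still matches the input even when a different internal representation was faster for the computation itself (an illustrative sketch; which algorithm actually runs depends on the heuristics):

from sympy import Matrix

dM = Matrix([[1, 2], [3, 4]]).to_DM().to_dense()  # dense DomainMatrix (DDM rep)
R, pivots = dM.rref()          # the sparse implementation may be used internally
print(type(R.rep) is type(dM.rep))  # True: the result comes back in dense form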

This change means that the caller no longer controls which exact algorithm is used simply by choosing the type and calling the rref method so they need some other way to control which is used. There are two ways that we can give the caller that control:

  • Have the method= parameter allow all possible algorithms and implementations like GJ_dense etc (i.e. make approach 2 from above flexible).
  • Provide separate methods like rref_gj_div that correspond directly to algorithms and do not do any automatic selection (as in approach 1 above).

I have implemented both of these options in this PR but your two comments above have objected to both of them. Given that we do want an rref method that can select the likely best algorithm (and it should absolutely be called rref), how do you propose that we allow the caller to control which algorithm is used when they do need that control?

@sylee957
Copy link
Member

sylee957 commented Aug 2, 2023

Have the method= parameter allow all possible algorithms and implementations like GJ_dense etc (i.e. make approach 2 from above flexible).
Provide separate methods like rref_gj_div that correspond directly to algorithms and do not do any automatic selection (as in approach 1 above).
I have implemented both of these options in this PR but your two comments above have objected to both of them.

I think that it's better to decide on only one than to ship everything. It confuses users if there are duplicate ways to do one thing.
I'd like to merge toward using a single rref with method because it makes the code and documentation more 'dense' in some way. I don't think that many methods need to be added to DomainMatrix.
(In fact, some extreme code standards literally restrict the number of public methods in a class, or treat having many as an antipattern.)

If there are multiple public-facing methods in there, it just means that we have to write duplicate documentation and test cases for things like rref_gj_div, and I'm sure that leads to other frustration eventually.

@oscarbenjamin
Copy link
Collaborator Author

In the last commit I renamed dm_rref to _dm_rref and removed the rref module from the docs. Instead I have added more detail in the rref and rref_den methods about the algorithms.

I also removed the redundant methods rref_gj_div and rref_gj_ff.

@oscarbenjamin
Copy link
Collaborator Author

I'd like to merge toward using a single rref with method because it makes the code and documentation more 'dense' in some way.

I think I've done what you intended here but this statement is a little ambiguous. I have made _dm_rref and _dm_rref_den private but they still contain the body that implements the methods in rref.py. I have also removed the other methods like rref_gj_div and rref_gj_ff. Now the only public methods are rref and rref_den and I have expanded their docstrings to describe all of the available methods and to link to other relevant functions.

If you mean that rref and rref_den should be merged then I disagree. They don't compute the same things, don't return the same things, don't have the same invariants and cannot be used interchangeably. The caller needs to know which of these they are using and needs to handle their outputs differently. Likewise the called code needs to know which of these the caller wanted so that it can produce the expected output in the most efficient way possible.

@sylee957 sylee957 merged commit e1869a4 into sympy:master Aug 2, 2023
@oscarbenjamin oscarbenjamin deleted the pr_rref branch August 2, 2023 18:43
@oscarbenjamin
Copy link
Collaborator Author

Thanks @sylee957 and @skieffer !
