Add an mzed_transpose function #8

Biffo89 · 2025-07-10T16:35:56Z

A fairly-optimised transpose function for matrices, as this is missing from Sage.

Transfers are made in order of the data in the source matrix to reduce cache misses
Buffers are used to reduce number of calls to __mzd_xor_bits, writes are done word by word
Bulk reads are done on each word in memory to reduce calls to __mzd_read_bits, then masks and shifts are applied after

Tests and benchmarks added, now memory safe with a speedup of ~4.5-12x over naive transpose.

malb · 2025-07-10T20:00:41Z

Cool, can you add tests?

malb · 2025-07-10T20:03:00Z

Since you're aiming for performance, maybe check how this compares with a transpose of a GF(2) matrix that has the same storage cost? Carlo Wood spent a lot of time making that one fast, if my memory serves.

malb · 2025-07-16T07:53:14Z

Thanks for the tests, they look good. What are the benchmark results for you?

Biffo89 · 2025-07-16T10:47:43Z

I realised I had a few mistakes in the benchmark file. The results for me are about a 4x speedup for all sizes over the naive version. The performance is admittedly slower than for GF(2) but still a good improvement. I'll try a further optimisation on the number of __mzd_read_bits calls, but I can't spend but more time on this at the moment as it's just instrumental to an addition to Sage which makes use of the transpose, and GF(2^e) was particularly slow.

......
Total running time: 0.552 seconds.
Sample size: 6; mean: 81136039; standard deviation: 387795
99% confidence interval: +/- 638333 (0.8%): [80497706..81774373]
e: 1, m: 8192, n: 8192, algo: optimised, cpu cycles: 81136039, cc/bit: 1.20902, wall time: 0.035220
................................................................
Total running time: 60.624 seconds.
Sample size: 64; mean: 513710525; standard deviation: 55561305
99% confidence interval: +/- 18437183 (3.6%): [495273342..532147708]
e: 2, m: 5793, n: 5793, algo: optimised, cpu cycles: 513710525, cc/bit: 7.65388, wall time: 0.222969
....................................
Total running time: 60.650 seconds.
Sample size: 36; mean: 2208722449; standard deviation: 119895354
99% confidence interval: +/- 54452471 (2.5%): [2154269978..2263174920]
e: 2, m: 5793, n: 5793, algo: naive, cpu cycles: 2208722449, cc/bit: 32.90820, wall time: 0.958652
..........................................................................................................................
Total running time: 60.316 seconds.
Sample size: 122; mean: 300793563; standard deviation: 37965874
99% confidence interval: +/- 8993658 (3.0%): [291799905..309787222]
e: 4, m: 4096, n: 4096, algo: optimised, cpu cycles: 300793563, cc/bit: 4.48217, wall time: 0.130557
.........................................................
Total running time: 60.511 seconds.
Sample size: 57; mean: 1593826232; standard deviation: 161739373
99% confidence interval: +/- 57063761 (3.6%): [1536762471..1650889992]
e: 4, m: 4096, n: 4096, algo: naive, cpu cycles: 1593826232, cc/bit: 23.74986, wall time: 0.691769
.......................................................................................................................................................................................................
Total running time: 60.073 seconds.
Sample size: 199; mean: 209458250; standard deviation: 75109151
99% confidence interval: +/- 13801440 (6.6%): [195656810..223259690]
e: 8, m: 2896, n: 2896, algo: optimised, cpu cycles: 209458250, cc/bit: 3.12184, wall time: 0.090915
..................................................................................................
Total running time: 60.574 seconds.
Sample size: 98; mean: 867007988; standard deviation: 86690940
99% confidence interval: +/- 22923520 (2.6%): [844084467..889931508]
e: 8, m: 2896, n: 2896, algo: naive, cpu cycles: 867007988, cc/bit: 12.92219, wall time: 0.376313
............................................................................................................................................................................................................................................................................................................
Total running time: 60.085 seconds.
Sample size: 300; mean: 168759155; standard deviation: 25998175
99% confidence interval: +/- 3873832 (2.3%): [164885323..172632987]
e: 16, m: 2048, n: 2048, algo: optimised, cpu cycles: 168759155, cc/bit: 2.51471, wall time: 0.073252
.....................................................................................................................................................
Total running time: 60.094 seconds.
Sample size: 149; mean: 638015204; standard deviation: 100339408
99% confidence interval: +/- 21416198 (3.4%): [616599006..659431402]
e: 16, m: 2048, n: 2048, algo: naive, cpu cycles: 638015204, cc/bit: 9.50717, wall time: 0.276921

malb · 2025-07-16T10:56:22Z

Cool, that's a tidy speed-up. I agree "perfect is the enemy of good enough" and I'm happy to merge this. For documentation, could you also compare:

for $e' \in [2,3,4]$ do

$e=1,n=4096,m=n \cdot e'$
$e=e',n=4096,m=4096$

This should be the same "stuff" in memory when ignoring structure? This gives people who want to improve stuff further in the future a sense of what should be possible.

I can also run it when I get around to it.

Biffo89 · 2025-07-16T12:58:55Z

I've added the final optimisation, bulking the reads as well as the writes. This gets the smaller fields performing a lot closer to the larger ones.

............................................................................................................
Total running time: 2.157 seconds.
Sample size: 108; mean: 16127904; standard deviation: 637916
99% confidence interval: +/- 160517 (1.0%): [15967387..16288421]
e: 1, m: 4096, n: 4096, algo: optimised, cpu cycles: 16127904, cc/bit: 0.96130, wall time: 0.007003
.................................................................................................................................................................................................................
Total running time: 8.630 seconds.
Sample size: 209; mean: 33174729; standard deviation: 1846588
99% confidence interval: +/- 330865 (1.0%): [32843864..33505593]
e: 1, m: 4096, n: 8192, algo: optimised, cpu cycles: 33174729, cc/bit: 0.98868, wall time: 0.014403
.......................................................................................................................................................
Total running time: 60.044 seconds.
Sample size: 151; mean: 79823588; standard deviation: 14346361
99% confidence interval: +/- 3040897 (3.8%): [76782691..82864485]
e: 2, m: 4096, n: 4096, algo: optimised, cpu cycles: 79823588, cc/bit: 2.37893, wall time: 0.034650
.......................................................................................
Total running time: 60.509 seconds.
Sample size: 87; mean: 788010942; standard deviation: 71298990
99% confidence interval: +/- 20065458 (2.5%): [767945484..808076400]
e: 2, m: 4096, n: 4096, algo: naive, cpu cycles: 788010942, cc/bit: 23.48456, wall time: 0.342022
.............................................................................................................................................
Total running time: 12.032 seconds.
Sample size: 141; mean: 63902321; standard deviation: 2890534
99% confidence interval: +/- 634926 (1.0%): [63267395..64537247]
e: 1, m: 4096, n: 16384, algo: optimised, cpu cycles: 63902321, cc/bit: 0.95222, wall time: 0.027740
................................................................................................................................................
Total running time: 60.402 seconds.
Sample size: 144; mean: 130753811; standard deviation: 11981954
99% confidence interval: +/- 2603224 (2.0%): [128150588..133357035]
e: 4, m: 4096, n: 4096, algo: optimised, cpu cycles: 130753811, cc/bit: 1.94838, wall time: 0.056754
..........................................................
Total running time: 60.005 seconds.
Sample size: 58; mean: 1548850990; standard deviation: 106076010
99% confidence interval: +/- 37084812 (2.4%): [1511766178..1585935802]
e: 4, m: 4096, n: 4096, algo: naive, cpu cycles: 1548850990, cc/bit: 23.07968, wall time: 0.672248
.............................................................................................................................................................................................................................................................................................................................................................
Total running time: 60.098 seconds.
Sample size: 349; mean: 129727419; standard deviation: 13486709
99% confidence interval: +/- 1861623 (1.4%): [127865796..131589042]
e: 1, m: 4096, n: 32768, algo: optimised, cpu cycles: 129727419, cc/bit: 0.96654, wall time: 0.056310
................................................................................
Total running time: 60.294 seconds.
Sample size: 80; mean: 359327627; standard deviation: 95849520
99% confidence interval: +/- 28204658 (7.8%): [331122969..387532285]
e: 8, m: 4096, n: 4096, algo: optimised, cpu cycles: 359327627, cc/bit: 2.67720, wall time: 0.155965
.................................
Total running time: 61.725 seconds.
Sample size: 33; mean: 3130671292; standard deviation: 235445777
99% confidence interval: +/- 112281556 (3.6%): [3018389736..3242952848]
e: 8, m: 4096, n: 4096, algo: naive, cpu cycles: 3130671292, cc/bit: 23.32532, wall time: 1.358803
...................................................................................................................................................
Total running time: 60.160 seconds.
Sample size: 147; mean: 284117327; standard deviation: 26567621
99% confidence interval: +/- 5710528 (2.0%): [278406799..289827855]
e: 1, m: 4096, n: 65536, algo: optimised, cpu cycles: 284117327, cc/bit: 1.05842, wall time: 0.123320
......................................................................................
Total running time: 60.411 seconds.
Sample size: 86; mean: 530672372; standard deviation: 44196951
99% confidence interval: +/- 12514531 (2.4%): [518157841..543186903]
e: 16, m: 4096, n: 4096, algo: optimised, cpu cycles: 530672372, cc/bit: 1.97691, wall time: 0.230332
........................................
Total running time: 60.640 seconds.
Sample size: 40; mean: 2510473879; standard deviation: 387311123
99% confidence interval: +/- 165828584 (6.6%): [2344645295..2676302462]
e: 16, m: 4096, n: 4096, algo: naive, cpu cycles: 2510473879, cc/bit: 9.35224, wall time: 1.089621

Biffo89 added 5 commits July 10, 2025 15:44

implement transpose for mzed_t

8536c5f

fix typo in error message

ea075c1

fix dimension bug

567ce57

fix bug buf overwritten

ebac1ce

optimise memory accesses by reading rows instead of columns

27e332b

add tests and benchmarks, some simplification of transpose

9b00c4c

Biffo89 added 2 commits July 16, 2025 11:49

remove extra loop in naive, fix typos

7522d06

remove extra transpose call, add comparison with GF(2)

ef0698b

bulk reads

0806dae

remove variable leftover from testing

9d012c0

Biffo89 mentioned this pull request Jul 16, 2025

Add temporary naive transpose for GF(2^e) sagemath/sage#40425

Open

5 tasks

malb merged commit 0b07e2b into malb:master Jul 22, 2025
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add an mzed_transpose function #8

Add an mzed_transpose function #8

Uh oh!

Biffo89 commented Jul 10, 2025 •

edited

Loading

Uh oh!

malb commented Jul 10, 2025

Uh oh!

malb commented Jul 10, 2025

Uh oh!

malb commented Jul 16, 2025

Uh oh!

Biffo89 commented Jul 16, 2025

Uh oh!

malb commented Jul 16, 2025

Uh oh!

Biffo89 commented Jul 16, 2025

Uh oh!

Uh oh!

Uh oh!

Add an mzed_transpose function #8

Add an mzed_transpose function #8

Uh oh!

Conversation

Biffo89 commented Jul 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

malb commented Jul 10, 2025

Uh oh!

malb commented Jul 10, 2025

Uh oh!

malb commented Jul 16, 2025

Uh oh!

Biffo89 commented Jul 16, 2025

Uh oh!

malb commented Jul 16, 2025

Uh oh!

Biffo89 commented Jul 16, 2025

Uh oh!

Uh oh!

Uh oh!

Biffo89 commented Jul 10, 2025 •

edited

Loading