Skip to content

Conversation

Biffo89
Copy link
Contributor

@Biffo89 Biffo89 commented Jul 10, 2025

A fairly-optimised transpose function for matrices, as this is missing from Sage.

  • Transfers are made in order of the data in the source matrix to reduce cache misses
  • Buffers are used to reduce number of calls to __mzd_xor_bits, writes are done word by word
  • Bulk reads are done on each word in memory to reduce calls to __mzd_read_bits, then masks and shifts are applied after

Tests and benchmarks added, now memory safe with a speedup of ~4.5-12x over naive transpose.

@malb
Copy link
Owner

malb commented Jul 10, 2025

Cool, can you add tests?

@malb
Copy link
Owner

malb commented Jul 10, 2025

Since you're aiming for performance, maybe check how this compares with a transpose of a GF(2) matrix that has the same storage cost? Carlo Wood spent a lot of time making that one fast, if my memory serves.

@malb
Copy link
Owner

malb commented Jul 16, 2025

Thanks for the tests, they look good. What are the benchmark results for you?

@Biffo89
Copy link
Contributor Author

Biffo89 commented Jul 16, 2025

I realised I had a few mistakes in the benchmark file. The results for me are about a 4x speedup for all sizes over the naive version. The performance is admittedly slower than for GF(2) but still a good improvement. I'll try a further optimisation on the number of __mzd_read_bits calls, but I can't spend but more time on this at the moment as it's just instrumental to an addition to Sage which makes use of the transpose, and GF(2^e) was particularly slow.

......
Total running time: 0.552 seconds.
Sample size: 6; mean: 81136039; standard deviation: 387795
99% confidence interval: +/- 638333 (0.8%): [80497706..81774373]
e: 1, m: 8192, n: 8192, algo: optimised, cpu cycles: 81136039, cc/bit: 1.20902, wall time: 0.035220
................................................................
Total running time: 60.624 seconds.
Sample size: 64; mean: 513710525; standard deviation: 55561305
99% confidence interval: +/- 18437183 (3.6%): [495273342..532147708]
e: 2, m: 5793, n: 5793, algo: optimised, cpu cycles: 513710525, cc/bit: 7.65388, wall time: 0.222969
....................................
Total running time: 60.650 seconds.
Sample size: 36; mean: 2208722449; standard deviation: 119895354
99% confidence interval: +/- 54452471 (2.5%): [2154269978..2263174920]
e: 2, m: 5793, n: 5793, algo: naive, cpu cycles: 2208722449, cc/bit: 32.90820, wall time: 0.958652
..........................................................................................................................
Total running time: 60.316 seconds.
Sample size: 122; mean: 300793563; standard deviation: 37965874
99% confidence interval: +/- 8993658 (3.0%): [291799905..309787222]
e: 4, m: 4096, n: 4096, algo: optimised, cpu cycles: 300793563, cc/bit: 4.48217, wall time: 0.130557
.........................................................
Total running time: 60.511 seconds.
Sample size: 57; mean: 1593826232; standard deviation: 161739373
99% confidence interval: +/- 57063761 (3.6%): [1536762471..1650889992]
e: 4, m: 4096, n: 4096, algo: naive, cpu cycles: 1593826232, cc/bit: 23.74986, wall time: 0.691769
.......................................................................................................................................................................................................
Total running time: 60.073 seconds.
Sample size: 199; mean: 209458250; standard deviation: 75109151
99% confidence interval: +/- 13801440 (6.6%): [195656810..223259690]
e: 8, m: 2896, n: 2896, algo: optimised, cpu cycles: 209458250, cc/bit: 3.12184, wall time: 0.090915
..................................................................................................
Total running time: 60.574 seconds.
Sample size: 98; mean: 867007988; standard deviation: 86690940
99% confidence interval: +/- 22923520 (2.6%): [844084467..889931508]
e: 8, m: 2896, n: 2896, algo: naive, cpu cycles: 867007988, cc/bit: 12.92219, wall time: 0.376313
............................................................................................................................................................................................................................................................................................................
Total running time: 60.085 seconds.
Sample size: 300; mean: 168759155; standard deviation: 25998175
99% confidence interval: +/- 3873832 (2.3%): [164885323..172632987]
e: 16, m: 2048, n: 2048, algo: optimised, cpu cycles: 168759155, cc/bit: 2.51471, wall time: 0.073252
.....................................................................................................................................................
Total running time: 60.094 seconds.
Sample size: 149; mean: 638015204; standard deviation: 100339408
99% confidence interval: +/- 21416198 (3.4%): [616599006..659431402]
e: 16, m: 2048, n: 2048, algo: naive, cpu cycles: 638015204, cc/bit: 9.50717, wall time: 0.276921

@malb
Copy link
Owner

malb commented Jul 16, 2025

Cool, that's a tidy speed-up. I agree "perfect is the enemy of good enough" and I'm happy to merge this. For documentation, could you also compare:

for $e' \in [2,3,4]$ do

  • $e=1,n=4096,m=n \cdot e'$
  • $e=e',n=4096,m=4096$

This should be the same "stuff" in memory when ignoring structure? This gives people who want to improve stuff further in the future a sense of what should be possible.

I can also run it when I get around to it.

@Biffo89
Copy link
Contributor Author

Biffo89 commented Jul 16, 2025

I've added the final optimisation, bulking the reads as well as the writes. This gets the smaller fields performing a lot closer to the larger ones.

............................................................................................................
Total running time: 2.157 seconds.
Sample size: 108; mean: 16127904; standard deviation: 637916
99% confidence interval: +/- 160517 (1.0%): [15967387..16288421]
e: 1, m: 4096, n: 4096, algo: optimised, cpu cycles: 16127904, cc/bit: 0.96130, wall time: 0.007003
.................................................................................................................................................................................................................
Total running time: 8.630 seconds.
Sample size: 209; mean: 33174729; standard deviation: 1846588
99% confidence interval: +/- 330865 (1.0%): [32843864..33505593]
e: 1, m: 4096, n: 8192, algo: optimised, cpu cycles: 33174729, cc/bit: 0.98868, wall time: 0.014403
.......................................................................................................................................................
Total running time: 60.044 seconds.
Sample size: 151; mean: 79823588; standard deviation: 14346361
99% confidence interval: +/- 3040897 (3.8%): [76782691..82864485]
e: 2, m: 4096, n: 4096, algo: optimised, cpu cycles: 79823588, cc/bit: 2.37893, wall time: 0.034650
.......................................................................................
Total running time: 60.509 seconds.
Sample size: 87; mean: 788010942; standard deviation: 71298990
99% confidence interval: +/- 20065458 (2.5%): [767945484..808076400]
e: 2, m: 4096, n: 4096, algo: naive, cpu cycles: 788010942, cc/bit: 23.48456, wall time: 0.342022
.............................................................................................................................................
Total running time: 12.032 seconds.
Sample size: 141; mean: 63902321; standard deviation: 2890534
99% confidence interval: +/- 634926 (1.0%): [63267395..64537247]
e: 1, m: 4096, n: 16384, algo: optimised, cpu cycles: 63902321, cc/bit: 0.95222, wall time: 0.027740
................................................................................................................................................
Total running time: 60.402 seconds.
Sample size: 144; mean: 130753811; standard deviation: 11981954
99% confidence interval: +/- 2603224 (2.0%): [128150588..133357035]
e: 4, m: 4096, n: 4096, algo: optimised, cpu cycles: 130753811, cc/bit: 1.94838, wall time: 0.056754
..........................................................
Total running time: 60.005 seconds.
Sample size: 58; mean: 1548850990; standard deviation: 106076010
99% confidence interval: +/- 37084812 (2.4%): [1511766178..1585935802]
e: 4, m: 4096, n: 4096, algo: naive, cpu cycles: 1548850990, cc/bit: 23.07968, wall time: 0.672248
.............................................................................................................................................................................................................................................................................................................................................................
Total running time: 60.098 seconds.
Sample size: 349; mean: 129727419; standard deviation: 13486709
99% confidence interval: +/- 1861623 (1.4%): [127865796..131589042]
e: 1, m: 4096, n: 32768, algo: optimised, cpu cycles: 129727419, cc/bit: 0.96654, wall time: 0.056310
................................................................................
Total running time: 60.294 seconds.
Sample size: 80; mean: 359327627; standard deviation: 95849520
99% confidence interval: +/- 28204658 (7.8%): [331122969..387532285]
e: 8, m: 4096, n: 4096, algo: optimised, cpu cycles: 359327627, cc/bit: 2.67720, wall time: 0.155965
.................................
Total running time: 61.725 seconds.
Sample size: 33; mean: 3130671292; standard deviation: 235445777
99% confidence interval: +/- 112281556 (3.6%): [3018389736..3242952848]
e: 8, m: 4096, n: 4096, algo: naive, cpu cycles: 3130671292, cc/bit: 23.32532, wall time: 1.358803
...................................................................................................................................................
Total running time: 60.160 seconds.
Sample size: 147; mean: 284117327; standard deviation: 26567621
99% confidence interval: +/- 5710528 (2.0%): [278406799..289827855]
e: 1, m: 4096, n: 65536, algo: optimised, cpu cycles: 284117327, cc/bit: 1.05842, wall time: 0.123320
......................................................................................
Total running time: 60.411 seconds.
Sample size: 86; mean: 530672372; standard deviation: 44196951
99% confidence interval: +/- 12514531 (2.4%): [518157841..543186903]
e: 16, m: 4096, n: 4096, algo: optimised, cpu cycles: 530672372, cc/bit: 1.97691, wall time: 0.230332
........................................
Total running time: 60.640 seconds.
Sample size: 40; mean: 2510473879; standard deviation: 387311123
99% confidence interval: +/- 165828584 (6.6%): [2344645295..2676302462]
e: 16, m: 4096, n: 4096, algo: naive, cpu cycles: 2510473879, cc/bit: 9.35224, wall time: 1.089621

@malb malb merged commit 0b07e2b into malb:master Jul 22, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants