goasm: Implement amd64 optimised code #28

pjbgf · 2023-01-23T22:06:29Z

The AMD64 partial optimisation does the 80 rounds of SHA1 in goasm, and also stores the compression state which is later assessed for collision in the Go code.
This provides a significant speedup over the pure Go implementation, while keeping the goasm implementation reasonably simple and well aligned with the standard Go SHA1 implementation.

Go benchmark results:

goos: linux
goarch: amd64
pkg: github.com/pjbgf/sha1cd/test
cpu: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
                                │       before        │                 after                      │
                                │       sec/op        │       sec/op         vs base               │
Hash8Bytes/sha1cd-16                     397.9n ± ∞ ¹          264.1n ± ∞ ¹  -33.63% (p=0.008 n=5)
Hash320Bytes/sha1cd-16                   2.130µ ± ∞ ¹          1.354µ ± ∞ ¹  -36.43% (p=0.008 n=5)
Hash1K/sha1cd-16                         5.832µ ± ∞ ¹          3.645µ ± ∞ ¹  -37.50% (p=0.008 n=5)
Hash8K/sha1cd-16                         43.56µ ± ∞ ¹          27.28µ ± ∞ ¹  -37.39% (p=0.008 n=5)
HashWithCollision/sha1cd-16              6.335µ ± ∞ ¹          4.588µ ± ∞ ¹  -27.58% (p=0.008 n=5)
¹ need >= 6 samples for confidence interval at level 0.95

                                │     before      │              after                    │
                                │       B/s       │      B/s        vs base               │
Hash8Bytes/sha1cd-16                19.18Mi ± ∞ ¹    28.90Mi ± ∞ ¹  +50.67% (p=0.008 n=5)
Hash320Bytes/sha1cd-16              143.3Mi ± ∞ ¹    225.4Mi ± ∞ ¹  +57.36% (p=0.008 n=5)
Hash1K/sha1cd-16                    167.5Mi ± ∞ ¹    267.9Mi ± ∞ ¹  +60.00% (p=0.008 n=5)
Hash8K/sha1cd-16                    179.3Mi ± ∞ ¹    286.4Mi ± ∞ ¹  +59.71% (p=0.008 n=5)
HashWithCollision/sha1cd-16         96.34Mi ± ∞ ¹   133.03Mi ± ∞ ¹  +38.08% (p=0.008 n=5)
¹ need >= 6 samples for confidence interval at level 0.95

Tests and benchmarks are now segregated by implementation. For benchmarks the comparison between Go's sha1, and the 3 different sha1cd implementations can be more easily observed:

BenchmarkHash8Bytes/sha1-16                      8186018               145.2 ns/op        55.11 MB/s           0 B/op          0 allocs/op
BenchmarkHash8Bytes/sha1cd_native-16             3309099               385.5 ns/op        20.75 MB/s         192 B/op          2 allocs/op
BenchmarkHash8Bytes/sha1cd_generic-16            2430501               495.0 ns/op        16.16 MB/s         192 B/op          2 allocs/op
BenchmarkHash8Bytes/sha1cd_cgo-16                1000000              1768 ns/op           4.53 MB/s        2688 B/op          1 allocs/op
BenchmarkHash320Bytes/sha1-16                    1932904               602.5 ns/op       531.13 MB/s           0 B/op          0 allocs/op
BenchmarkHash320Bytes/sha1cd_native-16            826519              1400 ns/op         228.58 MB/s         192 B/op          2 allocs/op
BenchmarkHash320Bytes/sha1cd_generic-16           600067              1828 ns/op         175.01 MB/s         192 B/op          2 allocs/op
BenchmarkHash320Bytes/sha1cd_cgo-16               623458              2664 ns/op         120.11 MB/s        2688 B/op          1 allocs/op
BenchmarkHash1K/sha1-16                          1186273              1010 ns/op        1013.73 MB/s           0 B/op          0 allocs/op
BenchmarkHash1K/sha1cd_native-16                  396523              2985 ns/op         343.09 MB/s         192 B/op          2 allocs/op
BenchmarkHash1K/sha1cd_generic-16                 242604              5307 ns/op         192.94 MB/s         192 B/op          2 allocs/op
BenchmarkHash1K/sha1cd_cgo-16                     309276              4273 ns/op         239.62 MB/s        2688 B/op          1 allocs/op
BenchmarkHash8K/sha1-16                           145369              8098 ns/op        1011.61 MB/s           0 B/op          0 allocs/op
BenchmarkHash8K/sha1cd_native-16                   45085             26536 ns/op         308.71 MB/s         192 B/op          2 allocs/op
BenchmarkHash8K/sha1cd_generic-16                  26968             44383 ns/op         184.57 MB/s         192 B/op          2 allocs/op
BenchmarkHash8K/sha1cd_cgo-16                      47247             23654 ns/op         346.33 MB/s        2688 B/op          1 allocs/op
BenchmarkHashWithCollision/sha1cd_native-16       249066              4769 ns/op         134.20 MB/s         192 B/op          2 allocs/op
BenchmarkHashWithCollision/sha1cd_generic-16      185652              6463 ns/op          99.02 MB/s         192 B/op          2 allocs/op
BenchmarkHashWithCollision/sha1cd_cgo-16          256411              4340 ns/op         147.46 MB/s        2688 B/op          1 allocs/op

This PR only introduces a native implementation for AMD64.

Relates to: #15

The AMD64 partial optimisation does the 80 rounds of SHA1 in goasm, and also stores the compression state which is later assessed for collision in the Go code. This provides a significant speedup over the pure Go implementation, while keeping the goasm implementation reasonably simple and well aligned with the standard Go SHA1 implementation. Go benchmark results: goos: linux goarch: amd64 pkg: github.com/pjbgf/sha1cd/test cpu: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz │ before │ after │ │ sec/op │ sec/op vs base │ Hash8Bytes/sha1cd-16 397.9n ± ∞ ¹ 264.1n ± ∞ ¹ -33.63% (p=0.008 n=5) Hash320Bytes/sha1cd-16 2.130µ ± ∞ ¹ 1.354µ ± ∞ ¹ -36.43% (p=0.008 n=5) Hash1K/sha1cd-16 5.832µ ± ∞ ¹ 3.645µ ± ∞ ¹ -37.50% (p=0.008 n=5) Hash8K/sha1cd-16 43.56µ ± ∞ ¹ 27.28µ ± ∞ ¹ -37.39% (p=0.008 n=5) HashWithCollision/sha1cd-16 6.335µ ± ∞ ¹ 4.588µ ± ∞ ¹ -27.58% (p=0.008 n=5) ¹ need >= 6 samples for confidence interval at level 0.95 │ before │ after │ │ B/op │ B/op vs base │ Hash8Bytes/sha1cd-16 0.0 ± ∞ ¹ 112.0 ± ∞ ¹ ? (p=0.008 n=5) Hash320Bytes/sha1cd-16 0.0 ± ∞ ¹ 112.0 ± ∞ ¹ ? (p=0.008 n=5) Hash1K/sha1cd-16 0.0 ± ∞ ¹ 112.0 ± ∞ ¹ ? (p=0.008 n=5) Hash8K/sha1cd-16 0.0 ± ∞ ¹ 112.0 ± ∞ ¹ ? (p=0.008 n=5) HashWithCollision/sha1cd-16 0.0 ± ∞ ¹ 112.0 ± ∞ ¹ ? (p=0.008 n=5) ¹ need >= 6 samples for confidence interval at level 0.95 │ before │ after │ │ allocs/op │ allocs/op vs base │ Hash8Bytes/sha1cd-16 0.000 ± ∞ ¹ 1.000 ± ∞ ¹ ? (p=0.008 n=5) Hash320Bytes/sha1cd-16 0.000 ± ∞ ¹ 1.000 ± ∞ ¹ ? (p=0.008 n=5) Hash1K/sha1cd-16 0.000 ± ∞ ¹ 1.000 ± ∞ ¹ ? (p=0.008 n=5) Hash8K/sha1cd-16 0.000 ± ∞ ¹ 1.000 ± ∞ ¹ ? (p=0.008 n=5) HashWithCollision/sha1cd-16 0.000 ± ∞ ¹ 1.000 ± ∞ ¹ ? (p=0.008 n=5) ¹ need >= 6 samples for confidence interval at level 0.95 │ before │ after │ │ B/s │ B/s vs base │ Hash8Bytes/sha1cd-16 19.18Mi ± ∞ ¹ 28.90Mi ± ∞ ¹ +50.67% (p=0.008 n=5) Hash320Bytes/sha1cd-16 143.3Mi ± ∞ ¹ 225.4Mi ± ∞ ¹ +57.36% (p=0.008 n=5) Hash1K/sha1cd-16 167.5Mi ± ∞ ¹ 267.9Mi ± ∞ ¹ +60.00% (p=0.008 n=5) Hash8K/sha1cd-16 179.3Mi ± ∞ ¹ 286.4Mi ± ∞ ¹ +59.71% (p=0.008 n=5) HashWithCollision/sha1cd-16 96.34Mi ± ∞ ¹ 133.03Mi ± ∞ ¹ +38.08% (p=0.008 n=5) geomean 142.8Mi 160.4Mi +12.33% ¹ need >= 6 samples for confidence interval at level 0.95 Signed-off-by: Paulo Gomes <pjbgf@linux.com>

The tests and benchmarks are now segregate by implementation, so it is easier to compare results and ensure regression for cgo, generic and native implementations. Signed-off-by: Paulo Gomes <pjbgf@linux.com>

pjbgf force-pushed the optimisations branch from a3505c9 to 54ba024 Compare January 23, 2023 22:49

pjbgf force-pushed the optimisations branch 2 times, most recently from eae3603 to cc4e80e Compare February 25, 2023 10:53

pjbgf added 2 commits February 25, 2023 11:59

tests: Segregate tests and benchmarks

95a7df9

The tests and benchmarks are now segregate by implementation, so it is easier to compare results and ensure regression for cgo, generic and native implementations. Signed-off-by: Paulo Gomes <pjbgf@linux.com>

pjbgf force-pushed the optimisations branch from cc4e80e to 95a7df9 Compare February 25, 2023 12:03

pjbgf merged commit a2b84bc into main Feb 25, 2023

pjbgf deleted the optimisations branch February 25, 2023 12:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

goasm: Implement amd64 optimised code #28

goasm: Implement amd64 optimised code #28

Uh oh!

pjbgf commented Jan 23, 2023 •

edited

Loading

Uh oh!

Uh oh!

goasm: Implement amd64 optimised code #28

goasm: Implement amd64 optimised code #28

Uh oh!

Conversation

pjbgf commented Jan 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

pjbgf commented Jan 23, 2023 •

edited

Loading