Skip to content

Conversation

peterdettman
Copy link
Contributor

  • In secp256k1_gej_split_exp, there are two divisions used. Since the denominator is a constant known at compile-time, each can be replaced by a multiplication followed by a right-shift (and rounding).
  • This change adds the constants g1, g2 for this purpose and rewrites secp256k1_gej_split_exp accordingly.

The technique is discussed, amongst other places, in the paper "Efficient Software Implementation of Public-Key Cryptography on Sensor Networks Using the MSP430X Microcontroller" (Gouvea, Oliveira, Lopez) - in section 4.3.

@sipa
Copy link
Contributor

sipa commented Jun 2, 2014

Nice, this removes the need for secp256k1_num_div entirely.

@sipa
Copy link
Contributor

sipa commented Jun 15, 2014

Can you:

  • Remove the secp256k1_num_div function and its implementations.
  • Add some commentary about how you came up with the numbers?

@peterdettman
Copy link
Contributor Author

@sipa Apologies for the delay, I'm a bit pushed for time at the moment. I'll return to these pull requests hopefully later this week.

@peterdettman
Copy link
Contributor Author

I've added some comments and removed _num_div, but not ready for merge yet as it's not giving the expected performance improvement. Bear with me while I investigate!

@peterdettman
Copy link
Contributor Author

The split itself is a lot faster, but there's some kind of one-off-error vs the original "bnc1/bnc2" values on a small proportion of inputs, so looking into that now.

@peterdettman
Copy link
Contributor Author

This one's got me a bit confused. Not the "one-off-error", which was actually just the rough edges of what is inherently an approximation (I've increased the precision so now the split is virtually always exactly the same as the old code). Rather, I can't see that performance (of bench) is actually improved.

Naively an improvement is expected since two divisions are replaced by simple truncations (right-shift). If the split code is wrapped in a (pointless) loop of say 10 iterations to try and give it a heavier "weight", then the difference is pretty clear: the new split is something like 1.8x as fast, and it shows in the bench results. However, for the unmodified bench, I think I am measuring the new code as very slightly slower (it's hard to pick the signal out from the noise here though).

So, I'm thinking it has something to do with needing to load g1/g2 (vs 'order') at each call i.e. some sort of memory/cacheing problem. This seems plausible as the split is only needed once per scalar mult, and so the constants could quite well be getting evicted between split calls. (Note that 'order', which is not used in the updated split, is used in the ECDSA code, so perhaps is being cached for the existing split code).

On that theory I tried a few changes like copying things into local variables, using gcc prefetch intrinsics, changing the layout in memory to correspond to the order of usage (g1,g2,a1b2,a2,b1), etc., but without much luck and usually just making things worse!

Anyway, I'd appreciate it if others could perhaps report their timings with/without this patch (of course it's only relevant when the endomorphism is used), and if you have any thoughts on what's going on here performance-wise, I'm all ears.

mpn_copyi(r->data, a->data + limbs, r->limbs);
} else {
mpn_rshift(r->data, a->data + limbs, r->limbs, left);
while (r->limbs>1 && r->data[r->limbs-1]==0) r->limbs--;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who not >0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing you're referring to the "r->limbs>1" test? _num_trunc is just _num_split modified to ignore the low bits (perhaps I could have just extended _num_shift to support large shifts). I gather that a zero valued num still has limbs==1, but in any case, I've just copied and modified... Please clarify if I've misunderstood the question.

@sipa
Copy link
Contributor

sipa commented Sep 1, 2014

Indeed, I notice a slight slowdown with this (around 0.4%). I wonder whether the advantage of not needing division in the bignum API isn't already worth it...

@theuni theuni mentioned this pull request Sep 11, 2014
@sipa
Copy link
Contributor

sipa commented Oct 26, 2014

@gmaxwell review requested

@peterdettman
Copy link
Contributor Author

@sipa I'm still very curious about the performance gap, but I think it's reasonable to merge this since the native num stuff appears to be progressing rapidly. I still expect there will be a net performance gain in batch operations, and we may yet find a way around the single-op deficit.

@peterdettman peterdettman force-pushed the split-without-div branch 5 times, most recently from 59691ab to 47131be Compare November 4, 2014 12:11
@peterdettman
Copy link
Contributor Author

Re-measuring with -O3 -march=native, it looks like this PR actually does improve performance (I measure very roughly 0.5% faster on bench_verify).

@peterdettman peterdettman force-pushed the split-without-div branch 2 times, most recently from 5278c94 to 7db5892 Compare November 18, 2014 03:45
- In secp256k1_gej_split_exp, there are two divisions used. Since the denominator is a constant known at compile-time, each can be replaced by a multiplication followed by a right-shift (and rounding).
- Add the constants g1, g2 for this purpose and rewrite secp256k1_gej_split_exp accordingly.
- Remove secp256k1_num_div since no longer used
@gmaxwell gmaxwell closed this Dec 7, 2014
tecnovert pushed a commit to tecnovert/secp256k1 that referenced this pull request Apr 18, 2022
Reject surjection proofs with trailing garbage
matteonardelli referenced this pull request in bancaditalia/secp256k1-frost Nov 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants