Avoid division when decomposing scalars #21
Conversation
Nice, this removes the need for secp256k1_num_div entirely.

Can you:
@sipa Apologies for the delay; I'm a bit pushed for time at the moment. I'll return to these pull requests, hopefully later this week.
I've added some comments and removed _num_div, but this isn't ready for merge yet, as it's not giving the expected performance improvement. Bear with me while I investigate!
The split itself is a lot faster, but there's some kind of off-by-one error versus the original "bnc1/bnc2" values on a small proportion of inputs, so I'm looking into that now.
This one's got me a bit confused. Not the off-by-one error, which turned out to be just the rough edges of what is inherently an approximation (I've increased the precision, so the split is now virtually always exactly the same as the old code). Rather, I can't see that performance (of bench) is actually improved.

Naively, an improvement is expected, since two divisions are replaced by simple truncations (right shifts). If the split code is wrapped in a (pointless) loop of, say, 10 iterations to give it a heavier "weight", the difference is pretty clear: the new split is something like 1.8x as fast, and it shows in the bench results. However, for the unmodified bench, I think I am measuring the new code as very slightly slower (though it's hard to pick the signal out from the noise here).

So, I'm thinking it has something to do with needing to load g1/g2 (vs 'order') at each call, i.e. some sort of memory/caching problem. This seems plausible, as the split is only needed once per scalar multiplication, so the constants could well be getting evicted between split calls. (Note that 'order', which is not used in the updated split, is used in the ECDSA code, so it is perhaps being kept cached for the existing split code.) On that theory I tried a few changes, like copying things into local variables, using gcc prefetch intrinsics, and changing the memory layout to match the order of usage (g1, g2, a1b2, a2, b1), but without much luck, and usually just making things worse!

Anyway, I'd appreciate it if others could report their timings with/without this patch (of course it's only relevant when the endomorphism is used), and if you have any thoughts on what's going on here performance-wise, I'm all ears.
    mpn_copyi(r->data, a->data + limbs, r->limbs);
} else {
    mpn_rshift(r->data, a->data + limbs, r->limbs, left);
    while (r->limbs > 1 && r->data[r->limbs-1] == 0) r->limbs--;
Why not >0?
I'm guessing you're referring to the "r->limbs>1" test? _num_trunc is just _num_split modified to ignore the low bits (perhaps I could have just extended _num_shift to support large shifts). I gather that a zero valued num still has limbs==1, but in any case, I've just copied and modified... Please clarify if I've misunderstood the question.
Force-pushed from fe0e4c4 to 84bcb08.
Indeed, I notice a slight slowdown with this (around 0.4%). I wonder whether the advantage of not needing division in the bignum API isn't already worth it...
@gmaxwell review requested |
@sipa I'm still very curious about the performance gap, but I think it's reasonable to merge this, since the native num stuff appears to be progressing rapidly. I still expect there will be a net performance gain in batch operations, and we may yet find a way around the single-op deficit.
Force-pushed from 59691ab to 47131be.

Force-pushed from 47131be to a403074.
Re-measuring with -O3 -march=native, it looks like this PR actually does improve performance (I measure very roughly 0.5% faster on bench_verify).
Force-pushed from 5278c94 to 7db5892.
- In secp256k1_gej_split_exp, two divisions are used. Since each denominator is a constant known at compile time, each division can be replaced by a multiplication followed by a right shift (and rounding).
- Add the constants g1, g2 for this purpose and rewrite secp256k1_gej_split_exp accordingly.
- Remove secp256k1_num_div, since it is no longer used.
Force-pushed from 7db5892 to 960f9a2.
The technique is discussed, among other places, in section 4.3 of the paper "Efficient Software Implementation of Public-Key Cryptography on Sensor Networks Using the MSP430X Microcontroller" (Gouvea, Oliveira, Lopez).
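For reference, the decomposition in question follows the standard GLV form (the sign conventions and symbol names below are the usual textbook ones and may differ in detail from this code's). Given the group order $n$, the endomorphism eigenvalue $\lambda$, and a short lattice basis $(a_1, b_1)$, $(a_2, b_2)$ with $a_i + b_i\lambda \equiv 0 \pmod{n}$:

```latex
c_1 = \left\lfloor \frac{b_2\,k}{n} \right\rceil, \qquad
c_2 = \left\lfloor \frac{-b_1\,k}{n} \right\rceil,
\qquad
k_1 = k - c_1 a_1 - c_2 a_2, \qquad
k_2 = -c_1 b_1 - c_2 b_2,
```

so that $k \equiv k_1 + k_2\lambda \pmod{n}$ with $k_1, k_2$ roughly half the bit length of $k$. The two rounded divisions by $n$ are exactly what this PR replaces with multiplications by precomputed constants (g1, g2, presumably scaled reciprocals of the form $\lfloor 2^m b_i / n \rceil$) followed by right shifts.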