-
Notifications
You must be signed in to change notification settings - Fork 37.7k
Add Muhash3072 implementation in Python #19105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
def modinv(a, n): | ||
"""Compute the modular inverse of a modulo n.""" | ||
t1, t2 = 0, 1 | ||
r1, r2 = n, a | ||
while r2 != 0: | ||
q = r1 // r2 | ||
t1, t2 = t2, t1 - q * t2 | ||
r1, r2 = r2, r1 - q * r2 | ||
if r1 > 1: | ||
return None | ||
if t1 < 0: | ||
t1 += n | ||
return t1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To simplify the code, one could just use Fermat's little theorem here to calculate the modular inverse. The drawback is that it's much slower than the extended Euclidean algorithm, calculating the modinv of a random 3072-bit number takes approx. 100-150ms on my machine, which is at least 1 order of magnitude slower. Not sure if that's an issue and in the case of tests whether performance or readability is more important :-)
def modinv(a, n): | |
"""Compute the modular inverse of a modulo n.""" | |
t1, t2 = 0, 1 | |
r1, r2 = n, a | |
while r2 != 0: | |
q = r1 // r2 | |
t1, t2 = t2, t1 - q * t2 | |
r1, r2 = r2, r1 - q * r2 | |
if r1 > 1: | |
return None | |
if t1 < 0: | |
t1 += n | |
return t1 | |
def modinv(a, n): | |
"""Compute the modular inverse of a prime modulo n using Fermat's little theorem.""" | |
return pow(a, n-2, n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I see your point, however, given how often these tests run on people's machines and in the CI environment, performance does matter quite a bit. But I think it's a great question to discuss during the PR review club next week :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if comprehension is a concern, I suspect that people who are familiar with modular inverses will generally understand both the euclidean and the fermat approach; and to people who aren't familiar with it both will look like black magic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fjahr @sipa: Fair enough! By the way, Python 3.8 introduced support for negative exponents in the modulo (three argument) pow()
(see https://bugs.python.org/issue36027, python/cpython#13266), internally calculating the modular inverse via the extended Euclidean algorithm. I.e. somewhere in the future it could be just a very simple:
return pow(a, -1, n)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's worth annotating in a TODO: // Python 3.8: return pow(a, -1, n)
(tested locally with Python 3.8.2)
I like having the simpler approach here. It's yet another sanity check that our implementation is correct, given the lack of test vectors. But given the performance impact, if it really matters compared to the rest of the test, better leave that as a TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have a unittest to confirm that modinv(a, n)
produces the same results as using pow(a, n-2, n)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@narula I added a unit test for this.
Added a super simple test that reimplements the C++ impl unit test in Python. I am working on more exhaustive tests for the next PR in this series. |
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be clearer to use the official interfaces defined in collections.abc
instead of inventing new names.
bytes384 = chacha20_32_to_384(data) | ||
return int.from_bytes(bytes384, 'little') | ||
|
||
class MuHash3072: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should implement MutableSet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this suggestion, but I am not convinced that this is a good idea. I think it would be confusing because this is not really a Set but a hash representing a Set. So there is no way for example to know about the inclusion of a specific value or the number of values already added. There is also no enforcement here that this is actually a Set. You could add the same value twice and it would work. We are just not sure if this breaks security guarantees and the way MuHash is intended to be used in Bitcoin Core, for now, this should not happen because the UTXO set is a Set. So doing this would send the wrong message I think. And I would also like to keep this class as plain as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could add the same value twice and it would work.
Sets commonly ignore the second insertion. The set in Python also works like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but that is the whole point: Muhash cannot ignore the second insertion. Adding an element the second time changes the internal state because there is no way to tell if an element is already in the Muhash or not.
self.numerator = 1 | ||
self.denominator = 1 | ||
|
||
def insert(self, data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the method is called add
in collections.abc.
"""Insert a byte array data in the set.""" | ||
self.numerator = (self.numerator * data_to_num3072(data)) % self.MODULUS | ||
|
||
def remove(self, data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the method is called discard
in collections.abc
4371b6d
to
ebd98da
Compare
Rebased and addressed all review comments. This is now also using SHA256 as discussed in #19055 . |
The chacha code also deserves its own file (and commit). |
@Sjors I'd agree if it was generically useful ChaCha20 code, but given that's it's a minimal specialized implementation just for 3072-bit outputs with IV 0, I'm less convinced. |
We'll need a python chacha implementation when BIP324 lands, and the functionality in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took @Sjors suggestions: moved the TODO comment to the util function and added chacha20 test vectors with nonce 0. |
re-ACK 352c702 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just a few minor comments inline.
Addressed @jnewbery 's review comments. |
utACK 36ec980 |
Code review ACK 36ec980 |
36ec980 test: Add chacha20 test vectors in muhash (Fabian Jahr) 0e2b400 test: Add basic Python/C++ Muhash implementation parity unit test (Fabian Jahr) b85543c test: Add Python MuHash3072 implementation to test framework (Pieter Wuille) ab30cec test: Move modinv to util and add unit test (Fabian Jahr) Pull request description: This is the second in a [series of pull requests](bitcoin#18000) to implement an Index for UTXO set statistics. This pull request adds a Python implementation of Muhash3072, a homomorphic hashing algorithm to be used for hashing the UTXO set. The Python implementation can then be used to compare behavior with the C++ version. ACKs for top commit: jnewbery: utACK 36ec980 laanwj: Code review ACK 36ec980 Tree-SHA512: a3519c6e11031174f1ae71ecd8bcc7f3be42d7fc9c84c77f2fbea7cfc5ad54fcbe10b55116ad8d9a52ac5d675640eefed3bf260c58a02f2bf3bc0d8ec208baa6
bits %= 32 # Make sure the term below does not throw an exception | ||
return ((v << bits) & 0xffffffff) | (v >> (32 - bits)) | ||
|
||
def chacha20_doubleround(s): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a little explainer video here for anyone else curious about ChaCha20: https://youtu.be/UeIpq-C-GSA
Summary: PR description: > This is the second in a series of pull requests to implement an Index for UTXO set statistics. > > This pull request adds a Python implementation of Muhash3072, a homomorphic hashing algorithm to be used for hashing the UTXO set. The Python implementation can then be used to compare behavior with the C++ version. > test: Move modinv to util and add unit test bitcoin/bitcoin@ab30cec > test: Add Python MuHash3072 implementation to test framework bitcoin/bitcoin@b85543c > test: Add basic Python/C++ Muhash implementation parity unit test bitcoin/bitcoin@0e2b400 > test: Add chacha20 test vectors in muhash bitcoin/bitcoin@36ec980 This is a backport of [[bitcoin/bitcoin#19105 | core#19105]] Backport note: the C++ implementation will be merged much later ([[bitcoin/bitcoin#19055 | core#19055]]) Test Plan: `ninja check-functional` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Differential Revision: https://reviews.bitcoinabc.org/D10166
This is the second in a series of pull requests to implement an Index for UTXO set statistics.
This pull request adds a Python implementation of Muhash3072, a homomorphic hashing algorithm to be used for hashing the UTXO set. The Python implementation can then be used to compare behavior with the C++ version.