Skip to content

Conversation

fjahr
Copy link
Contributor

@fjahr fjahr commented May 29, 2020

This is the second in a series of pull requests to implement an Index for UTXO set statistics.

This pull request adds a Python implementation of Muhash3072, a homomorphic hashing algorithm to be used for hashing the UTXO set. The Python implementation can then be used to compare behavior with the C++ version.

Comment on lines 8 to 20
def modinv(a, n):
"""Compute the modular inverse of a modulo n."""
t1, t2 = 0, 1
r1, r2 = n, a
while r2 != 0:
q = r1 // r2
t1, t2 = t2, t1 - q * t2
r1, r2 = r2, r1 - q * r2
if r1 > 1:
return None
if t1 < 0:
t1 += n
return t1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simplify the code, one could just use Fermat's little theorem here to calculate the modular inverse. The drawback is that it's much slower than the extended Euclidean algorithm, calculating the modinv of a random 3072-bit number takes approx. 100-150ms on my machine, which is at least 1 order of magnitude slower. Not sure if that's an issue and in the case of tests whether performance or readability is more important :-)

Suggested change
def modinv(a, n):
"""Compute the modular inverse of a modulo n."""
t1, t2 = 0, 1
r1, r2 = n, a
while r2 != 0:
q = r1 // r2
t1, t2 = t2, t1 - q * t2
r1, r2 = r2, r1 - q * r2
if r1 > 1:
return None
if t1 < 0:
t1 += n
return t1
def modinv(a, n):
"""Compute the modular inverse of a prime modulo n using Fermat's little theorem."""
return pow(a, n-2, n)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I see your point, however, given how often these tests run on people's machines and in the CI environment, performance does matter quite a bit. But I think it's a great question to discuss during the PR review club next week :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if comprehension is a concern, I suspect that people who are familiar with modular inverses will generally understand both the euclidean and the fermat approach; and to people who aren't familiar with it both will look like black magic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fjahr @sipa: Fair enough! By the way, Python 3.8 introduced support for negative exponents in the modulo (three argument) pow() (see https://bugs.python.org/issue36027, python/cpython#13266), internally calculating the modular inverse via the extended Euclidean algorithm. I.e. somewhere in the future it could be just a very simple:
return pow(a, -1, n)

Copy link
Member

@Sjors Sjors Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's worth annotating in a TODO: // Python 3.8: return pow(a, -1, n) (tested locally with Python 3.8.2)

I like having the simpler approach here. It's yet another sanity check that our implementation is correct, given the lack of test vectors. But given the performance impact, if it really matters compared to the rest of the test, better leave that as a TODO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have a unittest to confirm that modinv(a, n) produces the same results as using pow(a, n-2, n)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@narula I added a unit test for this.

@fjahr
Copy link
Contributor Author

fjahr commented Jun 2, 2020

Added a super simple test that reimplements the C++ impl unit test in Python. I am working on more exhaustive tests for the next PR in this series.

@DrahtBot
Copy link
Contributor

DrahtBot commented Jun 6, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

Copy link

@ysangkok ysangkok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be clearer to use the official interfaces defined in collections.abc instead of inventing new names.

bytes384 = chacha20_32_to_384(data)
return int.from_bytes(bytes384, 'little')

class MuHash3072:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should implement MutableSet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this suggestion, but I am not convinced that this is a good idea. I think it would be confusing because this is not really a Set but a hash representing a Set. So there is no way for example to know about the inclusion of a specific value or the number of values already added. There is also no enforcement here that this is actually a Set. You could add the same value twice and it would work. We are just not sure if this breaks security guarantees and the way MuHash is intended to be used in Bitcoin Core, for now, this should not happen because the UTXO set is a Set. So doing this would send the wrong message I think. And I would also like to keep this class as plain as possible.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add the same value twice and it would work.

Sets commonly ignore the second insertion. The set in Python also works like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but that is the whole point: Muhash cannot ignore the second insertion. Adding an element the second time changes the internal state because there is no way to tell if an element is already in the Muhash or not.

self.numerator = 1
self.denominator = 1

def insert(self, data):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the method is called add in collections.abc.

"""Insert a byte array data in the set."""
self.numerator = (self.numerator * data_to_num3072(data)) % self.MODULUS

def remove(self, data):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the method is called discard in collections.abc

@fjahr fjahr force-pushed the csi-2-muhash-py branch 4 times, most recently from 4371b6d to ebd98da Compare June 11, 2020 15:01
@fjahr
Copy link
Contributor Author

fjahr commented Jun 11, 2020

Rebased and addressed all review comments. This is now also using SHA256 as discussed in #19055 .

@Sjors
Copy link
Member

Sjors commented Jun 12, 2020

The chacha code also deserves its own file (and commit).

@sipa
Copy link
Member

sipa commented Jun 12, 2020

@Sjors I'd agree if it was generically useful ChaCha20 code, but given that's it's a minimal specialized implementation just for 3072-bit outputs with IV 0, I'm less convinced.

@jnewbery
Copy link
Contributor

We'll need a python chacha implementation when BIP324 lands, and the functionality in chacha20_doubleround() could be part of that, but until BIP324 I think it's fine to have it in the same file as the muhash code.

Copy link
Member

@Sjors Sjors left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 88997dd

You could add a test framework unit test for the chacha functions.

I don't mind merging this before the C++ implementation, though it might make sense to wait for #19105.

@fjahr fjahr force-pushed the csi-2-muhash-py branch from 88997dd to 352c702 Compare July 9, 2020 11:51
@fjahr
Copy link
Contributor Author

fjahr commented Jul 9, 2020

Took @Sjors suggestions: moved the TODO comment to the util function and added chacha20 test vectors with nonce 0.

@Sjors
Copy link
Member

Sjors commented Jul 9, 2020

re-ACK 352c702

Copy link
Contributor

@jnewbery jnewbery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just a few minor comments inline.

@fjahr fjahr force-pushed the csi-2-muhash-py branch from 352c702 to 36ec980 Compare July 16, 2020 16:23
@fjahr
Copy link
Contributor Author

fjahr commented Jul 16, 2020

Addressed @jnewbery 's review comments.

@jnewbery
Copy link
Contributor

utACK 36ec980

@laanwj
Copy link
Member

laanwj commented Sep 1, 2020

Code review ACK 36ec980

@laanwj laanwj merged commit 48c1083 into bitcoin:master Sep 1, 2020
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Sep 3, 2020
36ec980 test: Add chacha20 test vectors in muhash (Fabian Jahr)
0e2b400 test: Add basic Python/C++ Muhash implementation parity unit test (Fabian Jahr)
b85543c test: Add Python MuHash3072 implementation to test framework (Pieter Wuille)
ab30cec test: Move modinv to util and add unit test (Fabian Jahr)

Pull request description:

  This is the second in a [series of pull requests](bitcoin#18000) to implement an Index for UTXO set statistics.

  This pull request adds a Python implementation of Muhash3072, a homomorphic hashing algorithm to be used for hashing the UTXO set. The Python implementation can then be used to compare behavior with the C++ version.

ACKs for top commit:
  jnewbery:
    utACK 36ec980
  laanwj:
    Code review ACK 36ec980

Tree-SHA512: a3519c6e11031174f1ae71ecd8bcc7f3be42d7fc9c84c77f2fbea7cfc5ad54fcbe10b55116ad8d9a52ac5d675640eefed3bf260c58a02f2bf3bc0d8ec208baa6
bits %= 32 # Make sure the term below does not throw an exception
return ((v << bits) & 0xffffffff) | (v >> (32 - bits))

def chacha20_doubleround(s):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a little explainer video here for anyone else curious about ChaCha20: https://youtu.be/UeIpq-C-GSA

Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Sep 21, 2021
Summary:
PR description:
> This is the second in a series of pull requests to implement an Index for UTXO set statistics.
>
> This pull request adds a Python implementation of Muhash3072, a homomorphic hashing algorithm to be used for hashing the UTXO set. The Python implementation can then be used to compare behavior with the C++ version.

> test: Move modinv to util and add unit test
bitcoin/bitcoin@ab30cec

> test: Add Python MuHash3072 implementation to test framework
bitcoin/bitcoin@b85543c

> test: Add basic Python/C++ Muhash implementation parity unit test
bitcoin/bitcoin@0e2b400

> test: Add chacha20 test vectors in muhash
bitcoin/bitcoin@36ec980

This is a backport of [[bitcoin/bitcoin#19105 | core#19105]]

Backport note: the C++ implementation will be merged much later ([[bitcoin/bitcoin#19055 | core#19055]])

Test Plan: `ninja check-functional`

Reviewers: #bitcoin_abc, Fabien

Reviewed By: #bitcoin_abc, Fabien

Differential Revision: https://reviews.bitcoinabc.org/D10166
kwvg added a commit to kwvg/dash that referenced this pull request Feb 26, 2022
kwvg added a commit to kwvg/dash that referenced this pull request Feb 26, 2022
kwvg added a commit to kwvg/dash that referenced this pull request Apr 3, 2022
kwvg added a commit to kwvg/dash that referenced this pull request Apr 20, 2022
kwvg added a commit to kwvg/dash that referenced this pull request Apr 24, 2022
kwvg added a commit to kwvg/dash that referenced this pull request Apr 27, 2022
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.