Faster hash_func. No need for salts. #41
Conversation
@suranap - Thanks for this. We're going to do some testing and validation to confirm your proposed changes. Also, it seems like you copied some commits from master onto this branch. My suggestion, in terms of the easiest way to clean this up, would be to do a …
@mreiferson: Sorry, I'm still new to git & GitHub. Did I do it right? If not, you can grab the changes and apply them yourself. Git is giving me a headache.
no worries... looks great, thanks.
Hey! I've been reading up on linear combination this whole last week, and was going to open this same PR. Nice to see that @suranap was faster than me. Here are some ramblings you may find interesting:

I got very excited initially about the idea of saving hash time by using linear combination. However, there's not enough Adderall in all of Europe to let me finish reading the first paragraph of [0]. I'm going to spare you my rant on the current state of academia, but the math is fucking intentionally dense (especially for such un-intuitive results), and the graphed data is pants-on-head ridiculous.

...So I just went ahead and tried to reproduce all the results on my own, because that's really the only way I'd trust this math enough to run it in production. My plan was simple: I modified the SMhasher test suite so it would run uniform distribution and bias tests on values generated through linear combination, and compared these against the distribution results for values generated through randomized seeding.

This is... well. It puts me at ease, to be fair. The math checks out, and running these distribution tests locally also answers some practical implementation questions that the paper obviously doesn't consider. E.g. whether you can split the output of a given hash function in half and consider these two values to be independent for linear combination. The answer is yes, by the way. There's no statistical difference between the two sets of values.
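To make the split-in-half trick concrete, here is a minimal sketch of the construction under test, assuming a single 64-bit digest whose two halves stand in for two independent hash functions; every name here is illustrative rather than the library's or SMhasher's API:

```c
#include <stdint.h>
#include <stddef.h>

/* Split one 64-bit digest into two 32-bit halves (h1, h2) and
 * generate k table indexes by linear combination. */
static void indexes_from_one_digest(uint64_t digest, uint32_t *out,
                                    size_t k, uint32_t counts_per_func)
{
    uint32_t h1 = (uint32_t)digest;          /* low 32 bits  */
    uint32_t h2 = (uint32_t)(digest >> 32);  /* high 32 bits */
    for (size_t i = 0; i < k; i++) {
        /* g_i(x) = h1(x) + i * h2(x), reduced to the table size */
        out[i] = (h1 + (uint32_t)i * h2) % counts_per_func;
    }
}
```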
_HOWEVER_: the code in this PR is, in fact, not that much faster than the original! Have you run benchmarks? Take a look at the size of your … I've had very good results on my local benchmark by bringing back CityHash instead of Murmur (yes, this is karma biting me in the neck).

Note that in this PR we're wasting half of the output of Murmur (we're only combining the two initial 32-bit words, but we've computed 4 of them!). City, on the other hand, goes straight to 2 words and is special-cased for short strings. Basically, the same reasons that made it suck before (when we had a lot of different seeds) make it work great in this new situation. I'd strongly advise bringing back CityHash if you're going to go with linear combination. Otherwise the performance boost from this PR is not big enough to warrant the (possible?) decrease in accuracy.

Oh, and one more note: have you made sure that GCC is actually optimizing the linear-combination multiply out of the loop? You may want to perform the optimization by hand, just to be 100% sure there's no multiplication anywhere in the loop:

```c
for (i = 0; i < bloom->nfuncs; i++) {
    hashes[i] = roots[1] % bloom->counts_per_func;
    roots[1] += roots[3];
}
```
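For contrast, here is a sketch of the multiply-per-iteration form the hand optimization above replaces; the variable names follow the snippet above, and whether GCC strength-reduces this on its own depends on compiler version and flags:

```c
/* Naive equivalent: one multiply per hash index. A compiler may
 * strength-reduce this into the additive loop above, but writing
 * the additive form by hand removes any doubt. */
for (i = 0; i < bloom->nfuncs; i++) {
    hashes[i] = (roots[1] + i * roots[3]) % bloom->counts_per_func;
}
```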
Alright, I'm off to keep reading literature on this shit. Have you guys noticed that when you change the hash function to output an arbitrary number of bytes from the original digest through linear combination, you're effectively creating a PRNG? I wonder if the composed values could pass a Diehard test... Fucking computers man. They be cray.
I have a commit on a branch (not yet in a pull request) that enables the … Maybe I should break that out into something minimal we can merge before this, because it could affect the slight performance improvements we get here.

Bitly is quite open to performance improvements, but is more interested in the code reduction and simplification this gives; if this branch gives the same speed and the same accuracy, we're still interested. I plan to work on a couple of new tests and compare the accuracy before and after this "linear combination" changeset.
I did my own rough comparison between my "more_tests_stuff" branch (which is in pull request #44) and a branch where I merged that and this branch. Below, I call the former "ploxiln" and the latter "suranap". I ran my tests using 3 different word lists: "finnish" words, words I scraped from system libraries ("words3"), and both concatenated together ("words4"). Further, in the "words3" case I reduce the capacity to 30,000, and in a second "words4" case I reduce the error rate to 0.01.
It looks like this changeset makes it slightly faster in all cases, the accuracy is a wash, and the code simplification is a win. I vote we merge it. We can consider switching to CityHash after that. @suranap, please rebase this on the latest bitly/master; there's a conflict you have to resolve manually, but it's pretty easy: just delete …
@suranap thanks, one last request: can you squash this into one commit, please?
Faster hash_func. No need for salts.
The performance bottleneck is in calling the hash function repeatedly. The solution is to call it less frequently. This paper (http://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf) shows that you can replace the expensive calls to a hash function with the equation `g_i(x) = h1(x) + i * h2(x)`. Both h1 and h2 are generated by one call to the hash function. Then you just loop over that equation to generate all the hash values you need for the Bloom filter. In my limited testing, performance improves by about 20%.
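A minimal sketch of what such a hash_func could look like under this scheme, assuming a Murmur-style 128-bit hash as discussed in the thread above; the type and field names, seed constant, and forward declaration shown here are illustrative assumptions, not the exact patch:

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed provided elsewhere (this is SMhasher's signature): */
void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out);

typedef struct {              /* illustrative subset of the filter struct */
    unsigned int nfuncs;      /* number of hash functions, k    */
    uint32_t counts_per_func; /* slots addressable per function */
} counting_bloom_t;

void hash_func(counting_bloom_t *bloom, const char *key,
               size_t key_len, uint32_t *hashes)
{
    uint32_t checksum[4]; /* one 128-bit hash call yields four 32-bit words */
    MurmurHash3_x64_128(key, (int)key_len, 0x9747b28c /* arbitrary seed */, checksum);

    uint32_t h1 = checksum[0];
    uint32_t h2 = checksum[1];
    for (unsigned int i = 0; i < bloom->nfuncs; i++) {
        /* g_i(x) = h1(x) + i * h2(x)  (Kirsch & Mitzenmacher) */
        hashes[i] = (h1 + i * h2) % bloom->counts_per_func;
    }
}
```

Only checksum[0] and checksum[1] are used here, which is the "wasted half" of Murmur's output that the CityHash comment above points out.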