Vmg perf branch #39
Conversation
ready for review @Hines @mreiferson. The last commit makes a benchmark I did with 1 million words drop from 3.65 seconds to 3.45 seconds, consistently. (I first wrote a crazy clever version of my own to avoid rounding up the allocation; it only brought it down to 3.5 seconds, and was crazier...)
I'd like to squash that last commit into the first ftruncate() commit before merge.
I could also squash the "clean up type of bloom->hashes" commit into the "replace MD5 hash with Murmur" commit, if desired.
```c
/* rounding-up integer divide by 2 of bloom->size */
bloom->num_bytes = ((bloom->size + 1) / 2) + HEADER_BYTES;
/* round hashes allocation up to a multiple of 4 (for convenience in hash_func()) */
bloom->hashes = calloc((bloom->nfuncs + (4-1)) & (~(4-1)), sizeof(uint32_t));
```
Given that this is the init function (and, relatively speaking, not a considerable source of optimization), I'm not sure the "cleverness" of the code here is worth it.
The comments help, though.
(moved comment to pull req)
I agree. The reason I did it that way was conciseness, and so new_salts didn't have to be called first.
I added another commit to simplify this
I'd say just squash that comment update into the original.
Can I get one last nod of approval on my last change before I squash? (And I'm more inclined to squash the "type cleanup" into the hash function change, now that I think about it.)
instead of using lseek() repeatedly in a loop (original commit by Vicent Marti; rebased and tweaked by Pierce Lopez; any bugs added probably due to the latter)
new hash function should be much faster; salts should be better distributed now; clean up type of bloom->hashes and hash_func() (original commit by Vicent Marti; rebased and modified to use the 128-bit hash for salt generation by Pierce Lopez; any bugs added probably due to the latter)
more flexible and safe. Also, note that Python strings (and pretty much any dynamic language's) already keep length information. This will save us quite a few `strlen` calls. (original commit by Vicent Marti; rebased by Pierce Lopez; any bugs added probably due to the latter)
this way hash_func() can always do 4 hashes at a time; the simpler loop logic is a tiny bit faster; also cleans up num_salts handling a bit (original work by Vicent Marti; rebased and pulled out by Pierce Lopez; any bugs are probably due to the latter)
just squashed, ready for merge
```c
/* rounding-up integer divide by 4 of bloom->nfuncs */
bloom->num_salts = (bloom->nfuncs + 3) / 4;
/* effectively bloom->nfuncs rounded up to a multiple of 4 */
bloom->hashes = calloc(bloom->num_salts * 4, sizeof(uint32_t));
new_salts(bloom);
```
Your last commit looks good.

Also, do you think it makes sense for new_salts to have the following signature?

```c
uint32_t *new_salts(unsigned int num_salts);
```

and then, in this function, it would be used like:

```c
bloom->salts = new_salts(bloom->num_salts);
```

Essentially, I don't see any reason for it to have to take a specific structure as an argument. I realize this function is only used once, but this makes it a bit more agnostic to the type of bloom filter (otherwise I would argue it should be named counting_bloom_new_salts, or just be brought into counting_bloom_init).
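A sketch of what the suggested signature might look like in isolation (hypothetical: the rand()-based filler below is only a placeholder, since the actual patch derives salts from a hash):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical refactor of new_salts(): allocate and fill num_salts salts,
 * returning the array instead of writing into a bloom struct directly. */
uint32_t *new_salts(unsigned int num_salts)
{
    uint32_t *salts = calloc(num_salts, sizeof(uint32_t));
    if (salts == NULL) {
        return NULL;
    }
    for (unsigned int i = 0; i < num_salts; i++) {
        salts[i] = (uint32_t)rand();  /* placeholder; the patch derives salts from a hash */
    }
    return salts;
}
```

The caller would then do `bloom->salts = new_salts(bloom->num_salts);`, as suggested in the comment above.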
(we can postpone this for your larger cleanup branch)
yeah, I think I'd like to postpone this
nod
rebased and squashed the famous #19 from @vmg on recent master
deferred some struct member type and function return type cleanups to a later commit
At least some of the performance improvements are still present. On my machine, with a 731,000-word dictionary:

- make test goes from 7.1 seconds runtime to 1.7 seconds
- make test_pydablooms goes from 8.5 seconds runtime to 3.1 seconds