Skip to content
This repository was archived by the owner on Jan 23, 2025. It is now read-only.

Conversation

ploxiln
Copy link
Contributor

@ploxiln ploxiln commented Aug 14, 2012

rebased and squashed the famous #19 from @vmg on recent master
deferred some struct member type and function return type cleanups to a later commit

At least some of the performance improvements are still present. On my machine, with a 731000 word dictionary,
make test goes from 7.1 seconds runtime to 1.7 seconds runtime
make test_pydablooms goes from 8.5 seconds runtime to 3.1 seconds runtime

@ploxiln ploxiln mentioned this pull request Aug 14, 2012
@ploxiln
Copy link
Contributor Author

ploxiln commented Aug 14, 2012

ready for review @Hines @mreiferson

The last commit makes a benchmark I did with 1 million words drop from 3.65 seconds to 3.45 seconds, consistently. (I first wrote a crazy clever version of my own to avoid rounding up the allocation, and it only brought it down to 3.5 seconds, and was crazier...)

@ploxiln
Copy link
Contributor Author

ploxiln commented Aug 15, 2012

I'd like to squash that last commit into the first ftruncate() commit before merge

@ploxiln
Copy link
Contributor Author

ploxiln commented Aug 15, 2012

I could also squash the "clean up type of bloom->hashes" commit into the "replace MD5 hash with Murmur" commit, if desired

/* rounding-up integer divide by 2 of bloom->size */
bloom->num_bytes = ((bloom->size + 1) / 2) + HEADER_BYTES;
/* round hashes allocation up to a multiple of 4 (for convenience in hash_func()) */
bloom->hashes = calloc((bloom->nfuncs + (4-1)) & (~(4-1)), sizeof(uint32_t));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the fact that this is the init function (and relatively speaking is not a considerable source for optimization) I'm not sure the "cleverness" of the code here is worth it.

the comments help though.

(moved comment to pull req)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. The reason I did it that way was conciseness, and so new_salts() didn't have to be called first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added another commit to simplify this

@mreiferson
Copy link
Contributor

I'd say just squash that comment update into the original ftruncate change and your last bugfix (good catch) into 484c6cf

@ploxiln
Copy link
Contributor Author

ploxiln commented Aug 16, 2012

can I get one last nod of approval of my last change before I squash (and I'm more inclined to squash the "type cleanup" into the hash function change now that I think about it)

vmg added 4 commits August 15, 2012 22:04
instead of using lseek() repeatedly in a loop

(original commit by Vicent Marti, rebased and tweaked by Pierce Lopez,
any bugs added probably due to the latter)
new hash function should be much faster now
salts should be better distributed now
clean up type of bloom->hashes and hash_func()

(original commit by Vicent Marti, rebased and modified to use the
128-bit hash for salt generation by Pierce Lopez, any bugs added
probably due to the latter)
more flexible and safe

Also, note that Python strings (and pretty much any dynamic language)
already keep length information. This will save us quite a few
`strlen` calls.

(original commit by Vicent Marti, rebased by Pierce Lopez, any bugs
added probably due to the latter)
this way hash_func() can always do 4 hashes at a time
simpler loop logic is a tiny bit faster

also clean up num_salts handling a bit

(original work by Vicent Marti, rebased and pulled out by
Pierce Lopez, any bugs are probably due to the latter)
@ploxiln
Copy link
Contributor Author

ploxiln commented Aug 16, 2012

just squashed, ready for merge

/* rounding-up integer divide by 4 of bloom->nfuncs */
bloom->num_salts = (bloom->nfuncs + 3) / 4;
/* effectively bloom->nfuncs rounded up to a multiple of 4 */
bloom->hashes = calloc(bloom->num_salts * 4, sizeof(uint32_t));
new_salts(bloom);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your last commit looks good.

Also, do you think it makes sense for new_salts to have the following signature?

uint32_t *new_salts(unsigned int num_salts);

and then, in this function it would be used like:

bloom->salts = new_salts(bloom->num_salts);

Essentially, I don't see any reason for it to have to take a specific structure as an argument. I realize this function is only used once but this makes it a bit more agnostic to the type of bloom filter (otherwise I would argue it should be named counting_bloom_new_salts or just be brought into counting_bloom_init)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(we can postpone this for your larger cleanup branch)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think I'd like to postpone this

@mreiferson
Copy link
Contributor

nod

mreiferson added a commit that referenced this pull request Aug 16, 2012
@mreiferson mreiferson merged commit a9c386b into bitly:master Aug 16, 2012
@mreiferson mreiferson mentioned this pull request Aug 16, 2012
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants