Fix TagBloomFilter behavior after updating the crate to 2.0 #6633
The `Bloom` filter no longer behaves as a ring buffer after the update to 2.0. This has been addressed in the `TagBloomFilter` wrapper, which ensures that the count of entries in the filter is correctly tracked. Also, the false-positive rate was decreased to 0.001% and the capacity is kept at 10M replay tags. Unit tests have been added to verify the new required behavior, and a benchmark was added for Bloom filter performance measurements.
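The wrap-around behavior described above can be sketched as follows. This is a minimal, self-contained illustration of the counting-and-reset idea only, using a `HashSet` as a stand-in for the probabilistic `Bloom` filter (the real `TagBloomFilter` wraps the `bloomfilter` crate with a 10M capacity and a 0.001% false-positive rate); the `CountedFilter` name and the tiny capacity are hypothetical:

```rust
use std::collections::HashSet;

// Hypothetical stand-in: a HashSet plays the role of the probabilistic
// `Bloom` filter so the counting/reset logic can run self-contained.
// The real filter has false positives; this sketch has none.
struct CountedFilter {
    set: HashSet<Vec<u8>>,
    count: usize,
    capacity: usize,
}

impl CountedFilter {
    fn new(capacity: usize) -> Self {
        Self { set: HashSet::new(), count: 0, capacity }
    }

    // Returns true if `tag` was already present, otherwise inserts it.
    // A new tag arriving at full capacity wipes the filter and starts
    // over, mimicking the wrap-around behavior the wrapper restores.
    fn check_and_set(&mut self, tag: &[u8]) -> bool {
        if self.count == self.capacity {
            let present = self.set.contains(tag);
            if !present {
                self.set.clear();
                self.set.insert(tag.to_vec());
                self.count = 1;
            }
            present
        } else {
            let newly_inserted = self.set.insert(tag.to_vec());
            if newly_inserted {
                self.count += 1;
            }
            !newly_inserted
        }
    }
}

fn main() {
    let mut f = CountedFilter::new(3);
    assert!(!f.check_and_set(b"a"));
    assert!(f.check_and_set(b"a")); // duplicate, count unchanged
    assert!(!f.check_and_set(b"b"));
    assert!(!f.check_and_set(b"c"));
    assert_eq!(f.count, 3); // at capacity
    assert!(!f.check_and_set(b"d")); // new tag at capacity: reset
    assert_eq!(f.count, 1);
    assert!(!f.check_and_set(b"a")); // "a" was dropped by the reset
    println!("ok");
}
```

The point of the sketch is the invariant the wrapper maintains: `count` only grows for genuinely new entries, and a new entry at full capacity restarts the filter with `count = 1`.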
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
common/internal-types/benches/bloom_filter.rs (2)
Lines 6-9: Consider pre-generating random data for initialization. Creating random bytes in a tight loop during initialization might affect system resources. Consider pre-generating a vector of random bytes before the benchmark.
```diff
-    for _ in 1..bloom.capacity() {
-        bloom.set(&hopr_crypto_random::random_bytes());
+    let random_data: Vec<_> = (0..bloom.capacity())
+        .map(|_| hopr_crypto_random::random_bytes())
+        .collect();
+    for tag in random_data {
+        bloom.set(&tag);
```
Lines 14-16: Consider adjusting benchmark parameters based on the false-positive rate. Given that the PR mentions a false-positive rate of 0.001%, the sample size of 100,000 might need adjustment to effectively capture this rate in the benchmarks.
Consider increasing the sample size or adding specific benchmarks to measure false positive occurrences.
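As background for sizing such a benchmark, the standard Bloom filter formulas give the scale implied by the stated parameters (10M items at a 1e-5 false-positive rate). This is a back-of-the-envelope sketch, independent of the actual crate's internals:

```rust
fn main() {
    let n = 10_000_000f64; // expected items (10M replay tags)
    let p = 1e-5f64;       // target false-positive rate (0.001%)

    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    let m = -n * p.ln() / (2f64.ln().powi(2));
    // Optimal number of hash functions: k = (m / n) * ln 2
    let k = (m / n) * 2f64.ln();

    println!("bits: {:.0} (~{:.1} MB)", m, m / 8.0 / 1e6);
    println!("bits per item: {:.2}", m / n);
    println!("hash functions: {:.1} (use {})", k, k.ceil());
}
```

For these parameters the formulas give roughly 24 bits per item (about 30 MB of filter state) and 17 hash functions, which explains why observing the 1e-5 rate empirically needs far more than 100,000 samples.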
common/internal-types/src/protocol.rs (2)
Lines 163-172: Consider adding size validation in from_bytes. While the implementation is correct, consider adding validation for the deserialized size to ensure it matches the expected capacity.
```diff
 pub fn from_bytes(data: &[u8]) -> Result<Self> {
-    bincode::deserialize(data).map_err(|e| CoreTypesError::ParseError(e.to_string()))
+    let filter: Self = bincode::deserialize(data)
+        .map_err(|e| CoreTypesError::ParseError(e.to_string()))?;
+    if filter.count > filter.capacity {
+        return Err(CoreTypesError::ParseError("Invalid count exceeds capacity".into()));
+    }
+    Ok(filter)
 }
```
Lines 326-368: LGTM! Comprehensive test coverage with room for improvement. The new tests thoroughly verify:
- Count management
- Capacity limits
- Wrap-around behavior
- Edge cases
Consider adding tests for:
- Concurrent access scenarios
- Serialization/deserialization with various capacities
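A concurrent-access test of the kind suggested above could be structured as in the following sketch. Since nothing in the diff shows `TagBloomFilter` to be `Sync`, this uses a `Mutex`-wrapped `HashSet` as a hypothetical stand-in, to demonstrate the test shape rather than the real type:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Hypothetical stand-in for a shared filter; the real type would
    // need its own synchronization strategy.
    let filter = Arc::new(Mutex::new(HashSet::<u64>::new()));

    let handles: Vec<_> = (0..4u64)
        .map(|t| {
            let filter = Arc::clone(&filter);
            thread::spawn(move || {
                // Each thread inserts a disjoint range of "tags".
                for i in 0..1_000u64 {
                    filter.lock().unwrap().insert(t * 1_000 + i);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }

    // All 4 * 1000 distinct tags must be present, none lost.
    assert_eq!(filter.lock().unwrap().len(), 4_000);
    println!("concurrent insert test passed");
}
```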
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by `!**/*.lock`
📒 Files selected for processing (3)
- common/internal-types/Cargo.toml (1 hunks)
- common/internal-types/benches/bloom_filter.rs (1 hunks)
- common/internal-types/src/protocol.rs (4 hunks)
🔇 Additional comments (8)
common/internal-types/benches/bloom_filter.rs (3)
Lines 1-3: LGTM! Imports are well-organized and appropriate.
The imports include all necessary Criterion features for benchmarking and correctly import TagBloomFilter from the prelude.
Lines 25-26: LGTM! Criterion setup is correct.
The benchmark configuration follows Criterion best practices.
Lines 6-7: Verify the capacity matches requirements.
The PR objectives mention a capacity of 10 million replay tags. Let's verify this is correctly configured.
✅ Verification successful
Let me search for the actual implementation of TagBloomFilter to verify the capacity configuration.
Let me search for the TagBloomFilter implementation to verify its capacity configuration.
Capacity matches requirements
The TagBloomFilter implementation has a DEFAULT_MAX_ITEMS constant set to 10 million (10_000_000) entries, which exactly matches the PR objectives. The implementation also includes proper handling of capacity limits with clear warning messages when capacity is reached.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Search for capacity configuration in the codebase
rg -A 5 "TagBloomFilter.*new|TagBloomFilter.*default"
```
Length of output: 2843
Script:
```shell
#!/bin/bash
# Search for TagBloomFilter implementation
ast-grep --pattern 'struct TagBloomFilter {
  $$$
}'
# Also search for any constants or configurations related to capacity
rg -A 5 "const.*CAPACITY|BLOOM_FILTER.*CAPACITY"
```
Length of output: 190662
Script:
```shell
#!/bin/bash
# Search for TagBloomFilter implementation
ast-grep --pattern 'impl TagBloomFilter {
  $$$
}'
# Also search for any constants or configurations related to TagBloomFilter capacity
rg "TagBloomFilter.*capacity|BLOOM_FILTER.*CAPACITY|BLOOM.*CAPACITY" -A 5
```
Length of output: 6798
common/internal-types/Cargo.toml (2)
Line 39: LGTM: Appropriate addition of the criterion dependency.
The criterion dependency is correctly added to dev-dependencies and follows the workspace versioning pattern, which is ideal for benchmarking purposes.
Lines 48-50: LGTM: Well-configured benchmark setup. The benchmark configuration is properly set up with `harness = false`, as required for criterion benchmarks. This will enable accurate performance measurements of the TagBloomFilter's new behavior.
common/internal-types/src/protocol.rs (3)
Lines 100-109: LGTM! The changes align with the PR objectives. The addition of the `capacity` field and the reduction in the false-positive rate to 0.001% match the stated goals. The rename of `MAX_ITEMS` to `DEFAULT_MAX_ITEMS` better reflects its purpose.
Lines 116-118: LGTM! The capacity-related methods are well implemented. The getter and constructor methods are simple, correct, and properly initialize all fields. Also applies to: lines 175-181.
Lines 140-159: Verify that the new check_and_set behavior handles edge cases correctly.
The implementation looks correct but has complex logic for handling full capacity scenarios. The count is properly maintained in both cases:
- At full capacity: Only resets if the tag is not present
- Below capacity: Increments count only for new items
Let's verify the behavior with edge cases:
✅ Verification successful
Let me analyze the code further by searching for tests related to the Bloom filter implementation.
Let me check the actual Bloom filter implementation and its tests to verify the behavior.
Based on the code review and tests, I can now provide a conclusive response about the check_and_set behavior:

The check_and_set implementation correctly handles edge cases and maintains the count. The implementation has been verified through the test cases and code inspection:

- The count is properly maintained: it increments only for new items and stays the same for existing items; the test `tag_bloom_filter_count` confirms this behavior.
- Capacity handling is correct: when at capacity, the filter only checks and conditionally resets; a reset clears the filter and starts fresh with count = 1; the test `tag_bloom_filter_wrap_around` verifies this behavior.
- The implementation matches the test expectations and handles all edge cases appropriately, including initial insertion, duplicate checks, capacity limits, and filter reset.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Search for tests that verify edge cases in check_and_set behavior
# Look for test cases involving capacity and count management
rg -l "check_and_set.*capacity|count" --type rust
```
Length of output: 227405
Script:
```shell
#!/bin/bash
# Search for tests related to the Bloom filter implementation
rg -l "bloom.*test|test.*bloom" --type rust
# Search for the actual Bloom filter implementation
rg -l "struct.*Bloom|impl.*Bloom" --type rust
```
Length of output: 467
Script:
```shell
#!/bin/bash
# Check the Bloom filter implementation and tests
cat transport/protocol/src/bloom.rs common/internal-types/src/protocol.rs vendor/cargo/bloomfilter-2.0.0/tests/bloom.rs
```
Length of output: 15512
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
common/internal-types/benches/bloom_filter.rs (3)
Lines 18-20: Document benchmark configuration choices. Consider adding comments explaining why these specific values were chosen:
- Sample size of 100,000
- Throughput of 1 element
```diff
 let mut group = c.benchmark_group("tag_bloom_filter");
+// Large sample size to ensure statistical significance
 group.sample_size(100_000);
+// Each operation processes one tag
 group.throughput(Throughput::Elements(1));
```
Lines 22-26: Consider expanding benchmark coverage. The benchmark only measures `check` operations. Consider adding benchmarks for:
- `set` operations
- Multiple consecutive operations to measure any caching effects
```diff
 group.bench_function("check", |b| {
     b.iter(|| {
         bloom.check(black_box(&existing_tag));
     })
 });
+
+group.bench_function("set", |b| {
+    b.iter(|| {
+        let tag = hopr_crypto_random::random_bytes();
+        bloom.set(black_box(&tag));
+    })
+});
```
Lines 28-33: Use more realistic test data for non-existent tags. Using a zeroed array might not represent real-world scenarios well and could lead to optimizations that wouldn't occur in production.
```diff
-let non_existent_tag = [0u8; PACKET_TAG_LENGTH];
+// Generate a random tag and ensure it's not in the filter
+let mut non_existent_tag = hopr_crypto_random::random_bytes();
+while bloom.check(&non_existent_tag) {
+    non_existent_tag = hopr_crypto_random::random_bytes();
+}
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (1)
- common/internal-types/benches/bloom_filter.rs (1 hunks)
🔇 Additional comments (1)
common/internal-types/benches/bloom_filter.rs (1)
Lines 7-12: Consider improving benchmark initialization robustness.
The current initialization has several potential issues:
- Setting `tag[0] = 0xaa` reduces entropy and may not represent real-world usage patterns
- There's no verification that the filter reaches the expected capacity (10M as per the PR description)
Consider this improvement:
```diff
 let mut bloom = TagBloomFilter::default();
+assert_eq!(bloom.capacity(), 10_000_000, "Unexpected filter capacity");
 for _ in 1..bloom.capacity() {
-    let mut tag = hopr_crypto_random::random_bytes();
-    tag[0] = 0xaa;
+    let tag = hopr_crypto_random::random_bytes();
     bloom.set(&tag);
 }
```
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
common/internal-types/src/protocol.rs (1)
Lines 139-157: Consider improving readability with early returns. The logic is correct but could be more readable. Consider restructuring with early returns:
```diff
 pub fn check_and_set(&mut self, tag: &PacketTag) -> bool {
-    if self.count == self.capacity {
-        let is_present = self.bloom.check(tag);
-        if !is_present {
-            warn!("maximum number of items in the Bloom filter reached!");
-            self.bloom.clear();
-            self.bloom.set(tag);
-            self.count = 1;
-        }
-        is_present
-    } else {
-        let was_present = self.bloom.check_and_set(tag);
-        if !was_present {
-            self.count += 1;
-        }
-        was_present
+    // Handle full capacity case
+    if self.count == self.capacity {
+        let is_present = self.bloom.check(tag);
+        if is_present {
+            return true;
+        }
+
+        // Reset filter for new item
+        warn!("maximum number of items in the Bloom filter reached!");
+        self.bloom.clear();
+        self.bloom.set(tag);
+        self.count = 1;
+        return false;
     }
+
+    // Handle normal case
+    let was_present = self.bloom.check_and_set(tag);
+    if !was_present {
+        self.count += 1;
+    }
+    was_present
 }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (1)
- common/internal-types/src/protocol.rs (5 hunks)
🔇 Additional comments (4)
common/internal-types/src/protocol.rs (4)
Line 100: LGTM! Significant improvement in the false-positive rate. The addition of the `capacity` field and the reduction of `FALSE_POSITIVE_RATE` from 0.01% to 0.001% represent a 10x improvement in accuracy while maintaining the same 10M tag capacity. Also applies to: lines 104-109.
Lines 116-118: LGTM! Clear capacity accessor implementation. The capacity accessor method provides a clean interface to query the filter's maximum size.
Line range hint 284-369: LGTM! Comprehensive test coverage.
The tests effectively cover:
- Basic functionality and serialization
- Count tracking with duplicate entries
- Wrap-around behavior at capacity
- Edge cases with zero and one tags
This aligns well with the PR objectives of ensuring correct behavior after the crate update.
Lines 166-170: Verify that serialization cannot fail. The `to_bytes` method assumes serialization cannot fail. While this is likely true for the current implementation, it would be safer to handle potential errors.
Consider handling potential serialization errors:
```diff
-pub fn to_bytes(&self) -> Box<[u8]> {
-    bincode::serialize(&self)
-        .expect("serialization must not fail")
-        .into_boxed_slice()
+pub fn to_bytes(&self) -> Result<Box<[u8]>> {
+    Ok(bincode::serialize(&self)
+        .map_err(|e| CoreTypesError::SerializationError(e.to_string()))?
+        .into_boxed_slice())
 }
```

Note that propagating the error with `?` requires changing the return type to `Result`, so callers would need to be updated accordingly.