Fix TagBloomFilter behavior after updating the crate to 2.0 #6633
The `Bloom` filter no longer behaves as a ring buffer after the update to 2.0. This has been addressed in the `TagBloomFilter` wrapper, which ensures that the count of entries in the filter is correctly tracked. Also, the false-positive rate was decreased to 0.001% and the capacity is kept at 10M replay tags. Unit tests have been added to verify the new required behavior, and a benchmark was added for Bloom filter performance measurements.
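The wrap-around behavior described above can be sketched as follows. This is a minimal, self-contained illustration of the counting-and-reset idea only, using a `HashSet` as a stand-in for the probabilistic `Bloom` filter (the real `TagBloomFilter` wraps the `bloomfilter` crate with a 10M capacity and a 0.001% false-positive rate); the `CountedFilter` name and the tiny capacity are hypothetical:

```rust
use std::collections::HashSet;

// Hypothetical stand-in: a HashSet plays the role of the probabilistic
// `Bloom` filter so the counting/reset logic can run self-contained.
// The real filter has false positives; this sketch has none.
struct CountedFilter {
    set: HashSet<Vec<u8>>,
    count: usize,
    capacity: usize,
}

impl CountedFilter {
    fn new(capacity: usize) -> Self {
        Self { set: HashSet::new(), count: 0, capacity }
    }

    // Returns true if `tag` was already present, otherwise inserts it.
    // A new tag arriving at full capacity wipes the filter and starts
    // over, mimicking the wrap-around behavior the wrapper restores.
    fn check_and_set(&mut self, tag: &[u8]) -> bool {
        if self.count == self.capacity {
            let present = self.set.contains(tag);
            if !present {
                self.set.clear();
                self.set.insert(tag.to_vec());
                self.count = 1;
            }
            present
        } else {
            let newly_inserted = self.set.insert(tag.to_vec());
            if newly_inserted {
                self.count += 1;
            }
            !newly_inserted
        }
    }
}

fn main() {
    let mut f = CountedFilter::new(3);
    assert!(!f.check_and_set(b"a"));
    assert!(f.check_and_set(b"a")); // duplicate, count unchanged
    assert!(!f.check_and_set(b"b"));
    assert!(!f.check_and_set(b"c"));
    assert_eq!(f.count, 3); // at capacity
    assert!(!f.check_and_set(b"d")); // new tag at capacity: reset
    assert_eq!(f.count, 1);
    assert!(!f.check_and_set(b"a")); // "a" was dropped by the reset
    println!("ok");
}
```

The point of the sketch is the invariant the wrapper maintains: `count` only grows for genuinely new entries, and a new entry at full capacity restarts the filter with `count = 1`.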
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
common/internal-types/benches/bloom_filter.rs (2)
Lines 6-9: Consider pre-generating random data for initialization. Creating random bytes in a tight loop during initialization might affect system resources. Consider pre-generating a vector of random bytes before the benchmark.
```diff
-    for _ in 1..bloom.capacity() {
-        bloom.set(&hopr_crypto_random::random_bytes());
+    let random_data: Vec<_> = (0..bloom.capacity())
+        .map(|_| hopr_crypto_random::random_bytes())
+        .collect();
+    for tag in random_data {
+        bloom.set(&tag);
```
Lines 14-16: Consider adjusting benchmark parameters based on the false-positive rate. Given that the PR mentions a false-positive rate of 0.001%, the sample size of 100,000 might need adjustment to effectively capture this rate in the benchmarks.
Consider increasing the sample size or adding specific benchmarks to measure false positive occurrences.
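As background for sizing such a benchmark, the standard Bloom filter formulas give the scale implied by the stated parameters (10M items at a 1e-5 false-positive rate). This is a back-of-the-envelope sketch, independent of the actual crate's internals:

```rust
fn main() {
    let n = 10_000_000f64; // expected items (10M replay tags)
    let p = 1e-5f64;       // target false-positive rate (0.001%)

    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    let m = -n * p.ln() / (2f64.ln().powi(2));
    // Optimal number of hash functions: k = (m / n) * ln 2
    let k = (m / n) * 2f64.ln();

    println!("bits: {:.0} (~{:.1} MB)", m, m / 8.0 / 1e6);
    println!("bits per item: {:.2}", m / n);
    println!("hash functions: {:.1} (use {})", k, k.ceil());
}
```

For these parameters the formulas give roughly 24 bits per item (about 30 MB of filter state) and 17 hash functions, which explains why observing the 1e-5 rate empirically needs far more than 100,000 samples.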
common/internal-types/src/protocol.rs (2)
Lines 163-172: Consider adding size validation in from_bytes. While the implementation is correct, consider adding validation for the deserialized size to ensure it matches the expected capacity.
```diff
 pub fn from_bytes(data: &[u8]) -> Result<Self> {
-    bincode::deserialize(data).map_err(|e| CoreTypesError::ParseError(e.to_string()))
+    let filter: Self = bincode::deserialize(data)
+        .map_err(|e| CoreTypesError::ParseError(e.to_string()))?;
+    if filter.count > filter.capacity {
+        return Err(CoreTypesError::ParseError("Invalid count exceeds capacity".into()));
+    }
+    Ok(filter)
 }
```
Lines 326-368: LGTM! Comprehensive test coverage with room for improvement. The new tests thoroughly verify:
- Count management
- Capacity limits
- Wrap-around behavior
- Edge cases
Consider adding tests for:
- Concurrent access scenarios
- Serialization/deserialization with various capacities
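A concurrent-access test of the kind suggested above could be structured as in the following sketch. Since nothing in the diff shows `TagBloomFilter` to be `Sync`, this uses a `Mutex`-wrapped `HashSet` as a hypothetical stand-in, to demonstrate the test shape rather than the real type:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Hypothetical stand-in for a shared filter; the real type would
    // need its own synchronization strategy.
    let filter = Arc::new(Mutex::new(HashSet::<u64>::new()));

    let handles: Vec<_> = (0..4u64)
        .map(|t| {
            let filter = Arc::clone(&filter);
            thread::spawn(move || {
                // Each thread inserts a disjoint range of "tags".
                for i in 0..1_000u64 {
                    filter.lock().unwrap().insert(t * 1_000 + i);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }

    // All 4 * 1000 distinct tags must be present, none lost.
    assert_eq!(filter.lock().unwrap().len(), 4_000);
    println!("concurrent insert test passed");
}
```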
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by `!**/*.lock`
📒 Files selected for processing (3)
- common/internal-types/Cargo.toml (1 hunks)
- common/internal-types/benches/bloom_filter.rs (1 hunks)
- common/internal-types/src/protocol.rs (4 hunks)
🔇 Additional comments (8)
common/internal-types/benches/bloom_filter.rs (3)
Lines 1-3: LGTM! Imports are well-organized and appropriate.
The imports include all necessary Criterion features for benchmarking and correctly import TagBloomFilter from the prelude.
Lines 25-26: LGTM! Criterion setup is correct.
The benchmark configuration follows Criterion best practices.
Lines 6-7: Verify the capacity matches requirements.
The PR objectives mention a capacity of 10 million replay tags. Let's verify this is correctly configured.
✅ Verification successful
Let me search for the actual implementation of TagBloomFilter to verify the capacity configuration.
Let me search for the TagBloomFilter implementation to verify its capacity configuration.
Capacity matches requirements
The TagBloomFilter implementation has a DEFAULT_MAX_ITEMS constant set to 10 million (10_000_000) entries, which exactly matches the PR objectives. The implementation also includes proper handling of capacity limits with clear warning messages when capacity is reached.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Search for capacity configuration in the codebase
rg -A 5 "TagBloomFilter.*new|TagBloomFilter.*default"
```
Length of output: 2843
Script:
```shell
#!/bin/bash
# Search for TagBloomFilter implementation
ast-grep --pattern 'struct TagBloomFilter {
  $$$
}'
# Also search for any constants or configurations related to capacity
rg -A 5 "const.*CAPACITY|BLOOM_FILTER.*CAPACITY"
```
Length of output: 190662
Script:
```shell
#!/bin/bash
# Search for TagBloomFilter implementation
ast-grep --pattern 'impl TagBloomFilter {
  $$$
}'
# Also search for any constants or configurations related to TagBloomFilter capacity
rg "TagBloomFilter.*capacity|BLOOM_FILTER.*CAPACITY|BLOOM.*CAPACITY" -A 5
```
Length of output: 6798
common/internal-types/Cargo.toml (2)
Line 39: LGTM: Appropriate addition of the criterion dependency.
The criterion dependency is correctly added to dev-dependencies and follows the workspace versioning pattern, which is ideal for benchmarking purposes.
Lines 48-50: LGTM: Well-configured benchmark setup. The benchmark configuration is properly set up with `harness = false`, as required for criterion benchmarks. This will enable accurate performance measurements of the TagBloomFilter's new behavior.
common/internal-types/src/protocol.rs (3)
Lines 100-109: LGTM! The changes align with the PR objectives. The addition of the `capacity` field and the reduction in the false-positive rate to 0.001% match the stated goals. The rename of `MAX_ITEMS` to `DEFAULT_MAX_ITEMS` better reflects its purpose.
Lines 116-118: LGTM! The capacity-related methods are well implemented. The getter and constructor methods are simple, correct, and properly initialize all fields. Also applies to: lines 175-181.
Lines 140-159: Verify that the new check_and_set behavior handles edge cases correctly.
The implementation looks correct but has complex logic for handling full capacity scenarios. The count is properly maintained in both cases:
- At full capacity: Only resets if the tag is not present
- Below capacity: Increments count only for new items
Let's verify the behavior with edge cases:
✅ Verification successful
Let me analyze the code further by searching for tests related to the Bloom filter implementation.
Let me check the actual Bloom filter implementation and its tests to verify the behavior.
Based on the code review and tests, I can now provide a conclusive response about the check_and_set behavior:

The check_and_set implementation correctly handles edge cases and maintains the count. The implementation has been verified through the test cases and code inspection:

- The count is properly maintained: it increments only for new items and stays the same for existing items; the test `tag_bloom_filter_count` confirms this behavior.
- Capacity handling is correct: when at capacity, the filter only checks and conditionally resets; a reset clears the filter and starts fresh with count = 1; the test `tag_bloom_filter_wrap_around` verifies this behavior.
- The implementation matches the test expectations and handles all edge cases appropriately, including initial insertion, duplicate checks, capacity limits, and filter reset.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Search for tests that verify edge cases in check_and_set behavior
# Look for test cases involving capacity and count management
rg -l "check_and_set.*capacity|count" --type rust
```
Length of output: 227405
Script:
```shell
#!/bin/bash
# Search for tests related to the Bloom filter implementation
rg -l "bloom.*test|test.*bloom" --type rust
# Search for the actual Bloom filter implementation
rg -l "struct.*Bloom|impl.*Bloom" --type rust
```
Length of output: 467
Script:
```shell
#!/bin/bash
# Check the Bloom filter implementation and tests
cat transport/protocol/src/bloom.rs common/internal-types/src/protocol.rs vendor/cargo/bloomfilter-2.0.0/tests/bloom.rs
```
Length of output: 15512
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
common/internal-types/benches/bloom_filter.rs (3)
Lines 18-20: Document benchmark configuration choices. Consider adding comments explaining why these specific values were chosen:
- Sample size of 100,000
- Throughput of 1 element
```diff
 let mut group = c.benchmark_group("tag_bloom_filter");
+// Large sample size to ensure statistical significance
 group.sample_size(100_000);
+// Each operation processes one tag
 group.throughput(Throughput::Elements(1));
```
Lines 22-26: Consider expanding benchmark coverage. The benchmark only measures `check` operations. Consider adding benchmarks for:
- `set` operations
- Multiple consecutive operations to measure any caching effects
```diff
 group.bench_function("check", |b| {
     b.iter(|| {
         bloom.check(black_box(&existing_tag));
     })
 });
+
+group.bench_function("set", |b| {
+    b.iter(|| {
+        let tag = hopr_crypto_random::random_bytes();
+        bloom.set(black_box(&tag));
+    })
+});
```
Lines 28-33: Use more realistic test data for non-existent tags. Using a zeroed array might not represent real-world scenarios well and could lead to optimizations that wouldn't occur in production.
```diff
-let non_existent_tag = [0u8; PACKET_TAG_LENGTH];
+// Generate a random tag and ensure it's not in the filter
+let mut non_existent_tag = hopr_crypto_random::random_bytes();
+while bloom.check(&non_existent_tag) {
+    non_existent_tag = hopr_crypto_random::random_bytes();
+}
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (1)
- common/internal-types/benches/bloom_filter.rs (1 hunks)
🔇 Additional comments (1)
common/internal-types/benches/bloom_filter.rs (1)
Lines 7-12: Consider improving benchmark initialization robustness.
The current initialization has several potential issues:
- Setting `tag[0] = 0xaa` reduces entropy and may not represent real-world usage patterns
- There's no verification that the filter reaches the expected capacity (10M as per the PR description)
Consider this improvement:
```diff
 let mut bloom = TagBloomFilter::default();
+assert_eq!(bloom.capacity(), 10_000_000, "Unexpected filter capacity");
 for _ in 1..bloom.capacity() {
-    let mut tag = hopr_crypto_random::random_bytes();
-    tag[0] = 0xaa;
+    let tag = hopr_crypto_random::random_bytes();
     bloom.set(&tag);
 }
```
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
common/internal-types/src/protocol.rs (1)
Lines 139-157: Consider improving readability with early returns. The logic is correct but could be more readable. Consider restructuring with early returns:
```diff
 pub fn check_and_set(&mut self, tag: &PacketTag) -> bool {
-    if self.count == self.capacity {
-        let is_present = self.bloom.check(tag);
-        if !is_present {
-            warn!("maximum number of items in the Bloom filter reached!");
-            self.bloom.clear();
-            self.bloom.set(tag);
-            self.count = 1;
-        }
-        is_present
-    } else {
-        let was_present = self.bloom.check_and_set(tag);
-        if !was_present {
-            self.count += 1;
-        }
-        was_present
+    // Handle full capacity case
+    if self.count == self.capacity {
+        let is_present = self.bloom.check(tag);
+        if is_present {
+            return true;
+        }
+
+        // Reset filter for new item
+        warn!("maximum number of items in the Bloom filter reached!");
+        self.bloom.clear();
+        self.bloom.set(tag);
+        self.count = 1;
+        return false;
     }
+
+    // Handle normal case
+    let was_present = self.bloom.check_and_set(tag);
+    if !was_present {
+        self.count += 1;
+    }
+    was_present
 }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (1)
- common/internal-types/src/protocol.rs (5 hunks)
🔇 Additional comments (4)
common/internal-types/src/protocol.rs (4)
Line 100: LGTM! Significant improvement in the false-positive rate. The addition of the `capacity` field and the reduction of `FALSE_POSITIVE_RATE` from 0.01% to 0.001% represent a 10x improvement in accuracy while maintaining the same 10M tag capacity. Also applies to: lines 104-109.
Lines 116-118: LGTM! Clear capacity accessor implementation. The capacity accessor method provides a clean interface to query the filter's maximum size.
Line range hint 284-369: LGTM! Comprehensive test coverage.
The tests effectively cover:
- Basic functionality and serialization
- Count tracking with duplicate entries
- Wrap-around behavior at capacity
- Edge cases with zero and one tags
This aligns well with the PR objectives of ensuring correct behavior after the crate update.
Lines 166-170: Verify that serialization cannot fail. The `to_bytes` method assumes serialization cannot fail. While this is likely true for the current implementation, it would be safer to handle potential errors.
Consider handling potential serialization errors:
```diff
-pub fn to_bytes(&self) -> Box<[u8]> {
-    bincode::serialize(&self)
-        .expect("serialization must not fail")
-        .into_boxed_slice()
+pub fn to_bytes(&self) -> Result<Box<[u8]>> {
+    Ok(bincode::serialize(&self)
+        .map_err(|e| CoreTypesError::SerializationError(e.to_string()))?
+        .into_boxed_slice())
 }
```

Note that propagating the error with `?` requires changing the return type to `Result`, so callers would need to be updated accordingly.