V2.0.0 Unlocking Deeper Performance Insights with Multi-Core Simulation and Enhanced Workloads
We are thrilled to announce the release of Fleetbench v2.0, a major milestone that significantly enhances our benchmarking suite's ability to accurately characterize system performance under realistic, concurrent workloads. This release introduces the powerful Multiprocessing Framework, alongside critical new benchmarks (gRPC and SIMD), and substantial improvements and bug fixes across the suite.
This version represents a substantial step forward in capturing system performance from diverse angles, enabling developers and performance engineers to gain granular insights into how important libraries behave in complex, multi-core environments.
New Features & Capabilities
Broadened Hardware & Environment Support
- Runnability on Emulation and Real Hardware: Fleetbench is now rigorously tested and validated for consistent performance measurement across both emulated environments and physical hardware. This ensures that development and testing workflows using platforms like QEMU can accurately predict real-world performance characteristics, enabling a more seamless transition from concept to development to deployment.
Multiprocessing Framework (/fleetbench/parallel/)
The new Fleetbench Multiprocessing framework is designed for precise CPU load simulation, moving beyond simplistic single-threaded measurements to analyze system behavior under controlled, concurrent loads.
- Core Architecture: At its heart, `parallel_bench.py` orchestrates parallel benchmark execution. A central controller dynamically schedules Fleetbench binaries across a configurable pool of worker threads, distributed over multiple CPU cores.
- Adaptive Load Simulation: Load maintenance is achieved through an adaptive scheduling approach. The controller continuously monitors real-time CPU utilization and dynamically adjusts the benchmark scheduling strategy to ensure sustained target CPU utilization.
- Granular Control: We've introduced extensive customization options, including:
  - Workload Distribution Strategies: Users can define workload composition with strategies like `WORKLOAD_WEIGHTED` (based on aggregate workload runtime) or `DCTAX_WEIGHTED` (user-defined proportional weights via `weights.csv`), allowing for fine-tuned synthetic load generation.
  - Hyperthreading Control (x86_64): Advanced SMT state manipulation via `--hyperthreading_mode` enables detailed analysis of core contention and cache behavior.
  - Flexible Execution Parameters: Flags such as `--duration`, `--num_cpus`, and `--workload_filter` provide precise control over the benchmark environment.
- Google Benchmark Integration: The framework seamlessly integrates with the underlying Google Benchmark library, supporting familiar flags like `--benchmark_repetitions`, `--benchmark_filter`, and `--benchmark_perf_counters` for detailed metric collection.
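To make the adaptive scheduling idea concrete, here is a minimal sketch of a control loop that grows or shrinks the number of concurrently scheduled workers to hold a target CPU utilization. The class and parameter names are hypothetical illustrations, not the actual `parallel_bench.py` internals; the utilization reader is injected so the sketch stays self-contained.

```python
# Illustrative sketch of an adaptive load controller: sample CPU utilization,
# then add or retire workers to converge on a target load. Names are
# hypothetical; the real parallel_bench.py logic is more sophisticated.

class AdaptiveController:
    def __init__(self, target_utilization, max_workers, read_utilization):
        self.target = target_utilization          # e.g. 0.80 for 80% CPU
        self.max_workers = max_workers            # upper bound on concurrency
        self.read_utilization = read_utilization  # callable returning 0.0-1.0
        self.active_workers = 1                   # start with a single worker

    def step(self):
        """One control-loop iteration: compare measured load against target."""
        measured = self.read_utilization()
        if measured < self.target and self.active_workers < self.max_workers:
            self.active_workers += 1   # under target: schedule one more worker
        elif measured > self.target and self.active_workers > 1:
            self.active_workers -= 1   # over target: retire one worker
        return self.active_workers

# Simulated machine where each worker contributes ~20% load.
controller = AdaptiveController(
    target_utilization=0.8,
    max_workers=8,
    read_utilization=lambda: 0.2 * controller.active_workers,
)
for _ in range(10):
    controller.step()
print(controller.active_workers)  # settles at 4 workers (~80% simulated load)
```

In the real framework the utilization reading would come from the operating system rather than a simulated lambda; the point is only the feedback loop between measured load and scheduling decisions.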
Usage
First, build two targets: one for the Fleetbench binary and one for the multiprocessing framework:

```
bazel build --config=clang --config=opt --config=haswell fleetbench:fleetbench
bazel build --config=clang --config=opt --config=haswell fleetbench/parallel:parallel_bench
```

Then run with:

```
bazel-bin/fleetbench/parallel/parallel_bench --benchmark_target=bazel-bin/fleetbench/fleetbench
```

For more usage details, please check the README.md or list all flags via `bazel-bin/fleetbench/parallel/parallel_bench --help`.
New Benchmarks
We've expanded our suite with two crucial, real-world representative benchmarks:
SIMD Benchmark
- Purpose: Accurately measures the performance of Single Instruction, Multiple Data (SIMD) operations.
- Workload: Based on the SIMD-heavy computational patterns from ScaNN LUT16, reflecting operations common in database query processing, cryptography, and approximate nearest neighbor search.
- Mechanism: It calculates distance scores by indexing into query-specific Look-Up Tables (LUTs) using database item codes and accumulating retrieved values. It leverages parallel data loading, table lookups, and accumulation to harness SIMD power. The benchmark focuses entirely on the performance of the SIMD-heavy lookup-and-accumulate loop.
- Relevance: SIMD instructions are fundamental to high performance in modern computing, accounting for a large portion of CPU instructions in our fleet and growing rapidly.
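For readers unfamiliar with the pattern, here is a scalar sketch of a LUT16-style lookup-and-accumulate loop. The data values and subspace layout are invented for illustration; the real benchmark operates on packed 4-bit codes with SIMD shuffle instructions, which this plain-Python version does not attempt to reproduce.

```python
# Scalar sketch of the lookup-and-accumulate pattern: each database item is a
# sequence of 4-bit codes (0-15), one per subspace, and the query supplies one
# 16-entry look-up table per subspace. The distance score is the sum of the
# LUT entries selected by the item's codes. Real implementations vectorize
# this loop with SIMD shuffles over packed codes.

def distance_score(item_codes, query_luts):
    """Accumulate one LUT entry per subspace, indexed by the item's code."""
    return sum(lut[code] for lut, code in zip(query_luts, item_codes))

# Two subspaces, so two 16-entry LUTs (entry values here are arbitrary).
query_luts = [list(range(16)), list(range(0, 32, 2))]
item_codes = [3, 5]  # one 4-bit code per subspace
print(distance_score(item_codes, query_luts))  # 3 + 10 = 13
```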
gRPC Benchmark
- Purpose: Provides a realistic assessment of kernel and scheduling performance for remote procedure calls (RPC).
- Workload: Utilizes synthesized representative protos reflecting common request/response patterns derived from real-world fleet traffic, similar to our existing Proto Benchmark.
- Mechanism: Built upon the open-source gRPC framework, this benchmark employs a streamlined, asynchronous callback client/server architecture operating on a local host to minimize network interference.
- Relevance: This benchmark addresses the need to accurately evaluate Hyperscale SoC performance under realistic and complex traffic patterns and server loads.
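The localhost client/server setup can be illustrated with a minimal request/response round trip. Note this sketch uses Python's standard-library sockets as a stand-in, not gRPC's asynchronous callback API, which the actual benchmark is built on; the point it demonstrates is that running both endpoints on the loopback interface keeps the measurement focused on kernel and scheduling cost rather than network latency.

```python
# Minimal localhost client/server round trip: both endpoints live on the
# loopback interface, so timing this exchange would measure kernel and
# scheduling overhead rather than network transit. This is a stdlib stand-in
# for illustration only; the real benchmark uses gRPC.
import socket
import threading

def serve(listener):
    conn, _ = listener.accept()
    with conn:
        request = conn.recv(1024)
        conn.sendall(request)  # echo the request back as the "response"

listener = socket.create_server(("127.0.0.1", 0))  # loopback, ephemeral port
port = listener.getsockname()[1]
threading.Thread(target=serve, args=(listener,), daemon=True).start()

with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"request")
    reply = client.recv(1024)
print(reply)  # b'request'
```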
Benchmark Updates & Enhancements
Overall Suite Improvements
- Updated Fleet Data: All V1.0 benchmarks now use more recent fleet data for continued representativeness.
- Explicit Iteration Counts: Benchmarks now have explicit iteration counts, ensuring more consistent and reproducible results.
- Enhanced Stability with Warmup Phases: A warmup phase has been added to benchmarks to reduce initial variance, leading to more consistent performance measurements.
- Accurate L3 Cache Size Detection on AMD Platforms: Fleetbench now correctly aggregates L3 cache size across all CCXs per socket, providing more accurate `cold` benchmark constructions.
Dedicated Benchmark Refinements
Proto Benchmark
- Improved Representativeness: Re-implemented the logic for field sample messages (now weight-based), better cold message generation, improved enum fields, and smarter message type generation with reused types.
- Data Synthesis: Better distinction between data synthesized for varint and fixed integers.
- Memory Optimization: Optimized memory usage for improved emulator compatibility.
Swissmap Benchmarks
- Improved Capacity Sizing: More accurate Swissmap capacity sizing, now incorporating fleet size-capacity parameters.
- New InsertMiss Benchmarks: Introduced `InsertMiss_Hot` and `InsertMiss_Cold` for measuring insertion performance of non-present elements.
- Optimized Destructor Benchmarks: Adjusted batch sizes in the `IntDestructor` and `StrDestructor` benchmarks to reduce overhead from helper functions, for more accurate measurements.
- Improved Hash Function: Updated to use a low-cost hash function with better entropy for random 32-bit integer keys.
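An insert-miss measurement can be sketched in a few lines: every inserted key is guaranteed absent, so each operation exercises the full probe-and-insert path. This rough Python analogue uses a dict in place of Swissmap and an arbitrary odd multiplier to generate distinct 32-bit keys; it only illustrates the measurement pattern, not the benchmark's actual methodology.

```python
# Rough analogue of an insert-miss benchmark: time insertions where the key is
# never already present in the table. Python's dict stands in for Swissmap.
import time

def time_insert_miss(num_keys):
    table = {}
    # Multiplying by an odd constant is a bijection mod 2**32, so all keys
    # are distinct 32-bit integers (constant chosen arbitrarily here).
    keys = [(k * 2654435761) & 0xFFFFFFFF for k in range(num_keys)]
    start = time.perf_counter()
    for k in keys:
        table[k] = k  # key is never present: every insertion is a miss
    elapsed = time.perf_counter() - start
    return elapsed / num_keys, len(table)

per_op, size = time_insert_miss(100_000)
print(size)  # 100000: every insert was a miss
```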
LIBC Benchmarks
- Realistic Branching Behavior: Incorporated a more fleet-representative branching pattern for realistic branch prediction.
- Improved `memcmp` & `bcmp` Benchmarks: Now use the same source and destination buffer to correctly account for buffer overlaps.
- `memmove` and Compare Benchmarks Fix: Corrected buffer size calculation for non-overlapping destination addresses, preventing potential infinite loops.
- Integer Overflow Protection: Added checks for the maximum supported L3 cache size to enhance robustness.
Bug Fixes
We also fixed a series of bugs across the suite to improve stability, accuracy, and reliability.
Get Started
We encourage everyone to try Fleetbench v2.0 for your performance analysis and let us know what you think!
Special Thanks to Our Contributors!
This release is a testament to the power of collaborative development. We extend our deepest gratitude to everyone who contributed to Fleetbench! Your insightful feedback, diligent bug reports, and valuable code contributions have been instrumental in making this release a reality and significantly advancing the capabilities of our benchmarking suite. A big thank you to everyone!