
V2.0.0: Unlocking Deeper Performance Insights with Multi-Core Simulation and Enhanced Workloads

@liyuying0000 liyuying0000 released this 27 Jun 14:43
· 29 commits to main since this release

We are thrilled to announce the release of Fleetbench v2.0, a major milestone that significantly enhances our benchmarking suite's capability to accurately characterize system performance under realistic, concurrent workloads. This release introduces the powerful Multiprocessing Framework, alongside critical New Benchmarks (gRPC and SIMD), and substantial Improvements and Bug Fixes across the suite.

This version represents a substantial step forward in capturing system performance from diverse angles, enabling developers and performance engineers to gain granular insights into how important libraries behave in complex, multi-core environments.

πŸš€ New Features & Capabilities

Broadened Hardware & Environment Support

  • Runs on Emulation and Real Hardware: Fleetbench is now rigorously tested and validated for consistent performance measurement across both emulated environments and physical hardware. This ensures that development and testing workflows built on platforms like QEMU can accurately predict real-world performance characteristics, enabling a smoother transition from concept to development to deployment.

Multiprocessing Framework (/fleetbench/parallel/)

The new Fleetbench Multiprocessing framework is designed for precise CPU load simulation, moving beyond simplistic single-threaded measurements to analyze system behavior under controlled, concurrent loads.

  • Core Architecture: At its heart, parallel_bench.py orchestrates parallel benchmark execution. A central controller dynamically schedules Fleetbench binaries across a configurable pool of worker threads, distributed over multiple CPU cores.

  • Adaptive Load Simulation: Load maintenance is achieved through an adaptive scheduling approach. The controller continuously monitors real-time CPU utilization and dynamically adjusts its scheduling strategy to sustain the target CPU utilization.

  • Granular Control: We've introduced extensive customization options, including:

    • Workload Distribution Strategies: Users can define workload composition with strategies like WORKLOAD_WEIGHTED (based on aggregate workload runtime) or DCTAX_WEIGHTED (user-defined proportional weights via weights.csv), allowing for fine-tuned synthetic load generation.

    • Hyperthreading Control (x86_64): Advanced SMT state manipulation via --hyperthreading_mode enables detailed analysis of core contention and cache behavior.

    • Flexible Execution Parameters: Flags such as --duration, --num_cpus, and --workload_filter provide precise control over the benchmark environment.

  • Google Benchmark Integration: The framework seamlessly integrates with the underlying Google Benchmark library, supporting familiar flags like --benchmark_repetitions, --benchmark_filter, and --benchmark_perf_counters for detailed metric collection.
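The adaptive approach described above can be sketched in a few lines. This is a minimal illustration, not the actual parallel_bench.py logic; the function names, the one-worker-per-step adjustment, and the weights layout are all assumptions:

```python
import random

def pick_workload(weights):
    """Pick the next benchmark to schedule, proportionally to its weight.

    `weights` maps workload name -> weight, in the spirit of the
    DCTAX_WEIGHTED strategy's weights.csv (the layout here is illustrative).
    """
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names])[0]

def adjust_workers(current, observed_util, target_util, max_workers):
    """One control step: grow the worker pool while measured CPU
    utilization is below target, shrink it when above."""
    if observed_util < target_util and current < max_workers:
        return current + 1
    if observed_util > target_util and current > 1:
        return current - 1
    return current
```

A real controller repeats such a step every monitoring interval, launching or retiring Fleetbench binaries as the worker count changes.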

Usage

First, build two targets: one for the Fleetbench binary and one for the multiprocessing framework:

bazel build --config=clang --config=opt --config=haswell fleetbench:fleetbench
bazel build --config=clang --config=opt --config=haswell fleetbench/parallel:parallel_bench

Then run it with:

bazel-bin/fleetbench/parallel/parallel_bench --benchmark_target=bazel-bin/fleetbench/fleetbench

For more usage details, see the README.md or list the available flags via bazel-bin/fleetbench/parallel/parallel_bench --help.

New Benchmarks

We've expanded our suite with two crucial benchmarks representative of real-world workloads:

SIMD Benchmark

  • Purpose: Accurately measures the performance of Single Instruction, Multiple Data (SIMD) operations.

  • Workload: Based on the SIMD-heavy computational patterns from ScaNN LUT16, reflecting operations common in database query processing, cryptography, and approximate nearest neighbor search.

  • Mechanism: It calculates distance scores by indexing into query-specific Look-Up Tables (LUTs) with database item codes and accumulating the retrieved values, leveraging parallel data loading, table lookups, and accumulation to harness SIMD power. The benchmark focuses entirely on the performance of the SIMD-heavy lookup-and-accumulate loop.

  • Relevance: SIMD instructions are fundamental to high performance in modern computing, accounting for a large portion of CPU instructions in our fleet and growing rapidly.
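At its core, the lookup-and-accumulate loop is compact. The snippet below is a plain-Python illustration of the LUT16 pattern, not ScaNN's or Fleetbench's actual SIMD code; a vectorized implementation performs the same per-subspace table lookups and additions using SIMD shuffle and add instructions:

```python
def lut16_scores(codes, luts):
    """Compute one distance score per database item.

    codes: per-item lists of 4-bit codes, one code per subspace.
    luts:  per-subspace 16-entry lookup tables built for the query.
    """
    scores = []
    for item_codes in codes:
        total = 0
        for subspace, code in enumerate(item_codes):
            total += luts[subspace][code]  # table lookup + accumulate
        scores.append(total)
    return scores
```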

gRPC Benchmark

  • Purpose: Provides a realistic assessment of kernel and scheduling performance for remote procedure calls (RPC).

  • Workload: Utilizes synthesized representative protos reflecting common request/response patterns derived from real-world fleet traffic, similar to our existing Proto Benchmark.

  • Mechanism: Built upon the open-source gRPC framework, this benchmark employs a streamlined, asynchronous callback client/server architecture operating on a local host to minimize network interference.

  • Relevance: This benchmark addresses the need for accurately evaluating Hyperscale SoC performance under realistic and complex traffic patterns and server loads.

✨ Benchmark Updates & Enhancements

Overall Suite Improvements

  • Updated Fleet Data: All V1.0 benchmarks now use more recent fleet data to keep them representative of the current fleet.

  • Explicit Iteration Counts: Benchmarks have explicit iteration counts, ensuring more consistent and reproducible results.

  • Enhanced Stability with Warmup Phases: A warmup phase has been added to benchmarks to reduce initial variance, leading to more consistent performance measurements.

  • Accurate L3 Cache Size Detection on AMD Platforms: Fleetbench now correctly aggregates L3 cache size across all CCXs per socket, providing more accurate cold benchmark constructions.
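On Linux, per-CCX L3 instances are visible under /sys/devices/system/cpu/cpu*/cache/, where every CPU sharing a cache reports the same instance. As a rough sketch of the aggregation idea (this helper and its input shape are hypothetical, not Fleetbench's implementation), one can deduplicate cache instances by their shared-CPU set and sum what remains:

```python
def aggregate_l3_bytes(l3_instances):
    """Sum the distinct L3 cache slices of a socket.

    l3_instances: iterable of (shared_cpu_list, size_bytes) tuples, one
    entry per CPU's view of its level-3 cache. CPUs on the same CCX
    report the same shared-CPU set, so each slice is counted once.
    """
    seen = set()
    total = 0
    for shared_cpus, size_bytes in l3_instances:
        key = frozenset(shared_cpus)
        if key not in seen:
            seen.add(key)
            total += size_bytes
    return total
```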

Dedicated Benchmark Refinements

Proto Benchmark

  • Improved Representativeness: Re-implemented field-sampling logic for messages (now weight-based), better cold-message generation, improved enum fields, and smarter message-type generation with reused types.

  • Data Synthesis: Better distinguishes between synthesized data for varint and fixed-width integers.

  • Memory Optimization: Optimized memory usage for improved emulator compatibility.

Swissmap Benchmarks

  • Improved Capacity Sizing: More accurate Swissmap capacity sizing, including fleet size-capacity parameters.

  • New InsertMiss Benchmarks: Introduced InsertMiss_Hot and InsertMiss_Cold for measuring insertion performance of non-present elements.

  • Optimized Destructor Benchmarks: Adjusted batch sizes in IntDestructor and StrDestructor benchmarks to reduce overhead from helper functions for more accurate measurements.

  • Improved Hash Function: Updated to use a low-cost hash function for better entropy with random 32-bit integer keys.
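For intuition on why a cheap hash suffices here: multiplying by a large odd constant modulo 2^32 is a bijection that scatters even sequential keys, so a single-multiply (Fibonacci-style) hash gives good bucket entropy at minimal cost. The sketch below is illustrative, not the hash Fleetbench actually adopted:

```python
def fib_hash32(key):
    """Fibonacci hashing: multiply by 2**32 / golden ratio (an odd
    constant, 2654435769), keeping the low 32 bits."""
    return (key * 2654435769) & 0xFFFFFFFF
```

Because the multiply is a bijection on 32-bit values, distinct 32-bit keys never collide before bucket reduction.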

LIBC Benchmarks

  • Realistic Branching Behavior: Incorporated a more fleet-representative branching pattern for realistic branch prediction.

  • Improved memcmp & bcmp Benchmarks: Now use the same source and destination buffer to correctly account for buffer overlaps.

  • memmove and Compare Benchmarks Fix: Corrected buffer size calculation for non-overlapping destination addresses, preventing potential infinite loops.

  • Integer Overflow Protection: Added checks for maximum supported L3 cache size to enhance robustness.
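The memmove buffer-size fix above comes down to a simple invariant: two n-byte regions overlap exactly when each starts before the other ends, so a shared backing buffer must be large enough to place the destination wholly past the source. A hypothetical helper (not Fleetbench's code) makes the arithmetic explicit:

```python
def regions_overlap(src_off, dst_off, n):
    """True if [src_off, src_off+n) and [dst_off, dst_off+n) intersect."""
    return src_off < dst_off + n and dst_off < src_off + n

def min_backing_buffer(max_copy_size):
    """Smallest buffer in which an n-byte destination (n <= max_copy_size)
    can always be placed after a non-overlapping n-byte source at offset 0."""
    return 2 * max_copy_size
```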

πŸ› Bug Fixes

We also fixed a series of bugs across the suite to improve stability, accuracy, and reliability.

πŸš€ Get Started

We encourage everyone to try Fleetbench v2.0 for performance analysis and let us know what you think!

πŸ™Œ Special Thanks to Our Contributors!

This release is a testament to the power of collaborative development. We extend our deepest gratitude to everyone who contributed to Fleetbench! Your insightful feedback, diligent bug reports, and valuable code contributions have been instrumental in making this release a reality and significantly advancing the capabilities of our benchmarking suite. A big thank you to everyone! 🎊🎊🎊