Skip to content

Conversation

fitzgen
Copy link
Member

@fitzgen fitzgen commented Apr 22, 2025

This provides an improvement across the board for our sync/no-hook benchmarks:

Benchmark Results
$ cargo bench --profile profiling --bench call '\bsync/no-hook' -- --baseline main
    Finished `profiling` profile [optimized + debuginfo] target(s) in 0.28s
     Running benches/call.rs (target/profiling/deps/call-b0a2bedd3336ad76)
sync/no-hook/core - host-to-wasm - typed - nop
                        time:   [27.334 ns 27.499 ns 27.668 ns]
                        change: [-16.388% -14.870% -13.479%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
sync/no-hook/core - host-to-wasm - untyped - nop
                        time:   [44.141 ns 44.429 ns 44.757 ns]
                        change: [-18.380% -17.041% -15.670%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
sync/no-hook/core - host-to-wasm - unchecked - nop
                        time:   [29.731 ns 29.983 ns 30.262 ns]
                        change: [-25.104% -22.176% -19.159%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
sync/no-hook/core - host-to-wasm - typed - nop-params-and-results
                        time:   [28.990 ns 29.143 ns 29.303 ns]
                        change: [-25.804% -24.562% -23.372%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
sync/no-hook/core - host-to-wasm - untyped - nop-params-and-results
                        time:   [110.00 ns 110.65 ns 111.46 ns]
                        change: [-11.967% -9.0070% -6.1347%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
sync/no-hook/core - host-to-wasm - unchecked - nop-params-and-results
                        time:   [58.828 ns 59.089 ns 59.418 ns]
                        change: [-15.596% -13.573% -11.484%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

sync/no-hook/core - wasm-to-host - typed - nop
                        time:   [6.6209 ns 6.6615 ns 6.7077 ns]
                        change: [-53.555% -52.878% -52.116%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
sync/no-hook/core - wasm-to-host - typed - nop-params-and-results
                        time:   [7.9783 ns 8.0173 ns 8.0611 ns]
                        change: [-54.341% -53.947% -53.505%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
sync/no-hook/core - wasm-to-host - untyped - nop
                        time:   [18.306 ns 18.393 ns 18.491 ns]
                        change: [-29.104% -28.127% -27.171%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
sync/no-hook/core - wasm-to-host - untyped - nop-params-and-results
                        time:   [67.741 ns 68.120 ns 68.601 ns]
                        change: [-26.453% -25.061% -23.663%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe
sync/no-hook/core - wasm-to-host - unchecked - nop
                        time:   [6.8379 ns 6.8915 ns 6.9566 ns]
                        change: [-55.623% -55.062% -54.481%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
sync/no-hook/core - wasm-to-host - unchecked - nop-params-and-results
                        time:   [27.856 ns 28.024 ns 28.214 ns]
                        change: [-17.320% -16.103% -15.038%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

sync/no-hook/component - host-to-wasm - typed - nop
                        time:   [55.126 ns 55.506 ns 55.932 ns]
                        change: [-19.458% -18.098% -16.736%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
sync/no-hook/component - host-to-wasm - untyped - nop
                        time:   [101.42 ns 102.06 ns 102.82 ns]
                        change: [-15.679% -14.108% -12.523%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
sync/no-hook/component - host-to-wasm - typed - nop-params-and-results
                        time:   [61.482 ns 62.017 ns 62.591 ns]
                        change: [-16.576% -15.100% -13.595%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe
sync/no-hook/component - host-to-wasm - untyped - nop-params-and-results
                        time:   [223.50 ns 224.72 ns 226.05 ns]
                        change: [-21.732% -20.178% -18.679%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

sync/no-hook/component - wasm-to-host - typed - nop
                        time:   [39.115 ns 39.295 ns 39.500 ns]
                        change: [-15.139% -13.886% -12.721%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe
sync/no-hook/component - wasm-to-host - typed - nop-params-and-results
                        time:   [47.234 ns 47.458 ns 47.745 ns]
                        change: [-13.833% -11.951% -9.8784%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
sync/no-hook/component - wasm-to-host - untyped - nop
                        time:   [52.311 ns 52.556 ns 52.817 ns]
                        change: [-12.736% -11.712% -10.693%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
sync/no-hook/component - wasm-to-host - untyped - nop-params-and-results
                        time:   [239.71 ns 241.59 ns 244.11 ns]
                        change: [-29.804% -28.173% -26.415%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

Depends on #10626

@fitzgen fitzgen requested review from a team as code owners April 22, 2025 19:53
@fitzgen fitzgen requested review from alexcrichton and removed request for a team April 22, 2025 19:53
Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised about the #[inline] on dynamically typed things since there are so many other inefficiences related to that, but otherwise at a high level looks good

@@ -1178,6 +1179,7 @@ impl Func {
/// This must be called just before `call_impl_do_call`.
///
/// Returns whether we need to GC before calling `call_impl_do_call`.
#[inline]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is dealing with a dynamic Val does #[inline] really help much here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one was indeed showing up in profiles.

@@ -943,6 +943,7 @@ impl Func {
/// `StoreOpaque` while the `FuncType` is also being used (from the
/// perspective of the borrow-checker) because otherwise the signature would
/// consider `StoreOpaque` borrowed mutable while `FuncType` is in use.
#[inline]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems related to the dynamic calls case which we don't tend to try to put too much effort into optimizing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was hot in a benchmark that was regressing compared to main; easy to fix; don't see why we wouldn't do this.

@@ -179,6 +180,7 @@ where
///
/// If `Self::need_gc_before_call_raw`, then the caller must have done a GC
/// just before calling this method.
#[inline]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty big function, does inlining this really help that much?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one was also showing up in profiles.

@fitzgen
Copy link
Member Author

fitzgen commented Apr 22, 2025

I'm a bit surprised about the #[inline] on dynamically typed things since there are so many other inefficiences related to that, but otherwise at a high level looks good

I didn't measure every #[inline] independently, I just looked at which functions were at the top of the perf report and marked them #[inline] (and some of their callees) if it seemed like it made sense, and then I measured the benchmarks with all the changes together. So it is possible that some of these are not strictly necessary, but I think that is fine, since testing all combinations of #[inline]s here would take forever.

fitzgen added 2 commits April 22, 2025 14:42
This provides an improvement across the board for our `sync/no-hook` benchmarks:

<details>

<summary>Benchmark Results</summary>

```
$ cargo bench --profile profiling --bench call '\bsync/no-hook' -- --baseline main
    Finished `profiling` profile [optimized + debuginfo] target(s) in 0.28s
     Running benches/call.rs (target/profiling/deps/call-b0a2bedd3336ad76)
sync/no-hook/core - host-to-wasm - typed - nop
                        time:   [27.334 ns 27.499 ns 27.668 ns]
                        change: [-16.388% -14.870% -13.479%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
sync/no-hook/core - host-to-wasm - untyped - nop
                        time:   [44.141 ns 44.429 ns 44.757 ns]
                        change: [-18.380% -17.041% -15.670%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
sync/no-hook/core - host-to-wasm - unchecked - nop
                        time:   [29.731 ns 29.983 ns 30.262 ns]
                        change: [-25.104% -22.176% -19.159%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
sync/no-hook/core - host-to-wasm - typed - nop-params-and-results
                        time:   [28.990 ns 29.143 ns 29.303 ns]
                        change: [-25.804% -24.562% -23.372%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
sync/no-hook/core - host-to-wasm - untyped - nop-params-and-results
                        time:   [110.00 ns 110.65 ns 111.46 ns]
                        change: [-11.967% -9.0070% -6.1347%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
sync/no-hook/core - host-to-wasm - unchecked - nop-params-and-results
                        time:   [58.828 ns 59.089 ns 59.418 ns]
                        change: [-15.596% -13.573% -11.484%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

sync/no-hook/core - wasm-to-host - typed - nop
                        time:   [6.6209 ns 6.6615 ns 6.7077 ns]
                        change: [-53.555% -52.878% -52.116%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
sync/no-hook/core - wasm-to-host - typed - nop-params-and-results
                        time:   [7.9783 ns 8.0173 ns 8.0611 ns]
                        change: [-54.341% -53.947% -53.505%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
sync/no-hook/core - wasm-to-host - untyped - nop
                        time:   [18.306 ns 18.393 ns 18.491 ns]
                        change: [-29.104% -28.127% -27.171%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
sync/no-hook/core - wasm-to-host - untyped - nop-params-and-results
                        time:   [67.741 ns 68.120 ns 68.601 ns]
                        change: [-26.453% -25.061% -23.663%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe
sync/no-hook/core - wasm-to-host - unchecked - nop
                        time:   [6.8379 ns 6.8915 ns 6.9566 ns]
                        change: [-55.623% -55.062% -54.481%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
sync/no-hook/core - wasm-to-host - unchecked - nop-params-and-results
                        time:   [27.856 ns 28.024 ns 28.214 ns]
                        change: [-17.320% -16.103% -15.038%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

sync/no-hook/component - host-to-wasm - typed - nop
                        time:   [55.126 ns 55.506 ns 55.932 ns]
                        change: [-19.458% -18.098% -16.736%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
sync/no-hook/component - host-to-wasm - untyped - nop
                        time:   [101.42 ns 102.06 ns 102.82 ns]
                        change: [-15.679% -14.108% -12.523%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
sync/no-hook/component - host-to-wasm - typed - nop-params-and-results
                        time:   [61.482 ns 62.017 ns 62.591 ns]
                        change: [-16.576% -15.100% -13.595%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe
sync/no-hook/component - host-to-wasm - untyped - nop-params-and-results
                        time:   [223.50 ns 224.72 ns 226.05 ns]
                        change: [-21.732% -20.178% -18.679%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

sync/no-hook/component - wasm-to-host - typed - nop
                        time:   [39.115 ns 39.295 ns 39.500 ns]
                        change: [-15.139% -13.886% -12.721%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe
sync/no-hook/component - wasm-to-host - typed - nop-params-and-results
                        time:   [47.234 ns 47.458 ns 47.745 ns]
                        change: [-13.833% -11.951% -9.8784%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
sync/no-hook/component - wasm-to-host - untyped - nop
                        time:   [52.311 ns 52.556 ns 52.817 ns]
                        change: [-12.736% -11.712% -10.693%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
sync/no-hook/component - wasm-to-host - untyped - nop-params-and-results
                        time:   [239.71 ns 241.59 ns 244.11 ns]
                        change: [-29.804% -28.173% -26.415%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
```

</details>
@fitzgen fitzgen force-pushed the speed-up-call-benches branch from 29bd960 to f6763ee Compare April 22, 2025 21:42
@fitzgen fitzgen requested a review from alexcrichton April 22, 2025 21:43
Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that I'm far more inline-averse than most others, but at the same time I don't think that I'm un-founded when I push back against #[inline] either. For example I don't doubt that dynamic Val-style call paths can benefit from some inlining, but that's also not the purpose of those code paths since they're explicitly de-optimized in many other ways as well. Personally I'd prefer to avoid manual #[inline] annotations there and let the compiler figure things out. If the compiler isn't the best at figuring things out, which it isn't for these benchmarks, then personally I'd prefer to just shrug and move on rather than add annotations to optimize an explicitly non-optimal path.

Another reason I'm personally pretty inline-averse is that it's such a brittle metric. For example TypedFunc::call_raw is a pretty big function here to be inlining. I don't doubt that it's in the profile but I'm also unsure whether inlining shows benefit, e.g. if it basically just shifted all its cost to the caller without further benefitting from inlining. This sort of code golfing is IMO brittle in the long-term where just because it improves the micro-benchmark now doesn't mean it'll continue to improve further into the future.

Overall I'm personally fine with #[inline] on "obvious small functions" like one-liners or similar. Bigger functions I'm less sure about and honestly IMO are basically just not worth annotation. In theory we should be measuring the attribute on big functions to see if it's worth it and most of the time I suspect that the effort to measure far outweighs the effort gained from a possible inlining.

At the end of the day I would relatively strongly prefer that #[inline] were removed from anything related to Val. I would then slightly prefer to remove #[inline] on anything bigger than one line basically. For this latter point I'll leave it up to you though since I realize the cost is not really worth debating over as well. Although that's also not quite true since if we #[inline] everything it's obviously quite bad for compile times and binary sizes, so there's some line somewhere I just tend to be far more conservative than others I think.

@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Apr 22, 2025
@fitzgen fitzgen enabled auto-merge April 24, 2025 15:48
@fitzgen
Copy link
Member Author

fitzgen commented Apr 24, 2025

I've removed #[inline] from the larger methods.

@fitzgen fitzgen added this pull request to the merge queue Apr 24, 2025
Merged via the queue into bytecodealliance:main with commit cb484dd Apr 24, 2025
41 checks passed
@fitzgen fitzgen deleted the speed-up-call-benches branch April 24, 2025 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wasmtime:api Related to the API of the `wasmtime` crate itself
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants