WebAssembly Updates #5097
Conversation
This has been split off from #5097 to make reviewing simpler; it modifies a bunch of tests for better use with WASM:

- Everything in auto_schedule is skipped, since the Mullapudi2016 autoscheduler assert-fails for most non-real-CPU backends.
- Everything in performance is skipped, because we'll be doing wasm "jit" testing with an interpreter, so performance results would be meaningless.
- Added/updated the skip message for tests that exercise things that wasm doesn't support yet (e.g. atomics, threads).
- For a handful of tests that are unreasonably slow under the wasm "jit" interpreter, dialed down some of the test parameters to avoid test timeouts.
- Removed some wasm-related skips that won't be necessary any longer.
(force-pushed from 2ffa61c to 6785c98)
Cherry-picking some CMake-related stuff from #5097 to make it easier to review:

- Add an optional COMMAND argument to add_halide_test(); this allows us to customize the command for execution (e.g. to run generated .wasm with a shell tool).
- Add some missing DEPENDS in a few places.
- Add an optional EXTRA_LIBS argument to halide_define_aot_test(); this allows us to pass extra dependencies rather than requiring separate calls to target_link_libraries().

That last one is a little odd, so let me expand: the intent here is that (when the wasm changes land) some of the tests won't be usable under wasm (e.g. matlab), and halide_define_aot_test() will handle these by just skipping those targets entirely. This means that we can effectively centralize the blocklist in one place, and then the callers of halide_define_aot_test() can just do something like

```cmake
halide_define_aot_test(matlab)
if (TARGET generator_aot_matlab)
  set_target_properties(generator_aot_matlab PROPERTIES ENABLE_EXPORTS True)
endif ()
```

...i.e., just assume that the target may not be defined if it's blocklisted for some reason. (Open to better suggestions on this last one, but I felt it was better than spraying lots of checks for "if wasm blah blah" thru this entire file.)
Cherry-picking a change from #5097 so that reviewing that PR will be simpler; this just adds another wasm feature option that the tooling wasn't ready for before, but is now (saturating float-to-int conversion).
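For context, wasm's saturating float-to-int instructions clamp out-of-range inputs rather than trapping. A minimal scalar sketch of the `i32.trunc_sat_f32_s` semantics (this helper is illustrative, not Halide code):

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Illustrative scalar model of wasm's i32.trunc_sat_f32_s semantics:
// NaN maps to 0; out-of-range values clamp to INT32_MIN/INT32_MAX
// instead of trapping (or invoking undefined behavior, as a plain C cast would).
int32_t trunc_sat_f32_s(float x) {
    if (std::isnan(x)) return 0;
    if (x <= -2147483648.0f) return std::numeric_limits<int32_t>::min();
    if (x >= 2147483648.0f) return std::numeric_limits<int32_t>::max();
    return static_cast<int32_t>(x);  // in-range: truncate toward zero
}
```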
re: the GIT_SUBMODULES CMake bug. It was fixed: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/4729

...but presumably in 3.18+, so we can't rely on it yet?

Ah, I misunderstood your comment in the code. I thought that the bug was introduced in 3.18. Regardless, leaving
(force-pushed from 7f10ff7 to 95b3ed5)
There's more wasm-related work to do in the future (notably, getting benchmarks/performance tests working), but I think this is a good enough stopping point that we should look at reviewing and landing it. (Note that wasm testing isn't enabled on the buildbots yet -- it never has been! -- but should be possible now thanks to these changes. After this lands, I'll update the buildbots accordingly.)
README_webassembly.md
```diff
- `HL_JIT_TARGET=wasm-32-wasmrt-wasm_simd128`) and run normally. The test suites
- which we have vetted to work include correctness, performance, error, and
- warning. (Some of the others could likely be made to work with modest effort.)
+ To run the JIT tests, set `HL_JIT_TARGET=wasm-32-wasmrt` (possibly adding `wasm_simd128`, `wasm_signext`, and/or `wasm_sat_float_to_int`) and run CMake/CTest normally. Note that wasm testing is only supported under CMake (not via Make).
```
line wrapping in this file seems necessary. I think @alexreinking used a markdown auto-formatter at some point that might do it?
Yes, it's called prettier.io. You can install it from NPM. I like it quite a bit. I make sure to always pass `--prose-wrap always` at the CLI so that it reflows paragraphs nicely.
```cpp
{Target::WasmSimd128, true, UInt(16, 8), 0, "llvm.wasm.avgr.unsigned.v8i16", u16(((wild_u32x_ + wild_u32x_) + 1) >> 1)},

// TODO: LLVM should support this directly, but doesn't yet.
// To make this work, we need to be able to call the intrinsics with two vecs.
```
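A scalar sketch of what that pattern computes: the rounding unsigned average, done in a wider type so the intermediate sum can't overflow (this helper is illustrative, not the actual Halide pattern code):

```cpp
#include <cstdint>

// Rounding unsigned average: widen to u32, add, add 1 to round up,
// halve, narrow back to u16. This is the per-lane behavior that the
// llvm.wasm.avgr.unsigned.v8i16 intrinsic (wasm i16x8.avgr_u) provides.
inline uint16_t avgr_u16(uint16_t a, uint16_t b) {
    return static_cast<uint16_t>((static_cast<uint32_t>(a) + b + 1) >> 1);
}
```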
The way I've had to do this in the past is with force-inlined implementations that accept the wider vec, e.g. see packsswbx16 in src/runtime/x86.ll
One of the wasm/LLVM folks has been working on improving these codegen issues in their backend, so I haven't expended much energy on trying to improve it here yet. I'll add a note here in case circumstances require us to improve it sooner.
```cpp
String::Utf8Value str(isolate, key);
wdebug(0) << "\t" << i + 1 << ". " << *str << "\n";
}
```

```cpp
// lld will temporarily hijack the signal handlers to ensure that temp files get cleaned up,
```
This sounded painful to debug.
it wasn't the worst thing I've ever had to do, but it wasn't much fun, either
src/WasmExecutor.cpp
```cpp
inline void ExtractAndStoreScalar<int64_t>::operator()(const Local<Context> &context, const Local<Value> &val, void *slot) {
    internal_error << "TODO: 64-bit slots aren't yet supported";
```

```cpp
std::string to_string(const wabt::MemoryStream &m) {
    // TODO: ugh
```
Please elaborate. Are you talking about the weird requirement to const cast?
Yeah, that comment doesn't belong there. Removing.
Can I get final review/approval for this? (Only failure is unrelated Windows GPU failure)

I'd like to check that the commands in the README still work even with an LLVM that doesn't have WebAssembly/WABT/whatever, since the Ubuntu-bundled one doesn't.

cool, will wait
There might be some weirdness if you try to use this with WITH_WASM_SHELL
disabled, but we'll cross that bridge when we come to it.
will be necessary once halide/Halide#5097 lands.
* Support using V8 as the Wasm JIT interpreter

This is a partial revert of #5097. It brings back a bunch of the code in WasmExecutor to set up and use V8 to run Wasm code. All of the code is copy-pasted. There are some small cleanups to move common code (like BDMalloc, structs, asserts) to a common area guarded by `if WITH_WABT || WITH_V8`.

Enabling V8 requires setting 2 CMake options:

- V8_INCLUDE_PATH
- V8_LIB_PATH

The first is a path to the v8 include folder, to find headers; the second is the monolithic v8 library. This is because it's pretty difficult to build v8, and there are various flags you can set. Comments around those options provide some instructions for building v8.

By default, we still use wabt for running Wasm code, but we can use V8 by setting WITH_WABT=OFF WITH_V8=ON. Maybe in the future, with more testing, we can flip this. Right now this requires a locally patched build of V8 due to https://crbug.com/v8/10461, but once that is resolved, the version of V8 that includes the fix will be fine.

Also enable a single test, block_transpose, to run on V8, with these results:

```
$ HL_JIT_TARGET=wasm-32-wasmrt-wasm_simd128 ./test/performance/performance_block_transpose
Dummy Func version: Scalar transpose bandwidth 3.45061e+08 byte/s.
Wrapper version: Scalar transpose bandwidth 3.38931e+08 byte/s.
Dummy Func version: Transpose vectorized in y bandwidth 6.74143e+08 byte/s.
Wrapper version: Transpose vectorized in y bandwidth 3.54331e+08 byte/s.
Dummy Func version: Transpose vectorized in x bandwidth 3.50053e+08 byte/s.
Wrapper version: Transpose vectorized in x bandwidth 6.73421e+08 byte/s.
Success!
```

For comparison, when targeting host:

```
$ ./test/performance/performance_block_transpose
Dummy Func version: Scalar transpose bandwidth 1.33689e+09 byte/s.
Wrapper version: Scalar transpose bandwidth 1.33583e+09 byte/s.
Dummy Func version: Transpose vectorized in y bandwidth 2.20278e+09 byte/s.
Wrapper version: Transpose vectorized in y bandwidth 1.45959e+09 byte/s.
Dummy Func version: Transpose vectorized in x bandwidth 1.45921e+09 byte/s.
Wrapper version: Transpose vectorized in x bandwidth 2.21746e+09 byte/s.
Success!
```

For comparison, running with wabt:

```
Dummy Func version: Scalar transpose bandwidth 828715 byte/s.
Wrapper version: Scalar transpose bandwidth 826204 byte/s.
Dummy Func version: Transpose vectorized in y bandwidth 1.12008e+06 byte/s.
Wrapper version: Transpose vectorized in y bandwidth 874958 byte/s.
Dummy Func version: Transpose vectorized in x bandwidth 879031 byte/s.
Wrapper version: Transpose vectorized in x bandwidth 1.10525e+06 byte/s.
Success!
```

* Add instructions to build V8
* Formatting
* More documentation
* Update README_webassembly.md
* Update README_webassembly.md
* Update WasmExecutor.cpp
* Update WasmExecutor.cpp
* Skip tests
* Update WasmExecutor.cpp
* Skip performance tests
* Update WasmExecutor.cpp
* Address review comments
* 9.8.147 -> 9.8.177

Co-authored-by: Ng Zhi An <zhin@google.com>
Revamp our WebAssembly support to bring it more up-to-date with current spec:
Fixes #4245