wasmparser: detect "malformed" cases in parser alone (without validator) #2134

keithw · 2025-04-10T06:06:55Z

This PR moves the checks required to detect "malformed" modules/components out of the validator and into the reader/parser. The biggest (textual) change is moving all of the operators reading logic into the OperatorsReader, which also now maintains a stack of blocks required to check that end opcodes are matched with the beginning of a block.
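The block-stack bookkeeping described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `FrameKind`, `Op`, and `check_nesting` are hypothetical names, the operator set is reduced to a handful of cases, and the error strings are made up.

```rust
/// Hypothetical frame kinds; not wasmparser's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FrameKind {
    Block,
    Loop,
    If,
    Else,
}

/// A reduced, hypothetical operator set, just enough to show the bookkeeping.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Block,
    Loop,
    If,
    Else,
    End,
    Nop,
}

/// Checks that every `end` closes an open frame and that the function body
/// stops exactly when the outermost frame is closed.
pub fn check_nesting(ops: &[Op]) -> Result<(), String> {
    // The function body itself counts as the outermost frame.
    let mut stack = vec![FrameKind::Block];
    for (i, op) in ops.iter().enumerate() {
        // Once the outermost frame is closed, no further operators may follow.
        if stack.is_empty() {
            return Err(format!("operator at index {i} follows the end of the function"));
        }
        match op {
            Op::Block => stack.push(FrameKind::Block),
            Op::Loop => stack.push(FrameKind::Loop),
            Op::If => stack.push(FrameKind::If),
            // `else` is only legal directly inside an `if` frame.
            Op::Else => match stack.last_mut() {
                Some(top) if *top == FrameKind::If => *top = FrameKind::Else,
                _ => return Err(format!("else at index {i} is not inside an if block")),
            },
            // The stack is non-empty here, so this always closes a frame.
            Op::End => {
                stack.pop();
            }
            Op::Nop => {}
        }
    }
    if stack.is_empty() {
        Ok(())
    } else {
        Err("control frames remain open at the end of the function".to_string())
    }
}
```

Under these assumptions, a body such as `[Op::End, Op::Nop]` is rejected because an operator follows the closing `end`, and `[Op::Else, Op::End]` is rejected because the `else` has no enclosing `if`.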

alexcrichton

Thank you again for putting this all together, I'm sure it took a lot of back-and-forth with the test suite!

I've got some questions on particulars below, but I'm also realizing now at this point after going through this that the changes to src/bin/wasm-tools/wast.rs aren't here so my comments below about "tests still pass" may be less relevant. I assume though that the changes to src/bin/wasm-tools/wast.rs are relatively small, do you think they'd be reasonable to fold into this PR as well?

Otherwise, at a high level, I do very much prefer this as a solution over #2123. I think it's worth pushing on this instead and trying to find a balance between validation/parsing and checks and such.

tests/cli/missing-features/missing-exceptions.wast

crates/wasm-shrink/tests/tests.rs

crates/wasmparser/src/readers/core/memories.rs

crates/wasmparser/src/readers/core/tables.rs

crates/wasmparser/src/readers/core/tags.rs

alexcrichton · 2025-04-10T15:25:02Z

crates/wasmparser/src/readers/core/operators.rs

+        match (data_index_allowed, self.data_index_occurred) {
+            (false, Some(pos)) => bail!(pos, "data count section required"),
+            _ => Ok(()),
+        }


Question on this: why was this moved out of the validator? From a binary-syntax point of view my read is that data.drop is allowed with any index, and it's only validation that requires that data.drop is preceded by a data count section. If I remove this check here the spec tests all look like they pass as well, so is the change here unnecessary?

And to clarify, the reason I ask is that threading this around is pretty cumbersome it looks like so I'd prefer to remove it if possible and leave it purely to validation to figure this out

I found this to be definitely the most cumbersome part of the binary format spec. :-( The tests that require this are at https://github.com/WebAssembly/spec/blob/05949f5/test/core/binary.wast#L492-L533 (which is probably clearer now that I pushed a wast.rs that runs the assert_malformed tests through only the parser -- you'll see the failure if you run it now). The textual part of the binary format spec for this is https://webassembly.github.io/spec/core/binary/modules.html#binary-module :

if $m^{?} \ne \epsilon \lor \texttt{dataidx}(\textit{code}^n) = \emptyset $

I assume the reason it works this way is that in the land of the spec, the validator runs on the abstract syntax (not the text or binary formats), and "the data count section disagreed with the data section about the number of data segments" or "the data count section was missing" aren't expressible in the abstract syntax.

I tried a few different ways to handle this (e.g. trying to keep state in the BinaryReader about whether a data index was currently allowed or not, which meant being careful to always reset that state when entering a new sub-parser which I wasn't confident I could get safely, etc.) but this seemed to be the least-bad way in the end. :-/
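For readers following along, the side condition quoted from the spec can be read as "either a data count section is present, or no instruction in the code section refers to a data index". A minimal sketch of that rule is below. `check_data_count_rule` and its parameters are hypothetical; only the message text "data count section required" comes from the diff hunk above.

```rust
/// Hypothetical helper modelling the binary-format side condition.
///
/// * `data_count`: contents of the data count section, if one was present.
/// * `first_data_index_use`: byte offset of the first instruction in the code
///   section that refers to a data index (e.g. `data.drop`), if any.
pub fn check_data_count_rule(
    data_count: Option<u32>,
    first_data_index_use: Option<usize>,
) -> Result<(), String> {
    match (data_count, first_data_index_use) {
        // No data count section, yet some instruction uses a data index.
        (None, Some(offset)) => Err(format!(
            "data count section required (offset {offset:#x})"
        )),
        // Either the section is present, or no data index is ever used.
        _ => Ok(()),
    }
}
```

The awkward part discussed in this thread is not the rule itself but where the second argument comes from: the parser has to remember, across the whole code section, whether any data index occurred.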

Oof ok I see what you mean, I was looking in the wrong spot for the binary validation...

For this issue specifically, what do you think about diverging from the spec on this? My impression is that the spec is viewed by most authors/engine implementors as a guideline but engines aren't expected to match 1:1 with all the nuances of the spec in cases like this (e.g. precisely where an error shows up, precisely an error message, etc). In that sense I'm hesitant to take design principles from one engine, the spec interpreter, and force that to guide design decisions of other engines (e.g. wasm-tools/wasmtime/etc). More-or-less, I don't think it's worth the baggage necessary to implement this one rule.

That being said, I also would not want to champion a change in the spec. I fear pushback along the lines of "well just don't do that in your engine and reject it in the validator". Given that, we nonetheless have to handle this test somehow with the change to assert_malformed. WDYT of special-casing this error message and, in the case of this exact match, asserting that the binary parsing is valid while the binary validation fails with the error message? That removes the need for all the infrastructure plumbing around whether a data index is allowed and keeps the "hack" somewhat scoped to just the wast.rs file.

Fine with me -- just done in 5be6fcd

I guess, now that I've done this, we probably could use the same trick to move the "shared" feature checks out of the reader too if you want.
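The special-casing agreed on above might look something like the sketch below. It is not the code from 5be6fcd: `PERMISSIVE_MESSAGES` and `check_assert_malformed` are hypothetical names, and parse and validation results are modelled as plain `Result<(), String>` values rather than real wasmparser calls.

```rust
/// Hypothetical allowlist of `assert_malformed` messages that a permissive
/// test runner may accept as validation failures instead of parse failures.
const PERMISSIVE_MESSAGES: &[&str] = &["data count section required"];

/// Decides whether an `assert_malformed` directive passes.
///
/// `parse` and `validate` stand in for the outcomes of running the parser
/// alone and the validator, respectively.
pub fn check_assert_malformed(
    expected: &str,
    parse: Result<(), String>,
    validate: Result<(), String>,
    permissive: bool,
) -> Result<(), String> {
    // Normal case: the parser alone rejects the module.
    if parse.is_err() {
        return Ok(());
    }
    // Special case: parsing succeeded, but for an allowlisted message we
    // accept a validation failure carrying that message instead.
    let allowed = permissive && PERMISSIVE_MESSAGES.contains(&expected);
    match validate {
        Err(msg) if allowed && msg.contains(expected) => Ok(()),
        _ => Err(format!(
            "expected malformed module ({expected}), but parsing succeeded"
        )),
    }
}
```

Keeping the allowlist in the test runner means the parser needs no extra state for this rule, at the cost of the spec test no longer checking exactly where the error is reported.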

FWIW, there is no need for implementations to distinguish "malformed" and "invalid" at all — and most engines fuse decoding and validation anyway. For all practical purposes, they have the exact same effect anyway. (Well, with one caveat, which is lazy function validation, but I don't think any current engine does that anymore.)

Technically, the two cases differentiate illegal programs that are still inter-convertible between binary and text format (but are semantically invalid) from those that do not even have a representation in the other (so are syntactically malformed). So, @keithw is correct about why datacount is handled the way it is. Not that I like that much either, but datacount is a terrible hack either way.

crates/wasmparser/src/validator/core.rs

crates/wasmprinter/src/lib.rs

This moves the bulk of the expression-reading logic into OperatorsReader (out of BinaryReader).

keithw · 2025-04-10T21:36:15Z

> I've got some questions on particulars below, but I'm also realizing now at this point after going through this that the changes to src/bin/wasm-tools/wast.rs aren't here so my comments below about "tests still pass" may be less relevant. I assume though that the changes to src/bin/wasm-tools/wast.rs are relatively small, do you think they'd be reasonable to fold into this PR as well?

Yes, done! I just added the parts that run the assert_malformed tests through the parser alone -- will hold back the roundtripping of invalid modules for a later PR (this will require some more changes to wasmprinter/wasm-encoder).

…o validation)

Inspired by changes in bytecodealliance/wasm-tools#2134 and intended to reflect how the maximum page size is an artifact of validation, not binary parsing.

Even if the page size matches the default page size

No functional change, but helps keep things localized to `check_*` functions.

alexcrichton

I've pushed up a few commits to this myself, notably taking you up on your suggestion to use permissive assertions for the threads/shared-everything-threads related flags, as I think that looks pretty good in the end. I also made a few other misc changes here and there.

One final comment below about the default input to the CLI tooling (why is the parse needed?), but otherwise this looks good to me 👍

alexcrichton · 2025-04-11T14:53:32Z

src/lib.rs

    pub fn parse_input_wasm(&self) -> Result<Vec<u8>> {
-        self.input.parse_wasm()
+        let ret = self.get_input_wasm()?;
+        parse_binary_wasm(wasmparser::Parser::new(0), &ret)?;
+        Ok(ret)
+    }
+
+    pub fn get_input_wasm(&self) -> Result<Vec<u8>> {
+        self.input.get_binary_wasm()
    }


I'm a bit surprised by this. Would it be possible to only have get_input_wasm? Most subcommands should already end up parsing the module for their purposes anyway, and if sections are skipped entirely anywhere, it seems reasonable to plumb through possibly-invalid things in that case.

crates/wasmparser/src/readers/core/memories.rs

alexcrichton · 2025-04-11T15:07:09Z

ci/generate-spec-tests.rs

+
+    // Allow certain assert_malformed tests to be interpreted as assert_invalid
+    if src.iter().any(|p| p == "binary.wast") {
+        contents.push_str(";;      --assert permissive \\\n");
+    }
+


Ooh I like this, nice idea 👍

crates/wasmparser/src/validator/core.rs

alexcrichton · 2025-04-11T16:08:19Z

Slightly more substantial than my other changes, but @keithw I sent keithw#1 as a pr-to-this-pr (also happy to land that as a follow-up) and I'm curious how you feel about that

alexcrichton · 2025-04-15T21:09:42Z

I'm looking to do a release in the next day or so, so I'm going to go ahead and merge this. Thanks again for your work here @keithw!

keithw · 2025-04-15T21:15:45Z

Okay! I think we may have a challenge with these extra commits trying to roundtrip everything that is assert_invalid -- the problem is that I think some of these error messages (like "integer too large") really are caught by the wasmparser, so turning them into AssertInvalid then fails if we try to assert that everything that's AssertInvalid can be parsed and roundtripped through wasmprinter. But I will deal with that in the next PR...

This commit relaxes a check added in bytecodealliance#2134 which maintains a stack of frame kinds in the operators reader, in addition to the validator. The goal of bytecodealliance#2134 was to ensure that spec-wise-syntactically-invalid-modules are caught in the parser without the need of the validator, but investigation in bytecodealliance#2180 has shown that this is a source of at least some of a performance regression. The change here is to relax the check to still be able to pass spec tests while making such infrastructure cheaper. The reader now maintains just a small `depth: u32` counter instead of a stack of kinds. This means that the reader can still catch invalid modules such as instructions-after-`end`, but the validator is required to handle situations such as `else` outside of an `if` block. This required some adjustments to tests as well as some workarounds for the upstream spec tests that assert legacy exception-handling instructions are malformed, not invalid.
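The relaxed check described in this commit message can be sketched as below. Only the `depth: u32` counter is taken from the text above; `Op` and `check_depth` are hypothetical names and the operator set is reduced for illustration.

```rust
/// A reduced, hypothetical operator set.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Block,
    Loop,
    If,
    Else,
    End,
    Nop,
}

/// Tracks only the nesting depth instead of a stack of frame kinds.
pub fn check_depth(ops: &[Op]) -> Result<(), String> {
    // The function body itself is the outermost frame.
    let mut depth: u32 = 1;
    for (i, op) in ops.iter().enumerate() {
        // Anything after the outermost `end` is still caught here.
        if depth == 0 {
            return Err(format!("operator at index {i} follows the end of the function"));
        }
        match op {
            Op::Block | Op::Loop | Op::If => depth += 1,
            // `depth` is at least 1 here, so this cannot underflow.
            Op::End => depth -= 1,
            // A misplaced `else` is not detected; that is left to the validator.
            Op::Else | Op::Nop => {}
        }
    }
    if depth == 0 {
        Ok(())
    } else {
        Err("control frames remain open at the end of the function".to_string())
    }
}
```

In this sketch `[Op::End, Op::Nop]` is still rejected, while `[Op::Else, Op::End]` passes, which matches the trade-off stated above: a counter is cheaper than a stack, but `else` outside of an `if` has to be reported by the validator.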

keithw force-pushed the parser-detects-all-malformed branch 7 times, most recently from 5ec28cf to 7e3c76e Compare April 10, 2025 08:52

alexcrichton reviewed Apr 10, 2025

alexcrichton mentioned this pull request Apr 10, 2025

Return an error earlier on instructions-after-function-end #2123

Closed

keithw added 2 commits April 10, 2025 12:49

wasmparser: detect "malformed" cases in parser alone (without validator)

1ca26fa

This moves the bulk of the expression-reading logic into OperatorsReader (out of BinaryReader).

Remove some now-redundant checks from the validator

865e0c3

keithw force-pushed the parser-detects-all-malformed branch 2 times, most recently from 5ee4fe4 to d6c5d43 Compare April 10, 2025 21:17

keithw force-pushed the parser-detects-all-malformed branch from 22cd6fa to b22c643 Compare April 10, 2025 23:12

keithw added 2 commits April 10, 2025 16:34

wast subcommand: run assert_malformed tests through parser alone

cc8d398

Don't enforce binary "data count section required" in parser (defer t…

5be6fcd

…o validation)

keithw force-pushed the parser-detects-all-malformed branch from b22c643 to 5be6fcd Compare April 10, 2025 23:38

alexcrichton mentioned this pull request Apr 11, 2025

Change an assert_malformed to assert_invalid WebAssembly/custom-page-sizes#43

Closed

alexcrichton added 5 commits April 11, 2025 08:28

Add some more errors to being permissive

769b55c

Drop --assert permissive from a test we have locally

f27d419

Add test that passive/declare segments are gated

2e8ba67

Disallow a custom page size if feature isn't enabled

3aab396

Even if the page size matches the default page size

Move feature check for tags

68ec57d

No functional change, but helps keep things localized to `check_*` functions.

alexcrichton approved these changes Apr 11, 2025

alexcrichton added this pull request to the merge queue Apr 15, 2025

Merged via the queue into bytecodealliance:main with commit 0354dde Apr 15, 2025
32 checks passed

keithw mentioned this pull request Apr 15, 2025

cli: for subcommands other than parse, don't guarantee successful parse #2147

Merged

Robbepop mentioned this pull request May 5, 2025

Fix var_u{32,64} parsing in memories and tables #2178

Closed

alexcrichton mentioned this pull request May 22, 2025

Relax the block check in OperatorsReader<'_> #2202

Closed

keithw mentioned this pull request May 22, 2025

wasmparser OperatorsReader: remove unused data_index_occurred #2203

Merged

keithw mentioned this pull request Jun 9, 2025

wasmparser: improve visit_operator performance #2228

Merged

keithw mentioned this pull request Jun 17, 2025

wasmparser: 15-30% performance regressions from v228 -> v229 #2180

Open

Robbepop mentioned this pull request Jun 18, 2025

Optimize OperatorsReader control stack #2241

Merged

keithw mentioned this pull request Aug 31, 2025

--no-check produces malformed module WebAssembly/wabt#2629

Open
