-
Notifications
You must be signed in to change notification settings - Fork 297
wasmparser: detect "malformed" cases in parser alone (without validator) #2134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wasmparser: detect "malformed" cases in parser alone (without validator) #2134
Conversation
5ec28cf
to
7e3c76e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you again for putting this all together, I'm sure it took a lot of back-and-forth with the test suite!
I've got some questions on particulars below, but I'm also realizing now at this point after going through this that the changes to src/bin/wasm-tools/wast.rs
aren't here so my comments below about "tests still pass" may be less relevant. I assume though that the changes to src/bin/wasm-tools/wast.rs
are relatively small, do you think they'd be reasonable to fold into this PR as well?
Otherwise, at a high-level, I do very much prefer this as a solution than #2123. I think it's worth pushing on this instead and trying to find a balance between validation/parsing and checks and such.
match (data_index_allowed, self.data_index_occurred) { | ||
(false, Some(pos)) => bail!(pos, "data count section required"), | ||
_ => Ok(()), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question on this: why was this moved out of the validator? From a binary-syntax point of view my read is that data.drop
is allowed with any index, and it's only validation that requires that data.drop
is preceded by a data count section. If I remove this check here the spec tests all look like they pass as well, so is the change here unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And to clarify, the reason I ask is that threading this around is pretty cumbersome it looks like so I'd prefer to remove it if possible and leave it purely to validation to figure this out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this to be definitely the most cumbersome part of the binary format spec. :-( The tests that require this are at https://github.com/WebAssembly/spec/blob/05949f5/test/core/binary.wast#L492-L533 (which is probably clearer now that I pushed a wast.rs that runs the assert_malformed
tests through only the parser -- you'll see the failure if you run it now). The textual part of the binary format spec for this is https://webassembly.github.io/spec/core/binary/modules.html#binary-module :
if
$m^{?} \ne \epsilon \lor \texttt{dataidx}(\textit{code}^n) = \emptyset $
I assume the reason it works this way is that in the land of the spec, the validator runs on the abstract syntax (not the text or binary formats), and "the data count section disagreed with the data section about the number of data segments" or "the data count section was missing" aren't expressible in the abstract syntax.
I tried a few different ways to handle this (e.g. trying to keep state in the BinaryReader about whether a data index was currently allowed or not, which meant being careful to always reset that state when entering a new sub-parser which I wasn't confident I could get safely, etc.) but this seemed to be the least-bad way in the end. :-/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof ok I see what you mean, I was looking in the wrong spot for the binary validation...
For this issue specifically, what do you think about diverging from the spec on this? My impression is that the spec is viewed by most authors/engine implementors as a guideline but engines aren't expected to match 1:1 with all the nuances of the spec in cases like this (e.g. precisely where an error shows up, precisely an error message, etc). In that sense I'm hesitant to take design principles from one engine, the spec interpreter, and force that to guide design decisions of other engines (e.g. wasm-tools/wasmtime/etc). More-or-less, I don't think it's worth the baggage necessary to implement this one rule.
That being said I also at the same time would not want to champion a change in the spec. I fear pushback along the lines of "well just don't do that in your engine and reject it in the validator", or resistance along the lines of that. Given that we nonetheless have to handle this test somehow with the change to assert_malformed
. WDYT of special-casing this error message and, in the case of this exact match, assert that the binary parsing is valid while the binary validation fails with the error message? That removes the need for all the infrastructure plumbing around whether a data index is allowed and keeps the "hack" somewhat scoped to just wast.rs
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine with me -- just done in 5be6fcd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess, now that I've done this, we probably could use the same trick to move the "shared" feature checks out of the reader too if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, there is no need for implementations to distinguish "malformed" and "invalid" at all — and most engines fuse decoding and validation anyway. For all practical purposes, they have the exact same effect anyway. (Well, with one caveat, which is lazy function validation, but I don't think any current engine does that anymore.)
Technically, the two cases differentiate illegal programs that are still inter-convertible between binary and text format (but are semantically invalid) from those that do not even have a representation in the other (so are syntactically malformed). So, @keithw is correct about why datacount is handled the way it is. Not that I like that much either, but datacount is a terrible hack either way.
This moves the bulk of the expression-reading logic into OperatorsReader (out of BinaryReader).
5ee4fe4
to
d6c5d43
Compare
Yes, done! I just added the parts that run the |
22cd6fa
to
b22c643
Compare
b22c643
to
5be6fcd
Compare
Inspired by changes in bytecodealliance/wasm-tools#2134 and intended to reflect how the maximum page size is an artifact of validation, not binary parsing.
Event if the page size matches the default page size
No functional change, but helps keeps things localized to `check_*` functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed up a few commits to this myself, notably taking you up on your suggestion to use permissive assertions for the threads/shared-everything-threads related flags as that I think looks pretty good in the end. I did a few other misc changes here and there as well.
One final comment about the default input to the CLI tooling, as to why the parse is needed, but otherwise looks good to me 👍
pub fn parse_input_wasm(&self) -> Result<Vec<u8>> { | ||
self.input.parse_wasm() | ||
let ret = self.get_input_wasm()?; | ||
parse_binary_wasm(wasmparser::Parser::new(0), &ret)?; | ||
Ok(ret) | ||
} | ||
|
||
pub fn get_input_wasm(&self) -> Result<Vec<u8>> { | ||
self.input.get_binary_wasm() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit surprised by this, would it be possible to only have get_input_wasm
? Most subcommands should already end up parsing the module for their puposes anyway and if sections are skipped entirely anywhere that seems reasonable to plumb through possibly-invalid things in that case
ci/generate-spec-tests.rs
Outdated
|
||
// Allow certain assert_malformed tests to be interpreted as assert_invalid | ||
if src.iter().any(|p| p == "binary.wast") { | ||
contents.push_str(";; --assert permissive \\\n"); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh I like this, nice idea 👍
I'm looking to do a release in the next day or so, so I'm going to go ahead and merge this. Thanks again for your work here @keithw! |
Okay! I think we may have a challenge with these extra commits trying to roundtrip everything that is assert_invalid -- the problem is that I think some of these error messages (like "integer too large") really are caught by the wasmparser, so turning them into AssertInvalid then fails if we try to assert that everything that's AssertInvalid can be parsed and roundtripped through wasmprinter. But I will deal with that in the next PR... |
This commit relaxes a check added in bytecodealliance#2134 which maintains a stack of frame kinds in the operators reader, in addition to the validator. The goal of bytecodealliance#2134 was to ensure that spec-wise-syntactically-invalid-modules are caught in the parser without the need of the validator, but investigation in bytecodealliance#2180 has shown that this is a source of at least some of a performance regression. The change here is to relax the check to still be able to pass spec tests while making such infrastructure cheaper. The reader now maintains just a small `depth: u32` counter instead of a stack of kinds. This means that the reader can still catch invalid modules such as instructions-after-`end`, but the validator is required to handle situations such as `else` outside of an `if` block. This required some adjustments to tests as well as some workarounds for the upstream spec tests that assert legacy exception-handling instructions are malformed, not invalid.
This PR moves the checks required to detect "malformed" modules/components out of the validator and into the reader/parser. The biggest (textual) change is moving all of the operators reading logic into the OperatorsReader, which also now maintains a stack of blocks required to check that
end
opcodes are matched with the beginning of a block.