wasm-mutate: Add a mutator to modify data segments #447

alexcrichton · 2022-01-13T21:03:01Z

In an effort to minimize modules this commit adds a mutator which will
modify data segments to either add or remove bytes (although only
removal is used when reducing). The main meat of this commit is
refactoring out the translation from wasmparser types to wasm-encoder
types into a shared module for both the RemoveItem mutator and this
new mutator to use.

fitzgen

r=me with one thing below addressed

fitzgen · 2022-01-13T21:16:13Z

crates/wasm-mutate/src/mutators/modify_data.rs

+        // If reduction is configured then we never add data, otherwise
+        // arbitrarily choose to add or remove data.
+        if config.reduce || config.rng().gen() {
+            if data.len() > 0 {
+                // Remove a random subslice of data, if there's actually some
+                // data.
+                let start = config.rng().gen_range(0, data.len());
+                let end = config.rng().gen_range(start, data.len());
+                data[start..end].to_vec()
+            } else {
+                Vec::new()
+            }
+        } else {
+            // Insert some random bytes.
+            let start = config.rng().gen_range(0, data.len());
+            let len = config.rng().gen_range(0, 100);
+            data[0..start]
+                .iter()
+                .copied()
+                .chain(
+                    config
+                        .rng()
+                        .sample_iter(rand::distributions::Standard)
+                        .take(len),
+                )
+                .chain(data[start..].iter().copied())
+                .collect()
+        }
+    }


I think we should (either in a third logic branch or maybe replacing both branches but configuring the max sizes appropriately) add a mode where we call the configured mutate-this-raw-data hook which we can make call libfuzzer's internal mutators directly when we are fuzzing.

wasm-tools/crates/wasm-mutate/src/lib.rs

Lines 233 to 246 in 5d4f007

/// Set a custom raw mutation function.

///

/// This is used when we need some underlying raw bytes, for example when

/// mutating the contents of a data segment.

///

/// You can override this to use `libFuzzer`'s `LLVMFuzzerMutate` function

/// to get raw bytes from `libFuzzer`, for example.

pub fn raw_mutate_func(

&mut self,

raw_mutate_func: Option<Arc<dyn Fn(&mut Vec<u8>) -> Result<()>>>,

) -> &mut Self {

self.raw_mutate_func = raw_mutate_func;

self

}

https://docs.rs/libfuzzer-sys/latest/libfuzzer_sys/fn.fuzzer_mutate.html

To make sure I understand, you're imagining this saw raw_mutate_func-style hook being configured in WasmMutate? Something like an optional closure where, if present, is called here?

Yeah I think ideally we would write our own, default raw_mutate_func implementation and then just always call raw_mutate_func here, instead of maintaining N different code paths that do shrinking vs not shrinking etc. As long as raw_mutate_func abides by the size constraints passed to it, we shouldn't need separate branches for adding vs deleting vs editing.

How would that work though with the seed? We wouldn't pass in the seed or the Unstructured so what would the consumer used as the basis for mutating the input data?

Also if you're imagining that this is always an option should the CLI be the only one that implements this? By default would this mutator be disabled unless the callback were configured?

(sorry I'm not really sure how you're envisioning this working)

Hm yeah we should probably pass a seed into this callback.

We generally aren't using Unstructured here, not sure what you're getting at here.

In practice, we will hook raw_mutate_func up to https://docs.rs/libfuzzer-sys/latest/libfuzzer_sys/fn.fuzzer_mutate.html

If the raw_mutate_func is not configured, then we should use a default, internal implementation of the raw_mutate_func callback, similar to what you've done here but not hard-coded to be only for data segments.

impl WasmMutate { fn get_raw_mutate_func(&self) -> &dyn Fn(&mut Vec<u8>, ...) { self.raw_mutate_func.as_ref().unwrap_or(&default_raw_mutate_func) } } fn default_raw_mutate_func( data: &mut Vec<u8>, // probably need to enrichen the `raw_mutate_func` signature with // a seed and info about whether we are only shrinking... ) { // ... the stuff you are doing for data segments right now ... } // ... // Inside the data segment mutator. (config.raw_mutate_func())(&mut my_data_segment);

Does that make sense?

Also if you're imagining that this is always an option should the CLI be the only one that implements this? By default would this mutator be disabled unless the callback were configured?

I would say the other way around: the CLI shouldn't expose this functionality because what different kinds of implementations can it have? We're probably not going to pipe out to another program to mutate these little bits of data, so there isn't really much to configure other than maybe the source of entropy. Not really useful here.

However, when using wasm-mutate as a library inside a fuzz target, we definitely want to hook this up to https://docs.rs/libfuzzer-sys/latest/libfuzzer_sys/fn.fuzzer_mutate.html so that we can let libfuzzer use its dictionary and all of its builtin mutation schemes when mutating these data segments or export names or what have you.

Sorry I was mixing up sources of randomness and got confused with Unstructured vs just a plain old Rng here.

Also thanks for explaining! I've pushed up something which I think roughly does what you're thinking, but I think it'd be good to double-check as well.

In an effort to minimize modules this commit adds a mutator which will modify data segments to either add or remove bytes (although only removal is used when reducing). The main meat of this commit is refactoring out the translation from `wasmparser` types to `wasm-encoder` types into a shared module for both the `RemoveItem` mutator and this new mutator to use.

Also make the rename export mutator a bit more robust

fitzgen

r=me with max_size parameter

thanks for your patience on these back and forths!

fitzgen · 2022-01-21T18:37:47Z

crates/wasm-mutate/src/lib.rs

+    fn raw_mutate(&mut self, data: &mut Vec<u8>) -> Result<()> {
+        // If a raw mutation function is configured then that's prioritized.
+        if let Some(mutate) = &self.raw_mutate_func {
+            return mutate(data);


I think we need to pass a max_size parameter through to the raw mutation function. (a) libfuzzer_sys::fuzzer_mutate requires such a parameter, and (b) it lets us set max_size == data.len() - 1 when we are configured to only do size-reducing mutations. When we aren't shrinking, then we can either pass a max_size of data.capacity() or data.len() * 2 or something like that.

Makes sense! I threaded this through from the configuration of the pass itself, since the renaming of exports was already being configured with a maximum size.

crates/wasm-mutate/src/lib.rs

This fixes an issue from bytecodealliance#447 where the renaming export pass was mistakenly generating a duplicate export name because I forgot to add a check. The check is added here, however!

* wasm-mutate: Fix generating duplicate export names This fixes an issue from #447 where the renaming export pass was mistakenly generating a duplicate export name because I forgot to add a check. The check is added here, however! * Prevent an infinite loop * Consume fuel when generating random names. * Also allow generating the same name as before, notably the empty name.

This encoding change is necessary due to recent additions to the component model; see [bytecodealliance#447]. [bytecodealliance#447]: WebAssembly/component-model#447

[bytecodealliance#447] tries to make the built-in naming a bit more consistent; this change propagates that here. [bytecodealliance#447]: WebAssembly/component-model#447

This encoding change is necessary due to recent additions to the component model; see [bytecodealliance#447]. [bytecodealliance#447]: WebAssembly/component-model#447

[bytecodealliance#447] tries to make the built-in naming a bit more consistent; this change propagates that here. [bytecodealliance#447]: WebAssembly/component-model#447

This encoding change is necessary due to recent additions to the component model; see [bytecodealliance#447]. [bytecodealliance#447]: WebAssembly/component-model#447

[bytecodealliance#447] tries to make the built-in naming a bit more consistent; this change propagates that here. [bytecodealliance#447]: WebAssembly/component-model#447

This encoding change is necessary due to recent additions to the component model; see [bytecodealliance#447]. [bytecodealliance#447]: WebAssembly/component-model#447

* threads: add `thread.spawn_indirect` As discussed in [#89], this adds support for a new intrinsic, `thread.spawn_indirect`. This new operation would allow spawning a shared function stored in a table via a table index. This leaves some future work undone: - `thread.spawn` could/should be renamed to `thread.spawn_ref` - `thread.spawn_indirect` could/should take the encoding byte from `thread.hw_concurrency`--swap `0x07` for `0x06` - importantly, `thread.spawn_indirect` should gain a field indicating which type to expect in the indirect table; due to current limitations in `wasm-tools`, the locations to check once this is possible are marked with `TODO: spawn indirect types`. [#89]: WebAssembly/shared-everything-threads#89 * threads: rename `thread.spawn` to `thread.spawn_ref` [#447] tries to make the built-in naming a bit more consistent; this change propagates that here. [#447]: WebAssembly/component-model#447 * threads: move `thread.*` canonical opcodes to `0x40+` This encoding change is necessary due to recent additions to the component model; see [#447]. [#447]: WebAssembly/component-model#447 * threads: add function types to `thread.spawn_indirect` Initially I implemented `thread.spawn_indirect` without the ability to check the type of the function to spawn out of an abundance of caution (see the implementation issues described in [#89]). In the process of writing out the specification, we convinced ourselves that these problems should not apply to `thread.spawn_indirect`. This change adds the function type index necessary for doing some extra validation of `thread.spawn_indirect` and adds some tests related to this. One unimplemented TODO is what to do about shared tables: technically, the table used by a `thread.spawn_indirect` should be shared but we have so far prevented this in the component model; this can be resolved later, though. [#89]: WebAssembly/shared-everything-threads#89 * review: simplify WAST parsing

fitzgen approved these changes Jan 13, 2022

View reviewed changes

alexcrichton added 2 commits January 21, 2022 07:46

Use raw_mutate_func for data mutations

06058e8

alexcrichton force-pushed the modify-data branch from 4329cee to 06058e8 Compare January 21, 2022 16:11

Fix tests

85f177a

Also make the rename export mutator a bit more robust

fitzgen approved these changes Jan 21, 2022

View reviewed changes

Factor in a max size for data mutations

72c1771

alexcrichton merged commit a0bb80d into bytecodealliance:main Jan 21, 2022

alexcrichton deleted the modify-data branch January 21, 2022 20:58

alexcrichton mentioned this pull request Jan 24, 2022

wasm-mutate: Fix generating duplicate export names #455

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

wasm-mutate: Add a mutator to modify data segments #447

wasm-mutate: Add a mutator to modify data segments #447

Uh oh!

alexcrichton commented Jan 13, 2022

Uh oh!

fitzgen left a comment

Uh oh!

fitzgen Jan 13, 2022

Uh oh!

alexcrichton Jan 18, 2022

Uh oh!

fitzgen Jan 18, 2022

Uh oh!

alexcrichton Jan 19, 2022

Uh oh!

fitzgen Jan 19, 2022

Uh oh!

alexcrichton Jan 21, 2022

Uh oh!

fitzgen left a comment

Uh oh!

fitzgen Jan 21, 2022

Uh oh!

alexcrichton Jan 21, 2022

Uh oh!

Uh oh!

Uh oh!

	/// Set a custom raw mutation function.
	///
	/// This is used when we need some underlying raw bytes, for example when
	/// mutating the contents of a data segment.
	///
	/// You can override this to use `libFuzzer`'s `LLVMFuzzerMutate` function
	/// to get raw bytes from `libFuzzer`, for example.
	pub fn raw_mutate_func(
	&mut self,
	raw_mutate_func: Option<Arc<dyn Fn(&mut Vec<u8>) -> Result<()>>>,
	) -> &mut Self {
	self.raw_mutate_func = raw_mutate_func;
	self
	}

wasm-mutate: Add a mutator to modify data segments #447

wasm-mutate: Add a mutator to modify data segments #447

Uh oh!

Conversation

alexcrichton commented Jan 13, 2022

Uh oh!

fitzgen left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fitzgen left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!