syntastica:0.1.0 #143

Closed

RubixDev
Contributor

I am submitting

  • a new package
  • an update for a package

I have read and followed the submission guidelines and, in particular, I

  • selected a name in conformance with the guidelines
  • added a typst.toml file with all required keys
  • added a README.md with documentation for my package
  • have chosen a license and added a LICENSE file or linked one in my README.md
  • ensured that my submission does not infringe upon the rights of a third party
  • tested my package locally on my system and it worked
  • named this PR as name:version of the submitted package
  • agree that my package will not be removed without good reason

Description:

Syntax highlighting of code blocks using tree-sitter. The package makes use of the syntastica Rust project and the new Wasm plugins. It generally provides better results and supports more/other languages than the built-in syntect highlighting. Tree-sitter-based highlighting was already requested by others (typst/typst#967), but declined for good reasons:

Warning

This package is both slow and big. The included Wasm binary is currently 30+ MB in size, and compilation time goes up into LaTeX territories (having it run in Wasm doesn't help). I would understand if that causes this package to not be accepted here.

@laurmaedje
Member

The hugeness in combination with the slowness is certainly not ideal. I assume there is no way to make this smaller somehow? Also, do you know which part is causing the slowdown? Is it just that wasmi is too slow in general or does something particularly computationally expensive happen?

@RubixDev
Contributor Author

RubixDev commented Sep 22, 2023

I haven't investigated exactly what makes it this big, but I strongly assume it's mostly the accumulation of all the tree-sitter parsers and queries, and I don't think there is much to do against that. It may be possible to have one package per supported language, plus one core package with the main logic, but that's much less ergonomic for users and I can't think of a way to implement it unless tree-sitter/tree-sitter#1864 finally gets done.

As for the slow speed, by far the longest time is spent on compiling the queries, which is an issue with tree-sitter itself and one that I already investigated outside of this Typst plugin. I am mainly waiting for tree-sitter to support some kind of pre-compilation of queries, which is in the works and will drastically improve the startup time (see tree-sitter/tree-sitter#2374 (comment) and tree-sitter/tree-sitter#2594). Even without that, though, what takes 2 seconds running natively on my machine takes 2+ minutes running through wasmi in typst (where one of my six CPU cores is at 100% usage). I don't know if it's because wasmi is slow and other runtimes may be faster (I could maybe do some investigating tomorrow), or if the Wasm compiled code is just less efficient, or whatever else might cause this.

Note that the README also mentions the poor performance and advises to only enable syntastica for final release builds, and to continue to use syntect highlighting during development.

If you want to compare for yourself, you can compile the all_languages example from the syntastica-typst repo (2m 23s on my machine) and run the all_languages example of the main syntastica repo (9s in debug mode, 2.8s in release mode on my machine). Both do basically the same thing, although the latter actually does even more because some parsers cannot easily be compiled to wasm32-unknown-unknown and therefore aren't included in the Typst plugin.

@laurmaedje
Member

I've thought a bit about this and also discussed it with a few people on Discord. We've come to the conclusion that, at least for the time being, this package is too big to merge into the official package repository. It's not a good user experience and will bloat the repository (especially if more versions with slightly different binaries would be added over time). Some limit needs to exist and with the current size, the package would also be too big for crates.io or the VS Code marketplace (for comparison). We'll have to see down the road how the plugin situation evolves. For now, I would suggest adding the package to Awesome Typst so that people can download it and try it locally.

@laurmaedje laurmaedje closed this Sep 22, 2023
@RubixDev
Contributor Author

I agree. I didn't really expect this to be merged, but thanks for considering. As you suggested, I opened a PR on awesome-typst: qjcg/awesome-typst#104.

@RubixDev
Contributor Author

RubixDev commented Sep 23, 2023

And I did a quick speed comparison between wasmi and wasmtime with these results:

wasmi load module:      112.914601ms
wasmi instantiate:      17.700499ms
wasmi func call:        27.992434881s
wasmi get output:       40ns
wasmi TOTAL:            28.137709013s

wasmtime load module:   1.148834577s
wasmtime instantiate:   11.056743ms
wasmtime func call:     753.88696ms
wasmtime get output:    60ns
wasmtime TOTAL:         1.919890563s

I assume the original reason wasmi was chosen over wasmtime and wasmer was binary size? Maybe it should be reconsidered 🤷
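
To put the numbers above in proportion, here is a minimal sketch that turns them into slowdown factors. The `slowdown` helper is a hypothetical name introduced only for this illustration; the durations are copied from the measurements above and are a single run on one machine, so the ratios are indicative at best.

```rust
use std::time::Duration;

/// How many times longer `slow` took than `fast` (hypothetical helper).
fn slowdown(slow: Duration, fast: Duration) -> f64 {
    slow.as_secs_f64() / fast.as_secs_f64()
}

fn main() {
    // Durations copied from the single benchmark run reported above.
    let wasmi_call = Duration::from_nanos(27_992_434_881);
    let wasmtime_call = Duration::from_nanos(753_886_960);
    let wasmi_total = Duration::from_nanos(28_137_709_013);
    let wasmtime_total = Duration::from_nanos(1_919_890_563);

    println!("func call: {:.1}x", slowdown(wasmi_call, wasmtime_call));
    println!("total:     {:.1}x", slowdown(wasmi_total, wasmtime_total));
}
```

With these inputs the function call alone is roughly 37 times slower under wasmi, while the total is only about 15 times slower, because wasmtime spends over a second loading (compiling) the module up front.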

The full Rust code I used (not optimized)
use std::{error::Error, time::Instant};

const WASM: &[u8] = include_bytes!("../syntastica_typst.wasm");
const FUNC_NAME: &str = "highlight";
const ARGS: &[&[u8]] = &[b"fn main() {}", b"rust", b"one::dark"];

fn main() -> Result<(), Box<dyn Error>> {
    let start = Instant::now();
    run_wasmi::run()?;
    println!("wasmi TOTAL: {:?}", start.elapsed());

    let start = Instant::now();
    run_wasmtime::run()?;
    println!("wasmtime TOTAL: {:?}", start.elapsed());

    Ok(())
}

mod run_wasmi {
    use std::error::Error;

    use wasmi::{AsContextMut, Caller, Engine, Linker, Module, Store};

    use crate::{ARGS, FUNC_NAME, WASM};

    #[derive(Default)]
    struct StoreData {
        args: Vec<Vec<u8>>,
        output: Vec<u8>,
    }

    pub fn run() -> Result<(), Box<dyn Error>> {
        let start = std::time::Instant::now();
        let engine = Engine::default();
        let module = Module::new(&engine, WASM)?;

        let mut linker = Linker::new(&engine);
        linker.func_wrap(
            "typst_env",
            "wasm_minimal_protocol_send_result_to_host",
            wasm_minimal_protocol_send_result_to_host,
        )?;
        linker.func_wrap(
            "typst_env",
            "wasm_minimal_protocol_write_args_to_buffer",
            wasm_minimal_protocol_write_args_to_buffer,
        )?;

        println!("wasmi load module: {:?}", start.elapsed());
        let start = std::time::Instant::now();
        let mut store = Store::new(&engine, StoreData::default());
        let instance = linker
            .instantiate(&mut store, &module)
            .and_then(|pre_instance| pre_instance.start(&mut store))
            .map_err(|e| format!("{e}"))?;
        println!("wasmi instantiate: {:?}", start.elapsed());

        // Ensure that the plugin exports its memory.
        if !matches!(
            instance.get_export(&store, "memory"),
            Some(wasmi::Extern::Memory(_))
        ) {
            Err("plugin does not export its memory")?;
        }

        // Find the function with the given name.
        let func = instance
            .get_export(&store, FUNC_NAME)
            .and_then(|e| e.into_func())
            .ok_or_else(|| format!("plugin does not contain a function called {FUNC_NAME}"))?;

        // Collect the lengths of the argument buffers.
        let lengths = ARGS
            .iter()
            .map(|a| wasmi::Value::I32(a.len() as i32))
            .collect::<Vec<_>>();

        // Store the input data.
        store.data_mut().args = ARGS.iter().map(|arg| arg.to_vec()).collect();

        // Call the function.
        let start = std::time::Instant::now();
        let mut code = wasmi::Value::I32(-1);
        func.call(
            store.as_context_mut(),
            &lengths,
            std::slice::from_mut(&mut code),
        )
        .map_err(|err| format!("plugin panicked: {err}"))?;
        println!("wasmi func call: {:?}", start.elapsed());

        let start = std::time::Instant::now();
        // Extract the returned data.
        let output = std::mem::take(&mut store.data_mut().output);

        // Parse the function's return value.
        match code {
            wasmi::Value::I32(0) => {}
            wasmi::Value::I32(1) => match std::str::from_utf8(&output) {
                Ok(message) => Err(format!("plugin errored with: {message}"))?,
                Err(_) => Err("plugin errored, but did not return a valid error message")?,
            },
            _ => Err("plugin did not respect the protocol")?,
        };
        println!("wasmi get output: {:?}", start.elapsed());

        Ok(())
    }

    /// Write the arguments to the plugin function into the plugin's memory.
    fn wasm_minimal_protocol_write_args_to_buffer(mut caller: Caller<StoreData>, ptr: u32) {
        let memory = caller.get_export("memory").unwrap().into_memory().unwrap();
        let arguments = std::mem::take(&mut caller.data_mut().args);
        let mut offset = ptr as usize;
        for arg in arguments {
            memory.write(&mut caller, offset, arg.as_slice()).unwrap();
            offset += arg.len();
        }
    }

    /// Extracts the output of the plugin function from the plugin's memory.
    fn wasm_minimal_protocol_send_result_to_host(
        mut caller: Caller<StoreData>,
        ptr: u32,
        len: u32,
    ) {
        let memory = caller.get_export("memory").unwrap().into_memory().unwrap();
        let mut buffer = std::mem::take(&mut caller.data_mut().output);
        buffer.resize(len as usize, 0);
        memory.read(&caller, ptr as _, &mut buffer).unwrap();
        caller.data_mut().output = buffer;
    }
}

mod run_wasmtime {
    use std::error::Error;

    use wasmtime::{Caller, Engine, Linker, Module, Store};

    use crate::{ARGS, FUNC_NAME, WASM};

    #[derive(Default)]
    struct StoreData {
        args: Vec<Vec<u8>>,
        output: Vec<u8>,
    }

    pub fn run() -> Result<(), Box<dyn Error>> {
        let start = std::time::Instant::now();
        let engine = Engine::default();
        let module = Module::new(&engine, WASM)?;

        let mut linker = Linker::new(&engine);
        linker.func_wrap(
            "typst_env",
            "wasm_minimal_protocol_send_result_to_host",
            wasm_minimal_protocol_send_result_to_host,
        )?;
        linker.func_wrap(
            "typst_env",
            "wasm_minimal_protocol_write_args_to_buffer",
            wasm_minimal_protocol_write_args_to_buffer,
        )?;
        println!("wasmtime load module: {:?}", start.elapsed());

        let start = std::time::Instant::now();
        let mut store = Store::new(&engine, StoreData::default());
        let instance = linker
            .instantiate(&mut store, &module)
            .map_err(|e| format!("{e}"))?;
        println!("wasmtime instantiate: {:?}", start.elapsed());

        // Ensure that the plugin exports its memory.
        if !matches!(
            instance.get_export(&mut store, "memory"),
            Some(wasmtime::Extern::Memory(_))
        ) {
            Err("plugin does not export its memory")?;
        }

        // Find the function with the given name.
        let func = instance
            .get_export(&mut store, FUNC_NAME)
            .and_then(|e| e.into_func())
            .ok_or_else(|| format!("plugin does not contain a function called {FUNC_NAME}"))?;

        // Collect the lengths of the argument buffers.
        let lengths = ARGS
            .iter()
            .map(|a| wasmtime::Val::I32(a.len() as i32))
            .collect::<Vec<_>>();

        // Store the input data.
        store.data_mut().args = ARGS.iter().map(|arg| arg.to_vec()).collect();

        // Call the function.
        let start = std::time::Instant::now();
        let mut code = wasmtime::Val::I32(-1);
        func.call(&mut store, &lengths, std::slice::from_mut(&mut code))
            .map_err(|err| format!("plugin panicked: {err}"))?;
        println!("wasmtime func call: {:?}", start.elapsed());

        let start = std::time::Instant::now();
        // Extract the returned data.
        let output = std::mem::take(&mut store.data_mut().output);

        // Parse the function's return value.
        match code {
            wasmtime::Val::I32(0) => {}
            wasmtime::Val::I32(1) => match std::str::from_utf8(&output) {
                Ok(message) => Err(format!("plugin errored with: {message}"))?,
                Err(_) => Err("plugin errored, but did not return a valid error message")?,
            },
            _ => Err("plugin did not respect the protocol")?,
        };
        println!("wasmtime get output: {:?}", start.elapsed());

        Ok(())
    }

    /// Write the arguments to the plugin function into the plugin's memory.
    fn wasm_minimal_protocol_write_args_to_buffer(mut caller: Caller<StoreData>, ptr: u32) {
        let memory = caller.get_export("memory").unwrap().into_memory().unwrap();
        let arguments = std::mem::take(&mut caller.data_mut().args);
        let mut offset = ptr as usize;
        for arg in arguments {
            memory.write(&mut caller, offset, arg.as_slice()).unwrap();
            offset += arg.len();
        }
    }

    /// Extracts the output of the plugin function from the plugin's memory.
    fn wasm_minimal_protocol_send_result_to_host(
        mut caller: Caller<StoreData>,
        ptr: u32,
        len: u32,
    ) {
        let memory = caller.get_export("memory").unwrap().into_memory().unwrap();
        let mut buffer = std::mem::take(&mut caller.data_mut().output);
        buffer.resize(len as usize, 0);
        memory.read(&caller, ptr as _, &mut buffer).unwrap();
        caller.data_mut().output = buffer;
    }
}
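
Both modules above repeat the same `Instant::now()` / `println!` pattern for every phase. As a rough sketch of how that could be factored out (this is not part of the code that produced the numbers, and `timed` is a hypothetical helper using only the standard library):

```rust
use std::time::{Duration, Instant};

/// Runs `f`, prints how long it took under `label`, and returns
/// both the closure's result and the elapsed time.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    let elapsed = start.elapsed();
    println!("{label}: {elapsed:?}");
    (value, elapsed)
}

fn main() {
    // Stand-in workload; in the benchmark this would be e.g. the module load.
    let (sum, _elapsed) = timed("sum 1..=1000", || (1..=1000u64).sum::<u64>());
    println!("result: {sum}");
}
```

Returning the `Duration` alongside the value would also make it easy to add up the phases instead of timing the total separately.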

@laurmaedje
Member

It was chosen for binary size, dependency count and simplicity. Maybe we do need to switch to wasmtime after all. (And to native WebAssembly modules in the web app, right now it also uses wasmi, but that's really only because we didn't have time to implement it properly.)

@RubixDev
Contributor Author

I personally also had a good experience with using wasmer in the web by simply enabling its "js" feature, which uses the browser's native Wasm support.

@RubixDev
Contributor Author

Just in case someone is interested, I also inspected the binary size a bit further. Most of the 30 MB does indeed come from the various tree-sitter parsers, with the following approximate distribution:

12.05 MB  verilog         (12633248 B)
 4.10 MB  c_sharp         ( 4298933 B)
 0.74 MB  tsx             (  776696 B)
 0.72 MB  typescript      (  753613 B)
 0.63 MB  dart            (  660143 B)
 0.50 MB  rust            (  521345 B)
 0.49 MB  c               (  516491 B)
 0.27 MB  javascript      (  282965 B)
 0.25 MB  java            (  260383 B)
 0.25 MB  ql              (  257255 B)
 0.22 MB  go              (  230512 B)
 0.17 MB  wat             (  179282 B)
 0.12 MB  scss            (  123203 B)
 0.05 MB  css             (   47959 B)
 0.04 MB  rush            (   43693 B)
 0.02 MB  json5           (   22487 B)
 0.02 MB  toml            (   21570 B)
 0.01 MB  diff            (   14912 B)
 0.01 MB  jsdoc           (   11941 B)
 0.01 MB  regex           (    9422 B)
 0.01 MB  asm             (    8607 B)
 0.00 MB  comment         (    4171 B)
 0.00 MB  json            (    3507 B)
 0.00 MB  jsonc           (    3334 B)
 0.00 MB  ejs             (    3020 B)
 0.00 MB  erb             (    3020 B)
 0.00 MB  ebnf            (    1338 B)
 0.00 MB  hexdump         (       0 B)

The hexdump parser was used as a baseline to remove common code from the sizes, which is why it shows as 0 bytes here. I will probably remove support for verilog, which will already reduce the final Wasm binary size to 13 MB.
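
As a quick sanity check on the list above, the following sketch sums the per-parser sizes and computes verilog's share. The byte counts are copied from the list; `total_bytes` and `share_of` are hypothetical helper names for this illustration. Since these are sizes above the hexdump baseline, the sum leaves out the baseline itself, the tree-sitter core, and the queries, so it is not expected to match the size of the final binary.

```rust
/// (parser name, bytes above the hexdump baseline), copied from the list above.
const PARSERS: &[(&str, u64)] = &[
    ("verilog", 12_633_248), ("c_sharp", 4_298_933), ("tsx", 776_696),
    ("typescript", 753_613), ("dart", 660_143), ("rust", 521_345),
    ("c", 516_491), ("javascript", 282_965), ("java", 260_383),
    ("ql", 257_255), ("go", 230_512), ("wat", 179_282),
    ("scss", 123_203), ("css", 47_959), ("rush", 43_693),
    ("json5", 22_487), ("toml", 21_570), ("diff", 14_912),
    ("jsdoc", 11_941), ("regex", 9_422), ("asm", 8_607),
    ("comment", 4_171), ("json", 3_507), ("jsonc", 3_334),
    ("ejs", 3_020), ("erb", 3_020), ("ebnf", 1_338), ("hexdump", 0),
];

/// Sum of all listed parser sizes in bytes.
fn total_bytes(parsers: &[(&str, u64)]) -> u64 {
    parsers.iter().map(|&(_, bytes)| bytes).sum()
}

/// Fraction of the listed total taken up by `name`, if it is in the list.
fn share_of(name: &str, parsers: &[(&str, u64)]) -> Option<f64> {
    let total = total_bytes(parsers) as f64;
    parsers
        .iter()
        .find(|&&(n, _)| n == name)
        .map(|&(_, bytes)| bytes as f64 / total)
}

fn main() {
    let total = total_bytes(PARSERS);
    let mib = total as f64 / (1024.0 * 1024.0);
    println!("all parsers: {total} B ({mib:.2} MiB)");
    if let Some(share) = share_of("verilog", PARSERS) {
        println!("verilog share: {:.1}%", share * 100.0);
    }
}
```

By this count verilog alone accounts for more than half of the listed parser bytes, which is consistent with dropping it being the single biggest saving.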

It's difficult to measure precisely, but roughly 2 MB also seems to come from the tree-sitter core package alone, which is a bit surprising given the official web-tree-sitter Wasm release binary is only 182 KB 🤔 (and yes, I did use both wasm-opt and wasm-strip).
