Skip to content

Conversation

williamyaoh
Copy link
Contributor

@williamyaoh williamyaoh commented Jun 17, 2017

See #20.

This PR adds a struct SyntaxSetPool which pools SyntaxSets for parallel usage with the minimum amount of initialization of said SyntaxSets. By default, it will only initialize as many SyntaxSets as the running machine has CPUs, though this can be specified as well.

Examples of the usage of SyntaxSetPool can be found in examples/parsyncat.rs, which highlights multiple source files in parallel and prints the results to stdout, and benches/parallel_parsing.rs.

Issues

SyntaxDefinition's preexisting Clone implementation, which was merely incorrect before (due to Weak pointers in the resulting clone), can now cause memory unsafety due to their containing SyntaxSet being ferried across thread boundaries. Since this implementation was already incorrect, ideally this Clone implementation would be fixed or removed before merging this PR. Alternatively, there might be a way to hide it from library users.

@williamyaoh
Copy link
Contributor Author

I'd like some eyes on the unsafe code in parsing/syntax_set_pool.rs; from the benchmarks/tests, it seems to be safe the way I'm using it (barring the aforementioned Clone problem on SyntaxDefinition), but it doesn't hurt to check.

Copy link
Owner

@trishume trishume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking on this feature, I appreciate it.

Unfortunately, I have a number of issues with the way you did it. I'm sorry you put the work into implementing it this way, if you had told me your plans in #20 I could have told you ahead of time what changes I'd request.

But, I've included suggestions for what changes you could make to get this PR to a state where I'd approve it.


#[bench]
fn bench_parallel_parsing(b: &mut Bencher) {
let files =
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer that all these files weren't added to the repo, but if you have a good reason I'm willing to be convinced.

My proposed alternative is just using the existing large test files (parser.rs and jquery.js) multiple times each.

use std::io::{stdout, BufReader, BufRead, Write};
use std::fs::File;

trait HasSync: Sync {}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be unused.

/// Same as `new`, but with a defined maximum pool size.
pub fn with_pool_size(init_fn: F, pool_size: usize) -> Self {
SyntaxSetPool {
syntaxes: Mutex::new(repeating(LazyInit::new).take(pool_size).collect()),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be replaced with (0..pool_size).iter().map(|_| LazyInit::new).collect() or just a for loop using Vec::push and then the whole bunch repeating iterator code wouldn't need to be included just for this single use.

pub struct SyntaxSetPool<F: Fn() -> SyntaxSet> {
/// We intentionally use a *stack* of `SyntaxSet`s so that
/// already-initialized syntaxes get reused as much as possible.
syntaxes: Mutex<Vec<LazyInit<SyntaxSet>>>,
Copy link
Owner

@trishume trishume Jun 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using this whole unsafe LazyInit business (which I'm very uncomfortable with, especially given that without it syntect itself contains no unsafe code), you could do something like this:

  1. Define `type PooledSet = Arc<Mutex>;
  2. In SyntaxSetPool use syntaxes: Mutex<Vec<PooledSet>>.
  3. Add a field max_pool_size that you set on initialization to the core count.
  4. Add a field num_in_pool that starts at 0.
  5. At initialization the syntaxes vector is empty.
  6. When getting a syntax, if there are none left in syntaxes:
    • If num_in_pool < max_pool_size, create a new one and bump num_in_pool.
    • Else block on has_syntax
  7. If there is a syntax, pop it, lock it, call go, then unlock it, lock the pool and return it.

So at the cost of one extra never-contested lock (which AFAIK should be very cheap) per syntax get, the amount of code required decreases substantially and it no longer requires unsafe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I don't think that solution would work, since that wouldn't make SyntaxSetPool be Sync, since PooledSet wouldn't be Sync. This is the big sticking point for this pull request too, but I don't see a way to transport SyntaxSets across thread boundaries the way it's needed without using unsafe.

I agree with you that adding unsafe code is unsavoury, so maybe in the end the best way to resolve #20 would be to recommend people use thread_local! + rayon.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@williamyaoh you're right, my mistake. Looking into the docs for Send and Sync to try and understand why the Mutex doesn't end up Sync, I think the same issue might apply to your unsafe code.

I haven't thought about it very long yet but I think you might be able to pass an initializer function that clones one of the Rcs inside SyntaxSet, stashes it outside the closure inside a RefCell or something and use that to create a data race on the refcount.

Unless one of us thinks of a way to do this without unsafe, or perhaps more possibly, using a well-tested established crate (which might use unsafe) to do it, I'm leaning towards just recommending thread_local! and rayon in the docs like you say.

Copy link
Contributor Author

@williamyaoh williamyaoh Jun 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe, with SyntaxSet being written the way it is, that there's a way to implement pooling in syntect itself using only safe code.

For the purposes of pooling, the only marker we care about is making SyntaxSet be Send (If it were Sync, we could just share it between threads directly, with no need for pooling). My understanding of Send is that it denotes a type that's safe to move ownership of across thread boundaries, so making SyntaxSet be Send would require everything it owns to also be Send, which clearly can't happen because of the Rcs. For the same reason, no external crate could provide a safe interface to transport non-Send types across thread boundaries in general.

Without it being either Send or Sync, there's simply no way to safely share a SyntaxSet between threads. Either you move it, or you share it through a reference, whether that's through a normal reference or some kind of smart pointer.

I'd just go with recommending rayon. On reflection, the amount of code within the unsafety boundary caused by unsafely implementing Send on SyntaxSet is a bit too large to stomach.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@williamyaoh that's a good point, you're right. There's basically no way to prove to the compiler that you haven't cloned one of the Rcs and are going to access it from another thread. And I'm not even sure there is a way to stop you from doing that, even without proving it to the compiler.

I can update the docs and maybe add an example using Rayon and thread_local!, but it's inconvenient to contribute to syntect right now because I'm working at Google and it's used in a Google project, so I can't claim that it's unrelated and I'd have to go through a process.

So either I can do that in a couple months after my internship or I can accept a PR if you do it.

@@ -26,9 +26,11 @@ fnv = { version = "^1.0", optional = true }
serde = { version = "1.0", features = ["rc"] }
serde_derive = "1.0"
serde_json = "1.0"
num_cpus = "^1.5"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be behind the parsing feature just like a bunch of the other crates. Same for the corresponding extern crate line.

/// Use syntax sets by passing a closure to `with_syntax_set()`. A
/// `SyntaxSet` will be removed from the pool, and a reference to it
/// passed to the given closure. Afterwards, the set will automatically
/// be returned to the pool.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a note here to avoid using this when you're already using a crate like rayon that uses a small thread pool. In that case it is simpler to just use:

thread_local!(static SYNTAX_SET: SyntaxSet = SyntaxSet::load_defaults_newlines());

// and later:
SYNTAX_SET.with(|syntax_set| ... );

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants