Add pooling for parallel usage of SyntaxSet #78

williamyaoh · 2017-06-17T18:31:56Z

See #20.

This PR adds a struct SyntaxSetPool which pools SyntaxSets for parallel usage with the minimum amount of initialization of said SyntaxSets. By default, it will only initialize as many SyntaxSets as the running machine has CPUs, though the pool size can also be specified explicitly.

Examples of the usage of SyntaxSetPool can be found in examples/parsyncat.rs, which highlights multiple source files in parallel and prints the results to stdout, and benches/parallel_parsing.rs.

Issues

SyntaxDefinition's preexisting Clone implementation, which was merely incorrect before (due to Weak pointers in the resulting clone), can now cause memory unsafety due to their containing SyntaxSet being ferried across thread boundaries. Since this implementation was already incorrect, ideally this Clone implementation would be fixed or removed before merging this PR. Alternatively, there might be a way to hide it from library users.

williamyaoh · 2017-06-17T18:54:49Z

I'd like some eyes on the unsafe code in parsing/syntax_set_pool.rs; from the benchmarks/tests, it seems to be safe the way I'm using it (barring the aforementioned Clone problem on SyntaxDefinition), but it doesn't hurt to check.

trishume

Thanks for taking on this feature, I appreciate it.

Unfortunately, I have a number of issues with the way you did it. I'm sorry you put the work into implementing it this way; if you had told me your plans in #20, I could have told you ahead of time what changes I'd request.

But, I've included suggestions for what changes you could make to get this PR to a state where I'd approve it.

trishume · 2017-06-20T01:41:27Z

benches/parallel_parsing.rs

+
+#[bench]
+fn bench_parallel_parsing(b: &mut Bencher) {
+    let files =


I'd prefer that all these files weren't added to the repo, but if you have a good reason I'm willing to be convinced.

My proposed alternative is just using the existing large test files (parser.rs and jquery.js) multiple times each.

trishume · 2017-06-20T01:41:52Z

examples/parsyncat.rs

+use std::io::{stdout, BufReader, BufRead, Write};
+use std::fs::File;
+
+trait HasSync: Sync {}


This appears to be unused.

trishume · 2017-06-20T01:58:01Z

src/parsing/syntax_set_pool.rs

+    /// Same as `new`, but with a defined maximum pool size.
+    pub fn with_pool_size(init_fn: F, pool_size: usize) -> Self {
+        SyntaxSetPool {
+            syntaxes: Mutex::new(repeating(LazyInit::new).take(pool_size).collect()),


This could be replaced with (0..pool_size).map(|_| LazyInit::new()).collect() or just a for loop using Vec::push, and then the whole repeating-iterator code wouldn't need to be included just for this single use.
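Both alternatives from this comment can be sketched with the standard library alone. `Slot<T>` below is a hypothetical stand-in for the PR's `LazyInit<T>` (whose real definition is not shown in this thread), assuming only that it has a zero-argument `new`:

```rust
/// Hypothetical stand-in for the PR's `LazyInit<T>`: an empty slot that
/// would be filled on first use. Not the PR's actual type.
struct Slot<T> {
    value: Option<T>,
}

impl<T> Slot<T> {
    fn new() -> Self {
        Slot { value: None }
    }

    fn is_initialized(&self) -> bool {
        self.value.is_some()
    }
}

/// Build `pool_size` empty slots by mapping over a range.
/// A range is already an iterator, so no `.iter()` call is needed.
fn empty_slots<T>(pool_size: usize) -> Vec<Slot<T>> {
    (0..pool_size).map(|_| Slot::new()).collect()
}

/// The same result with an explicit loop and `Vec::push`.
fn empty_slots_loop<T>(pool_size: usize) -> Vec<Slot<T>> {
    let mut slots = Vec::with_capacity(pool_size);
    for _ in 0..pool_size {
        slots.push(Slot::new());
    }
    slots
}
```

Either function produces a vector of `pool_size` uninitialized slots, so no custom repeating iterator is required.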

trishume · 2017-06-20T02:21:45Z

src/parsing/syntax_set_pool.rs

+pub struct SyntaxSetPool<F: Fn() -> SyntaxSet> {
+    /// We intentionally use a *stack* of `SyntaxSet`s so that
+    /// already-initialized syntaxes get reused as much as possible.
+    syntaxes: Mutex<Vec<LazyInit<SyntaxSet>>>,


Instead of using this whole unsafe LazyInit business (which I'm very uncomfortable with, especially given that without it syntect itself contains no unsafe code), you could do something like this:

Define type PooledSet = Arc<Mutex<SyntaxSet>>;

In SyntaxSetPool use syntaxes: Mutex<Vec<PooledSet>>.

Add a field max_pool_size that you set on initialization to the core count.

Add a field num_in_pool that starts at 0.

At initialization the syntaxes vector is empty.

When getting a syntax, if there are none left in syntaxes:

If num_in_pool < max_pool_size, create a new one and bump num_in_pool.

Else block on has_syntax

If there is a syntax, pop it, lock it, call go, then unlock it, lock the pool and return it.

So at the cost of one extra never-contested lock (which AFAIK should be very cheap) per syntax get, the amount of code required decreases substantially and it no longer requires unsafe.
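A minimal sketch of the steps above, written over a generic `T` rather than `SyntaxSet`. The names `Pool`, `PoolState`, `with_item` and `num_created` are hypothetical, and reading `has_syntax` as a `Condvar` is an assumption. Note that sharing this pool between threads requires `T: Send`, because `Arc<Mutex<T>>` is only `Send` and `Sync` when `T` is `Send`.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// One pooled value behind its own (normally uncontested) lock.
type Pooled<T> = Arc<Mutex<T>>;

struct PoolState<T> {
    /// Stack of idle values; starts empty.
    free: Vec<Pooled<T>>,
    /// How many values have been created so far.
    num_in_pool: usize,
}

pub struct Pool<T, F: Fn() -> T> {
    init_fn: F,
    max_pool_size: usize,
    state: Mutex<PoolState<T>>,
    /// Signalled whenever a value is returned to the pool.
    has_item: Condvar,
}

impl<T, F: Fn() -> T> Pool<T, F> {
    pub fn with_pool_size(init_fn: F, max_pool_size: usize) -> Self {
        assert!(max_pool_size > 0, "a pool of size 0 would block forever");
        Pool {
            init_fn,
            max_pool_size,
            state: Mutex::new(PoolState { free: Vec::new(), num_in_pool: 0 }),
            has_item: Condvar::new(),
        }
    }

    /// Take a value, run `go` on it, then return it to the pool.
    pub fn with_item<R>(&self, go: impl FnOnce(&mut T) -> R) -> R {
        let existing = {
            let mut state = self.state.lock().unwrap();
            loop {
                if let Some(item) = state.free.pop() {
                    break Some(item);
                }
                if state.num_in_pool < self.max_pool_size {
                    // Reserve a slot now; build the value after unlocking.
                    state.num_in_pool += 1;
                    break None;
                }
                // Pool exhausted: wait until another thread returns a value.
                state = self.has_item.wait(state).unwrap();
            }
        };
        let item = existing.unwrap_or_else(|| Arc::new(Mutex::new((self.init_fn)())));
        let result = go(&mut *item.lock().unwrap());
        self.state.lock().unwrap().free.push(item);
        self.has_item.notify_one();
        result
    }

    pub fn num_created(&self) -> usize {
        self.state.lock().unwrap().num_in_pool
    }
}
```

For example, `Pool::with_pool_size(|| 0u32, 2)` wrapped in an `Arc` can be handed to several threads, each calling `pool.with_item(|n| *n += 1)`. This sketch does not return the value to the pool if `go` panics.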

williamyaoh · 2017-06-20

Unfortunately, I don't think that solution would work, since that wouldn't make SyntaxSetPool be Sync, since PooledSet wouldn't be Sync. This is the big sticking point for this pull request too, but I don't see a way to transport SyntaxSets across thread boundaries the way it's needed without using unsafe.

I agree with you that adding unsafe code is unsavoury, so maybe in the end the best way to resolve #20 would be to recommend people use thread_local! + rayon.

trishume · 2017-06-20

@williamyaoh you're right, my mistake. Looking into the docs for Send and Sync to try and understand why the Mutex doesn't end up Sync, I think the same issue might apply to your unsafe code.

I haven't thought about it very long yet but I think you might be able to pass an initializer function that clones one of the Rcs inside SyntaxSet, stashes it outside the closure inside a RefCell or something and use that to create a data race on the refcount.

Unless one of us thinks of a way to do this without unsafe, or perhaps more plausibly, by using a well-tested, established crate (which might use unsafe) to do it, I'm leaning towards just recommending thread_local! and rayon in the docs like you say.

williamyaoh · 2017-06-22

I don't believe, with SyntaxSet being written the way it is, that there's a way to implement pooling in syntect itself using only safe code.

For the purposes of pooling, the only marker we care about is making SyntaxSet be Send (If it were Sync, we could just share it between threads directly, with no need for pooling). My understanding of Send is that it denotes a type that's safe to move ownership of across thread boundaries, so making SyntaxSet be Send would require everything it owns to also be Send, which clearly can't happen because of the Rcs. For the same reason, no external crate could provide a safe interface to transport non-Send types across thread boundaries in general.

Without it being either Send or Sync, there's simply no way to safely share a SyntaxSet between threads. Either you move it, or you share it through a reference, whether that's through a normal reference or some kind of smart pointer.

I'd just go with recommending rayon. On reflection, the amount of code within the unsafety boundary caused by unsafely implementing Send on SyntaxSet is a bit too large to stomach.
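The Send argument above can be illustrated with two hypothetical stand-in structs (neither is syntect's real `SyntaxSet`): one holding `Rc`s, as `SyntaxSet` is described in this thread, and one holding `Arc`s. The compiler accepts moving the `Arc` version into a thread and rejects the `Rc` version.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for an `Rc`-based structure. Because `Rc<T>` is
/// neither `Send` nor `Sync`, this struct is neither as well.
struct RcBacked {
    contexts: Vec<Rc<String>>,
}

/// The same shape using `Arc`, which is `Send + Sync` when its contents are.
struct ArcBacked {
    contexts: Vec<Arc<String>>,
}

/// Compiles only for types whose ownership may move to another thread.
fn assert_send<T: Send>(_: &T) {}

fn move_across_threads() -> usize {
    let arc_backed = ArcBacked { contexts: vec![Arc::new("main".to_string())] };
    assert_send(&arc_backed);
    // Ownership moves into the spawned thread; this is allowed for `Send` types.
    let handle = thread::spawn(move || arc_backed.contexts.len());

    let rc_backed = RcBacked { contexts: vec![Rc::new("main".to_string())] };
    // Uncommenting either line below is a compile error, because
    // `Rc<String>` cannot be sent between threads safely:
    // assert_send(&rc_backed);
    // thread::spawn(move || rc_backed.contexts.len());
    let local = rc_backed.contexts.len();

    handle.join().unwrap() + local
}
```

The `Rc` version can only be used on the thread that created it, which is why a safe pool cannot hand it to another thread.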

trishume · 2017-06-23

@williamyaoh that's a good point, you're right. There's basically no way to prove to the compiler that you haven't cloned one of the Rcs and are going to access it from another thread. And I'm not even sure there is a way to stop you from doing that, even without proving it to the compiler.

I can update the docs and maybe add an example using Rayon and thread_local!, but it's inconvenient to contribute to syntect right now because I'm working at Google and it's used in a Google project, so I can't claim that it's unrelated and I'd have to go through a process.

So either I can do that in a couple months after my internship or I can accept a PR if you do it.

trishume · 2017-06-20T02:22:58Z

Cargo.toml

@@ -26,9 +26,11 @@ fnv = { version = "^1.0", optional = true }
 serde = { version = "1.0", features = ["rc"] }
 serde_derive = "1.0"
 serde_json = "1.0"
+num_cpus = "^1.5"


I think this should be behind the parsing feature just like a bunch of the other crates. Same for the corresponding extern crate line.

trishume · 2017-06-20T02:31:29Z

src/parsing/syntax_set_pool.rs

+/// Use syntax sets by passing a closure to `with_syntax_set()`. A
+/// `SyntaxSet` will be removed from the pool, and a reference to it
+/// passed to the given closure. Afterwards, the set will automatically
+/// be returned to the pool.


I would add a note here to avoid using this when you're already using a crate like rayon that uses a small thread pool. In that case it is simpler to just use:

thread_local!(static SYNTAX_SET: SyntaxSet = SyntaxSet::load_defaults_newlines());
// and later:
SYNTAX_SET.with(|syntax_set| ... );
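A dependency-free sketch of the same pattern, using `std::thread` in place of rayon and a hypothetical `FakeSyntaxSet` in place of syntect's `SyntaxSet`. The `LOADS` counter and the function name are illustration only. With rayon, the `SYNTAX_SET.with(...)` call would sit inside the parallel iterator's closure instead.

```rust
use std::rc::Rc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Counts how many times the (stand-in) syntax set has been loaded.
static LOADS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for `SyntaxSet`: it holds `Rc`s, so it is not
/// `Send` and cannot be moved between threads.
struct FakeSyntaxSet {
    names: Vec<Rc<str>>,
}

impl FakeSyntaxSet {
    fn load() -> Self {
        LOADS.fetch_add(1, Ordering::SeqCst);
        FakeSyntaxSet { names: vec![Rc::from("Rust"), Rc::from("JavaScript")] }
    }
}

// Each thread lazily builds its own copy on first access.
thread_local!(static SYNTAX_SET: FakeSyntaxSet = FakeSyntaxSet::load());

/// Runs `jobs_per_worker` jobs on each of `workers` threads and returns
/// (sum of job results, number of loads observed so far).
fn count_syntaxes_on_workers(workers: usize, jobs_per_worker: usize) -> (usize, usize) {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            thread::spawn(move || {
                (0..jobs_per_worker)
                    .map(|_| SYNTAX_SET.with(|set| set.names.len()))
                    .sum::<usize>()
            })
        })
        .collect();
    let total = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (total, LOADS.load(Ordering::SeqCst))
}
```

Each worker thread pays the load cost once, no matter how many jobs it runs, and the non-Send value never leaves the thread that created it.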

williamyaoh added 6 commits June 17, 2017 13:02

add working syntax set pool for parallel parsing (trishume#20)

7064875

Merge branch 'master' into syntax_set_pool

3d0c4c9

remove an oopsie

514c6e7

add useful documentation for SyntaxSetPool

0f92532

hopefully fix weird path error in Travis

32545d9

add feature annotations on syntax_set_pool

757de7b

trishume requested changes Jun 20, 2017

williamyaoh closed this Jun 25, 2017

This was referenced Jun 25, 2017

Recommend rayon, add examples/documentation of parallel usage #82

Merged

Enable multi-threaded use #20

Closed
