-
Notifications
You must be signed in to change notification settings - Fork 155
Add pooling for parallel usage of SyntaxSet #78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'd like some eyes on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking on this feature, I appreciate it.
Unfortunately, I have a number of issues with the way you did it. I'm sorry you put the work into implementing it this way, if you had told me your plans in #20 I could have told you ahead of time what changes I'd request.
But, I've included suggestions for what changes you could make to get this PR to a state where I'd approve it.
|
||
#[bench] | ||
fn bench_parallel_parsing(b: &mut Bencher) { | ||
let files = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer that all these files weren't added to the repo, but if you have a good reason I'm willing to be convinced.
My proposed alternative is just using the existing large test files (parser.rs
and jquery.js
) multiple times each.
use std::io::{stdout, BufReader, BufRead, Write}; | ||
use std::fs::File; | ||
|
||
trait HasSync: Sync {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be unused.
/// Same as `new`, but with a defined maximum pool size. | ||
pub fn with_pool_size(init_fn: F, pool_size: usize) -> Self { | ||
SyntaxSetPool { | ||
syntaxes: Mutex::new(repeating(LazyInit::new).take(pool_size).collect()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be replaced with (0..pool_size).iter().map(|_| LazyInit::new).collect()
or just a for loop using Vec::push
and then the whole bunch repeating
iterator code wouldn't need to be included just for this single use.
pub struct SyntaxSetPool<F: Fn() -> SyntaxSet> { | ||
/// We intentionally use a *stack* of `SyntaxSet`s so that | ||
/// already-initialized syntaxes get reused as much as possible. | ||
syntaxes: Mutex<Vec<LazyInit<SyntaxSet>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using this whole unsafe LazyInit
business (which I'm very uncomfortable with, especially given that without it syntect
itself contains no unsafe code), you could do something like this:
- Define `type PooledSet = Arc<Mutex>;
- In SyntaxSetPool use
syntaxes: Mutex<Vec<PooledSet>>
. - Add a field
max_pool_size
that you set on initialization to the core count. - Add a field
num_in_pool
that starts at0
. - At initialization the
syntaxes
vector is empty. - When getting a syntax, if there are none left in
syntaxes
:- If
num_in_pool < max_pool_size
, create a new one and bumpnum_in_pool
. - Else block on
has_syntax
- If
- If there is a syntax, pop it, lock it, call
go
, then unlock it, lock the pool and return it.
So at the cost of one extra never-contested lock (which AFAIK should be very cheap) per syntax get, the amount of code required decreases substantially and it no longer requires unsafe
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I don't think that solution would work, since that wouldn't make SyntaxSetPool
be Sync
, since PooledSet
wouldn't be Sync
. This is the big sticking point for this pull request too, but I don't see a way to transport SyntaxSet
s across thread boundaries the way it's needed without using unsafe
.
I agree with you that adding unsafe
code is unsavoury, so maybe in the end the best way to resolve #20 would be to recommend people use thread_local!
+ rayon
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@williamyaoh you're right, my mistake. Looking into the docs for Send
and Sync
to try and understand why the Mutex
doesn't end up Sync
, I think the same issue might apply to your unsafe code.
I haven't thought about it very long yet but I think you might be able to pass an initializer function that clones one of the Rc
s inside SyntaxSet
, stashes it outside the closure inside a RefCell
or something and use that to create a data race on the refcount.
Unless one of us thinks of a way to do this without unsafe
, or perhaps more possibly, using a well-tested established crate (which might use unsafe) to do it, I'm leaning towards just recommending thread_local!
and rayon
in the docs like you say.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe, with SyntaxSet
being written the way it is, that there's a way to implement pooling in syntect
itself using only safe code.
For the purposes of pooling, the only marker we care about is making SyntaxSet
be Send
(If it were Sync
, we could just share it between threads directly, with no need for pooling). My understanding of Send
is that it denotes a type that's safe to move ownership of across thread boundaries, so making SyntaxSet
be Send
would require everything it owns to also be Send
, which clearly can't happen because of the Rc
s. For the same reason, no external crate could provide a safe interface to transport non-Send
types across thread boundaries in general.
Without it being either Send
or Sync
, there's simply no way to safely share a SyntaxSet
between threads. Either you move it, or you share it through a reference, whether that's through a normal reference or some kind of smart pointer.
I'd just go with recommending rayon
. On reflection, the amount of code within the unsafety boundary caused by unsafely impl
ementing Send
on SyntaxSet
is a bit too large to stomach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@williamyaoh that's a good point, you're right. There's basically no way to prove to the compiler that you haven't cloned one of the Rc
s and are going to access it from another thread. And I'm not even sure there is a way to stop you from doing that, even without proving it to the compiler.
I can update the docs and maybe add an example using Rayon and thread_local!, but it's inconvenient to contribute to syntect right now because I'm working at Google and it's used in a Google project, so I can't claim that it's unrelated and I'd have to go through a process.
So either I can do that in a couple months after my internship or I can accept a PR if you do it.
@@ -26,9 +26,11 @@ fnv = { version = "^1.0", optional = true } | |||
serde = { version = "1.0", features = ["rc"] } | |||
serde_derive = "1.0" | |||
serde_json = "1.0" | |||
num_cpus = "^1.5" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be behind the parsing
feature just like a bunch of the other crates. Same for the corresponding extern crate
line.
/// Use syntax sets by passing a closure to `with_syntax_set()`. A | ||
/// `SyntaxSet` will be removed from the pool, and a reference to it | ||
/// passed to the given closure. Afterwards, the set will automatically | ||
/// be returned to the pool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a note here to avoid using this when you're already using a crate like rayon
that uses a small thread pool. In that case it is simpler to just use:
thread_local!(static SYNTAX_SET: SyntaxSet = SyntaxSet::load_defaults_newlines());
// and later:
SYNTAX_SET.with(|syntax_set| ... );
See #20.
This PR adds a struct
SyntaxSetPool
which poolsSyntaxSet
s for parallel usage with the minimum amount of initialization of saidSyntaxSet
s. By default, it will only initialize as manySyntaxSet
s as the running machine has CPUs, though this can be specified as well.Examples of the usage of
SyntaxSetPool
can be found inexamples/parsyncat.rs
, which highlights multiple source files in parallel and prints the results tostdout
, andbenches/parallel_parsing.rs
.Issues
SyntaxDefinition
's preexistingClone
implementation, which was merely incorrect before (due toWeak
pointers in the resulting clone), can now cause memory unsafety due to their containingSyntaxSet
being ferried across thread boundaries. Since this implementation was already incorrect, ideally thisClone
implementation would be fixed or removed before merging this PR. Alternatively, there might be a way to hide it from library users.