Rayon has a `ParallelIterator` trait which is based around recursively splitting tasks into subtasks until the set of workers in the pool is saturated. `ParallelIterator` is fairly central to Rayon's overall design, and I would argue that the `sum_rayon` implementation in the benchmark here, which calls `join` for every single node in the tree, is really not idiomatic:
```rust
pub fn sum_rayon(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    if let (Some(left), Some(right)) = (&node.left, &node.right) {
        let (left_result, right_result) =
            pool.join(|| sum_rayon(pool, left), || sum_rayon(pool, right));
        return node.value + left_result + right_result;
    }
    let mut result = node.value;
    if let Some(child) = &node.left {
        result += sum_rayon(pool, child);
    }
    if let Some(child) = &node.right {
        result += sum_rayon(pool, child);
    }
    return result;
}
```
Here's an alternate version that makes use of Rayon's `iter::split` function:
```rust
pub fn sum_split(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    use rayon::{iter, iter::ParallelIterator};
    pool.install(|| {
        iter::split(node, |node| {
            if let (Some(left), Some(right)) = (&node.left, &node.right) {
                (left, Some(right))
            } else {
                (node, None)
            }
        })
        .map(sum)
        .reduce(|| 0, |left, right| left + right)
    })
}
```
When I benchmark these two implementations on a tree with 10,000,000 nodes (on an Intel Core i5-8400 CPU with 6 cores), the version that uses `split` is significantly faster than the version that calls `join` for every node, and is never slower than the baseline:
| Threads | Baseline | Rayon | Rayon (split) |
|---|---|---|---|
| 1 | 57.833 ms | 230.48 ms | 57.814 ms |
| 2 | | 115.98 ms | 30.517 ms |
| 4 | | 59.531 ms | 17.483 ms |
| 8 | | 41.588 ms | 15.079 ms |
| 16 | | 41.426 ms | 14.772 ms |
| 32 | | 41.431 ms | 14.799 ms |
However, it is true that the `split` version still shows the same poor scaling behavior for a tree with only 1,000 nodes:
| Threads | Baseline | Rayon | Rayon (split) |
|---|---|---|---|
| 1 | 2.4961 µs | 25.781 µs | 3.9588 µs |
| 2 | | 17.522 µs | 8.5428 µs |
| 4 | | 16.180 µs | 12.508 µs |
| 8 | | 20.689 µs | 18.707 µs |
| 16 | | 28.541 µs | 28.069 µs |
| 32 | | 50.097 µs | 52.966 µs |
This is because the task is being eagerly split up to saturate the available workers without any regard for how small it is. Of course, there are usually good ad hoc solutions to this in the context of any particular problem: in this case, you could introduce a heuristic to make `split` bottom out for tasks below a certain size, and `IndexedParallelIterator` has methods like `with_min_len`/`with_max_len` and `by_uniform_blocks`/`by_exponential_blocks`. However, it's true that Rayon doesn't offer any completely automatic solution here, and the fact that Spice does makes it genuinely different and interesting.
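To make that concrete, here's a sketch of what the bottom-out heuristic could look like with `iter::split`. To keep it self-contained I've written stand-in definitions for `Node` and the sequential `sum` (assuming `Box`-ed children, which may not match the benchmark exactly), and since a bare `&Node` doesn't know its own subtree size, a remaining split budget serves as a crude stand-in for a size estimate; the budget of 12 is an arbitrary choice, not a tuned value:

```rust
use rayon::{iter, iter::ParallelIterator};

// Stand-ins for the benchmark's own type and sequential baseline.
pub struct Node<T> {
    pub value: T,
    pub left: Option<Box<Node<T>>>,
    pub right: Option<Box<Node<T>>>,
}

fn sum(node: &Node<i64>) -> i64 {
    let mut total = node.value;
    if let Some(left) = &node.left {
        total += sum(left);
    }
    if let Some(right) = &node.right {
        total += sum(right);
    }
    total
}

// Each piece carries (remaining split budget, value carried over from nodes
// that were split, subtree). Splitting stops once the budget reaches zero, so
// at most 2^12 pieces are produced here.
pub fn sum_split_bounded(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    pool.install(|| {
        iter::split((12u32, 0i64, node), |(budget, carry, node)| {
            match (node.left.as_deref(), node.right.as_deref()) {
                (Some(left), Some(right)) if budget > 0 => {
                    // Fold the split node's own value into the left piece so
                    // it still contributes to the total.
                    ((budget - 1, carry + node.value, left), Some((budget - 1, 0, right)))
                }
                _ => ((budget, carry, node), None),
            }
        })
        .map(|(_, carry, node)| carry + sum(node))
        .reduce(|| 0, |left, right| left + right)
    })
}
```

Whether a cap like this actually helps for the 1,000-node tree would of course need to be measured; the point is just that it has to be added by hand.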
Essentially, I think that there are two main factors responsible for the differences in the benchmarks between Rayon and Spice: first, that Spice is able to avoid the overhead of dynamic task dispatch by scheduling contiguous chunks of tasks for single workers; and second, that Spice is able to preempt those low-overhead contiguous chunks of tasks after they're already running in order to split them adaptively. The first thing isn't actually unique to Spice, and is in fact the purpose of Rayon's parallel iterator abstraction. Given that Rayon's docs push you pretty heavily to make use of parallel iterators, ignoring them entirely makes that aspect of the comparison feel somewhat misleading to me. However, the second thing is something that Rayon genuinely lacks the machinery for, which means that parallel iterators can sometimes make suboptimal splitting decisions, and so I think that aspect of the comparison is valid.
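As a rough illustration of that first point (not taken from the benchmark; the slice, the 1,024-item floor, and the element count are made up), this is what I mean by contiguous chunks: Rayon hands each worker a contiguous sub-range of the slice, the items inside a range are processed as an ordinary sequential loop with no per-item dispatch, and `with_min_len` puts a floor on how finely the ranges may be split:

```rust
use rayon::prelude::*;

fn main() {
    let values: Vec<i64> = (0..10_000_000).collect();

    // The slice is split into contiguous index ranges, one range per job;
    // within a range the map/sum runs as a plain sequential loop.
    let total: i64 = values
        .par_iter()
        .with_min_len(1024)
        .map(|v| v * 2)
        .sum();

    assert_eq!(total, 2 * (0..10_000_000i64).sum::<i64>());
    println!("{total}");
}
```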