Comparison with Rayon should mention ParallelIterator #5

@micahrj

Rayon has a ParallelIterator trait which is based around recursively splitting tasks into subtasks until the set of workers in the pool is saturated. ParallelIterator is fairly central to Rayon's overall design, and I would argue that the sum_rayon implementation in the benchmark here, which calls join for every single node in the tree, is really not idiomatic:

pub fn sum_rayon(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    // Calls join for every node with two children, publishing a stealable
    // task each time: one trip through the scheduler per node in the tree.
    if let (Some(left), Some(right)) = (&node.left, &node.right) {
        let (left_result, right_result) =
            pool.join(|| sum_rayon(pool, left), || sum_rayon(pool, right));
        return node.value + left_result + right_result;
    }

    // At most one child: recurse without forking.
    let mut result = node.value;
    if let Some(child) = &node.left {
        result += sum_rayon(pool, child);
    }
    if let Some(child) = &node.right {
        result += sum_rayon(pool, child);
    }
    return result;
}

Here's an alternate version that makes use of Rayon's iter::split function, which recursively divides the input until the workers in the pool are saturated and then hands each remaining indivisible piece to the rest of the pipeline. One subtlety: splitting a node passes on its two subtrees but drops the node itself, so this version carries the dropped values along to keep the total correct:

pub fn sum_split(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    use rayon::{iter, iter::ParallelIterator};

    pool.install(|| {
        // Each piece is a (carry, subtree) pair: when a node is split, its
        // own value is accumulated into the carry of the left piece.
        iter::split((0, node), |(carry, node)| {
            if let (Some(left), Some(right)) = (&node.left, &node.right) {
                ((carry + node.value, left), Some((0, right)))
            } else {
                ((carry, node), None)
            }
        })
        // `sum` is the sequential baseline implementation.
        .map(|(carry, node)| carry + sum(node))
        .reduce(|| 0, |left, right| left + right)
    })
}

When I benchmark these two implementations on a tree with 10,000,000 nodes (on an Intel Core i5-8400 CPU with 6 cores), the version that uses split is significantly faster than the version that calls join for every node, and is never slower than the baseline:

Threads   Baseline    Rayon       Rayon (split)
1         57.833 ms   230.48 ms   57.814 ms
2                     115.98 ms   30.517 ms
4                     59.531 ms   17.483 ms
8                     41.588 ms   15.079 ms
16                    41.426 ms   14.772 ms
32                    41.431 ms   14.799 ms

However, it is true that the split version still shows the same poor scaling behavior for a tree with only 1,000 nodes:

Threads   Baseline    Rayon       Rayon (split)
1         2.4961 µs   25.781 µs   3.9588 µs
2                     17.522 µs   8.5428 µs
4                     16.180 µs   12.508 µs
8                     20.689 µs   18.707 µs
16                    28.541 µs   28.069 µs
32                    50.097 µs   52.966 µs

This is because the task is eagerly split up to saturate the available workers without any regard for how small it is. Of course, there are usually good ad hoc solutions to this in the context of any particular problem: in this case, you could introduce a heuristic to make split bottom out for tasks below a certain size (see the sketch below), and IndexedParallelIterator has methods like with_min_len/with_max_len and by_uniform_blocks/by_exponential_blocks. However, it's true that Rayon doesn't offer a completely automatic solution here, and the fact that Spice does makes it genuinely different and interesting.
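For illustration, here's a hedged sketch of what such a heuristic could look like for this benchmark (my sketch, not code from the repo): a variant of sum_split in which each piece carries a split budget, so that once the budget is exhausted the subtree is summed sequentially rather than split further. The budget of 10 (at most 2^10 pieces) is an arbitrary illustrative value, not a tuned one.

pub fn sum_split_bounded(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    use rayon::{iter, iter::ParallelIterator};

    pool.install(|| {
        // Each piece is (carry, subtree, remaining split budget); once the
        // budget hits zero the piece is treated as indivisible.
        iter::split((0, node, 10u32), |(carry, node, budget)| {
            if budget > 0 {
                if let (Some(left), Some(right)) = (&node.left, &node.right) {
                    // As in sum_split above, the split node's own value is
                    // carried into the left piece rather than dropped.
                    return (
                        (carry + node.value, left, budget - 1),
                        Some((0, right, budget - 1)),
                    );
                }
            }
            ((carry, node, budget), None)
        })
        .map(|(carry, node, _)| carry + sum(node))
        .reduce(|| 0, |left, right| left + right)
    })
}

The catch is that the right budget (or a subtree-size estimate, for unbalanced trees) has to be tuned by hand against the task granularity, which is exactly the kind of manual tuning Spice aims to make unnecessary.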

Essentially, I think there are two main factors responsible for the differences in the benchmarks between Rayon and Spice: first, Spice avoids the overhead of dynamic task dispatch by scheduling contiguous chunks of tasks on single workers; and second, Spice can preempt those low-overhead contiguous chunks after they're already running in order to split them adaptively.

The first isn't actually unique to Spice; it's precisely the purpose of Rayon's parallel iterator abstraction. Given that Rayon's docs push you pretty heavily to make use of parallel iterators, ignoring them entirely makes that aspect of the comparison feel somewhat misleading to me. The second, however, is something Rayon genuinely lacks the machinery for, which means parallel iterators can sometimes make suboptimal splitting decisions, so I think that aspect of the comparison is valid.
