Rayon has a `ParallelIterator` trait which is based around recursively splitting tasks into subtasks until the set of workers in the pool is saturated. `ParallelIterator` is fairly central to Rayon's overall design, and I would argue that the `sum_rayon` implementation in the benchmark here, which calls `join` for every single node in the tree, is really not idiomatic:
```rust
pub fn sum_rayon(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    if let (Some(left), Some(right)) = (&node.left, &node.right) {
        let (left_result, right_result) =
            pool.join(|| sum_rayon(pool, left), || sum_rayon(pool, right));
        return node.value + left_result + right_result;
    }
    let mut result = node.value;
    if let Some(child) = &node.left {
        result += sum_rayon(pool, child);
    }
    if let Some(child) = &node.right {
        result += sum_rayon(pool, child);
    }
    return result;
}
```
Here's an alternate version that makes use of Rayon's `iter::split` function:
```rust
pub fn sum_split(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    use rayon::{iter, iter::ParallelIterator};
    pool.install(|| {
        iter::split(node, |node| {
            if let (Some(left), Some(right)) = (&node.left, &node.right) {
                (left, Some(right))
            } else {
                (node, None)
            }
        })
        .map(sum)
        .reduce(|| 0, |left, right| left + right)
    })
}
```
When I benchmark these two implementations on a tree with 10,000,000 nodes (on an Intel Core i5-8400 CPU with 6 cores), the version that uses `split` is significantly faster than the version that calls `join` for every node, and is never slower than the baseline:
| Threads | Baseline | Rayon | Rayon (split) |
|---|---|---|---|
| 1 | 57.833 ms | 230.48 ms | 57.814 ms |
| 2 | | 115.98 ms | 30.517 ms |
| 4 | | 59.531 ms | 17.483 ms |
| 8 | | 41.588 ms | 15.079 ms |
| 16 | | 41.426 ms | 14.772 ms |
| 32 | | 41.431 ms | 14.799 ms |
However, it is true that the `split` version still shows the same poor scaling behavior for a tree with only 1,000 nodes:
| Threads | Baseline | Rayon | Rayon (split) |
|---|---|---|---|
| 1 | 2.4961 µs | 25.781 µs | 3.9588 µs |
| 2 | | 17.522 µs | 8.5428 µs |
| 4 | | 16.180 µs | 12.508 µs |
| 8 | | 20.689 µs | 18.707 µs |
| 16 | | 28.541 µs | 28.069 µs |
| 32 | | 50.097 µs | 52.966 µs |
This is because the task is being eagerly split up to saturate the available workers without any regard for how small it is. Of course, there are usually good ad hoc solutions to this in the context of any particular problem: in this case, you could introduce a heuristic to make `split` bottom out for tasks below a certain size, and `IndexedParallelIterator` has methods like `with_min_len`/`with_max_len` and `by_uniform_blocks`/`by_exponential_blocks`. However, it's true that Rayon doesn't offer any completely automatic solution here, and the fact that Spice does makes it genuinely different and interesting.
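To make that concrete, here's a sketch of what the bottom-out heuristic could look like with `iter::split`. To keep it self-contained I've written stand-in definitions for `Node` and the sequential `sum` (assuming `Box`-ed children, which may not match the benchmark exactly), and since a bare `&Node` doesn't know its own subtree size, a remaining split budget serves as a crude stand-in for a size estimate; the budget of 12 is an arbitrary choice, not a tuned value:

```rust
use rayon::{iter, iter::ParallelIterator};

// Stand-ins for the benchmark's own type and sequential baseline.
pub struct Node<T> {
    pub value: T,
    pub left: Option<Box<Node<T>>>,
    pub right: Option<Box<Node<T>>>,
}

fn sum(node: &Node<i64>) -> i64 {
    let mut total = node.value;
    if let Some(left) = &node.left {
        total += sum(left);
    }
    if let Some(right) = &node.right {
        total += sum(right);
    }
    total
}

// Each piece carries (remaining split budget, value carried over from nodes
// that were split, subtree). Splitting stops once the budget reaches zero, so
// at most 2^12 pieces are produced here.
pub fn sum_split_bounded(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    pool.install(|| {
        iter::split((12u32, 0i64, node), |(budget, carry, node)| {
            match (node.left.as_deref(), node.right.as_deref()) {
                (Some(left), Some(right)) if budget > 0 => {
                    // Fold the split node's own value into the left piece so
                    // it still contributes to the total.
                    ((budget - 1, carry + node.value, left), Some((budget - 1, 0, right)))
                }
                _ => ((budget, carry, node), None),
            }
        })
        .map(|(_, carry, node)| carry + sum(node))
        .reduce(|| 0, |left, right| left + right)
    })
}
```

Whether a cap like this actually helps for the 1,000-node tree would of course need to be measured; the point is just that it has to be added by hand.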
Essentially, I think that there are two main factors responsible for the differences in the benchmarks between Rayon and Spice: first, that Spice is able to avoid the overhead of dynamic task dispatch by scheduling contiguous chunks of tasks for single workers; and second, that Spice is able to preempt those low-overhead contiguous chunks of tasks after they're already running in order to split them adaptively. The first thing isn't actually unique to Spice, and is in fact the purpose of Rayon's parallel iterator abstraction. Given that Rayon's docs push you pretty heavily to make use of parallel iterators, ignoring them entirely makes that aspect of the comparison feel somewhat misleading to me. However, the second thing is something that Rayon genuinely lacks the machinery for, which means that parallel iterators can sometimes make suboptimal splitting decisions, and so I think that aspect of the comparison is valid.
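As a rough illustration of that first point (not taken from the benchmark; the slice, the 1,024-item floor, and the element count are made up), this is what I mean by contiguous chunks: Rayon hands each worker a contiguous sub-range of the slice, the items inside a range are processed as an ordinary sequential loop with no per-item dispatch, and `with_min_len` puts a floor on how finely the ranges may be split:

```rust
use rayon::prelude::*;

fn main() {
    let values: Vec<i64> = (0..10_000_000).collect();

    // The slice is split into contiguous index ranges, one range per job;
    // within a range the map/sum runs as a plain sequential loop.
    let total: i64 = values
        .par_iter()
        .with_min_len(1024)
        .map(|v| v * 2)
        .sum();

    assert_eq!(total, 2 * (0..10_000_000i64).sum::<i64>());
    println!("{total}");
}
```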