GuardWithIf is really slow on small-extent vectorized loops on architectures without predication

Two possible new tail strategies that could help:

RoundUpAndBlend would be like GuardWithIf, but would use a select instead of an if. So if GuardWithIf does something equivalent to:

f(x) = if x < extent then g(x) else dontcare

RoundUpAndBlend would do:

f(x) = select(x < extent, g(x), f(x))

I.e. it loads the vector it would store to, modifies some of the lanes, and then stores the result. This would be a race condition if there's an outer parallel loop in that dimension, so we'd have to check for that.
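To make the load/blend/store sequence concrete, here is a rough scalar emulation in plain C++. It is not Halide code and not how the lowering would actually be written. The names `g` and `round_up_and_blend` are hypothetical, and `g` is just a placeholder for the real definition.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the pure definition g(x).
inline int g(int x) { return 10 * x + 1; }

// Emulates a RoundUpAndBlend-style tail for a loop over [0, extent) that is
// vectorized by `lanes`. Every vector is loaded and stored in full, so `out`
// must hold at least extent rounded up to a multiple of `lanes`.
inline void round_up_and_blend(std::vector<int> &out, int extent, int lanes) {
    assert(lanes > 0 && extent >= 0);
    const int rounded = ((extent + lanes - 1) / lanes) * lanes;
    assert(static_cast<int>(out.size()) >= rounded);

    for (int base = 0; base < extent; base += lanes) {
        std::vector<int> vec(lanes);
        // Load the vector that is about to be overwritten.
        for (int i = 0; i < lanes; i++) vec[i] = out[base + i];
        // Blend: select(x < extent, g(x), old value), lane by lane.
        for (int i = 0; i < lanes; i++) {
            const int x = base + i;
            vec[i] = (x < extent) ? g(x) : vec[i];
        }
        // Store the whole vector unconditionally, with no branch.
        for (int i = 0; i < lanes; i++) out[base + i] = vec[i];
    }
}
```

With an 8-element buffer, `extent = 5` and `lanes = 4`, elements 5 to 7 are read and written back unchanged. Those extra reads and writes past the extent are what would race with an outer parallel loop in the same dimension.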

ShiftInwardsAndBlend would be similar, but shifting inwards instead of rounding up, so that the overall allocation bounds aren't expanded if the extent is at least one vector. It would be really useful for vectorizing pure vars in update definitions touching inputs and outputs when you expect the extent to be small.
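A similar scalar emulation for the shifted variant, again in plain C++ rather than Halide. It assumes that the lanes of the last vector which overlap the previous vector keep their loaded value instead of being recomputed, which is what would make it safe for an update like `f(x) += 1`. That is my reading of the proposal, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Emulates the update f(x) += 1 over [0, extent), vectorized by `lanes`,
// with the final vector shifted inwards and blended. Requires
// extent >= lanes, so nothing past `extent` is ever touched.
inline void shift_inwards_and_blend(std::vector<int> &f, int extent, int lanes) {
    assert(lanes > 0 && extent >= lanes);
    assert(static_cast<int>(f.size()) >= extent);

    for (int start = 0; start < extent; start += lanes) {
        // Shift the last vector inwards so it ends exactly at `extent`.
        const int base = std::min(start, extent - lanes);
        std::vector<int> vec(lanes);
        // Load the vector that is about to be overwritten.
        for (int i = 0; i < lanes; i++) vec[i] = f[base + i];
        for (int i = 0; i < lanes; i++) {
            const int x = base + i;
            // Lanes with x < start were already updated by the previous
            // vector, so they keep the loaded value.
            vec[i] = (x >= start) ? vec[i] + 1 : vec[i];
        }
        // Store the whole vector unconditionally.
        for (int i = 0; i < lanes; i++) f[base + i] = vec[i];
    }
}
```

For `extent = 6` and `lanes = 4`, the second vector covers elements 2 to 5 but only modifies 4 and 5, so every element is incremented exactly once. Without the blend, elements 2 and 3 would be incremented twice.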

Specifically, I want to use this schedule:

```
output.update().specialize(output.width() < vec);
output.update().vectorize(x, vec, TailStrategy::ShiftInwardsAndBlend);
```

GuardWithIf is really slow on small-extent vectorized loops on architectures without predication #7947
