fix flip() shape bug in CPU #13344
Conversation
Force-pushed from 16f0e6d to cff159c
So what is the original cause of the bug? Also, could you benchmark on larger tensors? The OMP doesn't kick in until 1000 numel.
dim_list_to_bitset(dims, total_dims);
const int64_t numel = in_tensor.numel();
auto strides = in_tensor.strides();
auto strides_v = strides.vec();
auto strides_t = at::CPU(kLong).tensorFromBlob(strides_v.data(), {static_cast<int64_t>(strides_v.size())});
@ssnl Thanks! The case I showed has numel = 10^6.
The previous bug was caused by a misuse of advanced indexing that I still can't figure out. Since a customized kernel is faster, I just used a kernel instead.
@weiyangfb Ah, you are right about the benchmark. Sorry about that!
Maybe worth looking into the advanced indexing issue further in the future. We may have a bug there.
dim_list_to_bitset(dims, total_dims); // returned bitset is not used, here only check correctness of dims
maybe_wrap_dims(flip_dims_v, total_dims);

auto sizes = in_tensor.sizes();
auto flip_dims_t = at::CPU(kLong).tensorFromBlob(flip_dims_v.data(), {static_cast<int64_t>(flip_dims_v.size())});
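For context on the `maybe_wrap_dims` call above: it normalizes each entry of `flip_dims_v` so that negative dimensions count from the end, rejecting out-of-range values. A minimal sketch of that wrapping rule, with an illustrative `wrap_dim` helper that is not part of this PR:
```
#include <cstdint>
#include <stdexcept>

// Illustrative helper (not ATen's): map a possibly negative dim into
// [0, total_dims). For example, dim = -1 with total_dims = 4 becomes 3.
int64_t wrap_dim(int64_t dim, int64_t total_dims) {
  if (dim < -total_dims || dim >= total_dims) {
    throw std::out_of_range("dim out of range");
  }
  return dim < 0 ? dim + total_dims : dim;
}
```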
Force-pushed from c9dbfe4 to 176ce23
Yeah, definitely! Can I land this PR to fix the bug on the user side first? We can keep the issue open until I figure out the root cause.
Yeah, that doesn't need to be done in this PR :)
Tensor out_tensor = at::empty_like(in_tensor);

// create contiguous strides for input tensor
Tensor stride_contiguous = at::zeros({total_dims}, kLong);
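For readers following the diff: per the comment above, `stride_contiguous` is filled with the strides a contiguous tensor of the same shape would have. A standalone sketch of that computation (plain C++ with `std::vector`, not the ATen code itself; `contiguous_strides` is an illustrative name):
```
#include <cstdint>
#include <vector>

// Contiguous (row-major) strides: each stride is the product of all
// trailing sizes. For shape {1, 3, 4, 5} this yields {60, 20, 5, 1}.
std::vector<int64_t> contiguous_strides(const std::vector<int64_t>& sizes) {
  std::vector<int64_t> strides(sizes.size(), 1);
  for (int64_t d = static_cast<int64_t>(sizes.size()) - 2; d >= 0; d--) {
    strides[d] = strides[d + 1] * sizes[d + 1];
  }
  return strides;
}
```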
void inline flip_cpu_kernel(
  const int64_t total_dims,
  const int64_t* stride_contiguous_d,
  const std::bitset<dim_bitset_size>& flip_dims_b,
int64_t temp = cur_indices;
cur_indices = cur_indices / stride_contiguous_d[d];
rem = temp - cur_indices * stride_contiguous_d[d];
if (flip_dims_b[d]) cur_indices = in_tensor.size(d) - 1 - cur_indices;
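To make the loop body above easier to follow, here is a hedged, self-contained sketch of the per-element mapping (plain C++ rather than the ATen kernel; the name `flipped_source_offset` and the vector-based signature are illustrative): the linear output index is decomposed digit by digit with the contiguous strides, the coordinate is mirrored along every flipped dimension, and the pieces are recombined with the input tensor's real strides.
```
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the index arithmetic in flip_cpu_kernel: map the i-th output
// element to the offset of the input element it should copy from.
int64_t flipped_source_offset(
    int64_t linear_index,
    const std::vector<int64_t>& sizes,              // input sizes
    const std::vector<int64_t>& strides,            // input strides
    const std::vector<int64_t>& stride_contiguous,  // see sketch above
    const std::bitset<64>& flip_dims_b) {           // dims to flip
  int64_t cur_indices = linear_index;
  int64_t offset = 0;
  for (std::size_t d = 0; d < sizes.size(); d++) {
    int64_t temp = cur_indices;
    cur_indices = cur_indices / stride_contiguous[d];  // coordinate along dim d
    int64_t rem = temp - cur_indices * stride_contiguous[d];
    if (flip_dims_b[d]) cur_indices = sizes[d] - 1 - cur_indices;  // mirror
    offset += cur_indices * strides[d];
    cur_indices = rem;  // carry the remainder to the trailing dims
  }
  return offset;
}
```
For a contiguous input, `strides` equals `stride_contiguous`, and with no flipped dims the function reduces to the identity on offsets.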
Force-pushed from 176ce23 to 5cfe431
Can I get a stamp on this? cc @ssnl
Wait... so which one should I look at?
@weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Sorry for the late review.
Summary:
- A workaround for #13292; a complete fix requires investigating the root cause of the failure when using advanced indexing.
- This PR brings the `flip()` CUDA implementation to the CPU kernel.
- With this change:
```
>>> t = torch.randn(1, 3, 4, 5)
>>> t.flip(1, 3).shape
torch.Size([1, 3, 4, 5])
```
- Performance:
```
====== with this PR ======
>>> a = torch.randn(1000, 1000)
>>> %timeit -r 100 a.flip(0, 1)
1.98 ms ± 579 µs per loop (mean ± std. dev. of 100 runs, 1000 loops each)

====== Perf at previous PR #7873 ======
100 loops, best of 3: 11 ms per loop
```
Pull Request resolved: pytorch/pytorch#13344
Differential Revision: D12968003
Pulled By: weiyangfb
fbshipit-source-id: 66f434049d143a0575a35b5c983b3e0577a1a28d