[cpu][flash attention] fix nan issue #130014
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130014
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 37f6864 with merge base 1e27af3. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks!
@pytorchbot merge
Merge failed. Reason: This PR needs a `release notes:` label. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`. For more information, see the PyTorch bot wiki. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes pytorch#127055. NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We fix the issue by avoiding these calculations.
Pull Request resolved: pytorch#130014
Approved by: https://github.com/jgong5, https://github.com/drisspg
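For context, here is a minimal standalone C++ sketch of why a fully masked row breaks the lazy softmax, and the kind of guard the fix implies. It is not the actual ATen kernel; the variable names and the guarded path are illustrative assumptions.

```cpp
// Minimal, self-contained illustration (not the PyTorch kernel) of the NaN source.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

int main() {
  const float kNegInf = -std::numeric_limits<float>::infinity();

  // A row of attention scores that the mask has fully disabled.
  std::vector<float> scores = {kNegInf, kNegInf, kNegInf, kNegInf};

  // The lazy softmax subtracts the running row max before exponentiating.
  float row_max = kNegInf;
  for (float s : scores) row_max = std::max(row_max, s);

  // (-inf) - (-inf) is NaN, so the exp is NaN and poisons every later step.
  std::printf("unguarded: exp(%f - %f) = %f\n", scores[0], row_max,
              std::exp(scores[0] - row_max));

  // Guarded version (the spirit of the fix): skip the exp/rescale math when
  // the row max is -inf, i.e. the block contributes nothing to the softmax.
  float sum = 0.f;
  if (row_max != kNegInf) {
    for (float s : scores) sum += std::exp(s - row_max);
  }
  std::printf("guarded: sum = %f (stays 0 instead of NaN)\n", sum);
  return 0;
}
```

In the guarded path the fully masked row simply contributes nothing; how such a row is ultimately reported is up to the kernel, so the sketch only shows where the NaN would otherwise be introduced.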
@pytorchbot cherry-pick --onto release/2.4 -c critical --fixes #127055
Fixes #127055. NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We fix the issue by avoiding these calculations.
Pull Request resolved: #130014
Approved by: https://github.com/jgong5, https://github.com/drisspg
(cherry picked from commit 868d9a4)
Cherry picking #130014: the cherry-pick PR is at #133598 and it is linked with issue #127055. The related tracker issues have been updated. Details for Dev Infra team: raised by workflow job.
[cpu][flash attention] fix nan issue (#130014)
Fixes #127055. NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We fix the issue by avoiding these calculations.
Pull Request resolved: #130014
Approved by: https://github.com/jgong5, https://github.com/drisspg
(cherry picked from commit 868d9a4)
Co-authored-by: Valentine233 <xuan.liao@intel.com>
Summary: The flash attention implementation breaks the q @ k matmul into chunks along both the source seqlen and the target seqlen (k cache) dims. Masks typically have shape [q seq len, k seq len], where k seq len == kv cache size. Imagine k seq len = 700 and a mask row like

index: 0    1    2    3    ... 515 ... 575  576  ... 697  698  699  700
mask:  -inf -inf -inf -inf ... 0   ... 0    -inf ... -inf -inf -inf -inf

What this really says is that you should attend only to the middle portion. For example, when decoding position 575 you want to attend to only the previous 60 positions and nothing before that, so positions 515 to 575 in the kv cache are the ones you care about. This is how sliding window attention can be implemented.

Now comes the interesting part. Because the flash attention implementation chunks along the k seq len dim with a chunk size of 512, the first chunk of q @ k gets an attention mask that is entirely -inf. This makes the entire chunk -inf, indicating you do not want to attend to this chunk at all. (You could honestly have avoided this calculation entirely, but maybe that is for another day.) However, as a result of computing this q @ k _and_ adding the mask, you now have a value containing all -infs. This introduces a numerics issue in flash attention if it is not carefully guarded: all subsequent softmax calculations become NaNs. Why? Because of how flash attention progressively calculates attention and makes its final adjustments in the last stage; once NaNs appear, all subsequent calculations also produce NaNs.

I found this the hard way and wondered why this was not a problem in core, from which much of this code was copied. Indeed it was, and it was fixed after this code was copied, in pytorch/pytorch#130014. With better code sharing this probably could have been avoided, but the two implementations have diverged quite a bit by now, and the ugliness in both places is irreconcilable.

Differential Revision: D73640471
Reviewed By: larryliu0820
Differential Revision: D73640471
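To make the failure mode in the chunked path concrete, here is a hedged C++ sketch of an online-softmax accumulator for one query row. The struct and function names (`RowState`, `update_row`, `v_chunk`, etc.) are illustrative assumptions, not code from either repository; the point is the early return for a chunk whose running max is still -inf.

```cpp
// Sketch of a chunked (online) softmax accumulator for a single query row.
// Illustrative only: names and layout are assumptions, not the real kernel.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct RowState {
  float running_max = -std::numeric_limits<float>::infinity();
  float running_sum = 0.f;    // sum of exp(score - running_max) seen so far
  std::vector<float> acc;     // un-normalized weighted sum of V rows
};

// Fold one chunk of already-masked scores (and matching V rows) into the state.
void update_row(RowState& st,
                const std::vector<float>& chunk_scores,
                const std::vector<std::vector<float>>& v_chunk) {
  const float kNegInf = -std::numeric_limits<float>::infinity();

  float chunk_max = kNegInf;
  for (float s : chunk_scores) chunk_max = std::max(chunk_max, s);
  float new_max = std::max(st.running_max, chunk_max);

  if (new_max == kNegInf) {
    // Every position seen so far (including this whole chunk) is masked out.
    // Without this guard the code below would evaluate exp((-inf) - (-inf))
    // and +/-inf * 0, producing NaNs that poison all later chunks.
    return;
  }

  if (st.acc.empty() && !v_chunk.empty()) {
    st.acc.assign(v_chunk[0].size(), 0.f);
  }

  // Rescale previous partial results; exp(-inf - finite) cleanly decays to 0.
  float scale = std::exp(st.running_max - new_max);
  st.running_sum *= scale;
  for (float& a : st.acc) a *= scale;

  // Accumulate the current chunk.
  for (size_t j = 0; j < chunk_scores.size(); ++j) {
    float p = std::exp(chunk_scores[j] - new_max);
    st.running_sum += p;
    for (size_t d = 0; d < v_chunk[j].size(); ++d) {
      st.acc[d] += p * v_chunk[j][d];
    }
  }
  st.running_max = new_max;
}
```

After all chunks are folded in, the output for the row is `acc / running_sum`; a row whose `running_sum` is still 0 was fully masked and can be handled explicitly instead of dividing zero by zero.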
Fixes #127055.
NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We fix the issue by avoiding these calculations.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10