Replies: 2 comments
- Where are the cutlass calls coming from? The current code seems to only use cublas.
- amazing work, love this project so much!
[April 22, 2024]
I will post here once in a while on where the code is, focusing especially on the mainline CUDA code. These results can be calculated by running

python profile_gpt2cu.py

(if you get a crash, add sudo).

runtime, DRAM traffic, instructions:
We are spending 76% of the time in NVIDIA cutlass kernels, which is encouraging. This was run on an A10. On my A100 we are currently at ~73 ms/iteration. The PyTorch comparison (fp32, no flash attention, slightly stale PyTorch) is 78.2 ms/iteration, so we are ~6.4% faster than PyTorch in this constrained setting.
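To make that 76% concrete: it is (presumably) the fraction of total GPU kernel time attributed to kernels whose names contain "cutlass". A minimal sketch of that aggregation, assuming a per-kernel CSV export with hypothetical kernel_name / duration_ns columns (not necessarily what profile_gpt2cu.py actually emits):

```python
# Illustrative only: aggregate per-kernel GPU time from a CSV export and report
# the share spent in cutlass kernels. Column names and the file path are
# hypothetical; the real profiler output may look different.
import csv
from collections import defaultdict

def kernel_time_shares(csv_path):
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["kernel_name"]] += float(row["duration_ns"])
    grand_total = sum(totals.values())
    return {name: t / grand_total for name, t in totals.items()}

shares = kernel_time_shares("profile.csv")  # hypothetical export path
cutlass_share = sum(s for name, s in shares.items() if "cutlass" in name)
print(f"time in cutlass kernels: {cutlass_share:.1%}")
```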
peak memory
On nvidia-smi we see a nice and constant 8753 MiB; this was heavily optimized by @ngc92. In comparison, the current PyTorch code comes up to 12879 MiB, so we are ~32% lower. To reproduce, run like:
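A minimal sketch of one way to capture the peak, assuming the train_gpt2cu binary has been built (e.g. via make train_gpt2cu) and polling nvidia-smi from Python; the binary name, GPU index, and sampling interval are assumptions, not the project's exact reproduction command:

```python
# Illustrative only: poll nvidia-smi for GPU memory while training runs and
# report the peak observed value. Binary path and GPU index are assumptions.
import subprocess
import time

def gpu_memory_used_mib(gpu_index=0):
    out = subprocess.check_output([
        "nvidia-smi", f"--id={gpu_index}",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip())

train = subprocess.Popen(["./train_gpt2cu"])  # assumed binary name
peak = 0
while train.poll() is None:        # while training is still running
    peak = max(peak, gpu_memory_used_mib())
    time.sleep(0.5)                # sample twice per second
print(f"peak GPU memory: {peak} MiB")
```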
lines of code
train_gpt2.cu is at 2097 lines of clean code.

latency
nvcc compile latency: 2.4 s
run latency (ENTER to first step): 2.2 s
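For reference, the compile number can be reproduced by putting a timer around the build; the nvcc flags below are an assumption for illustration, not necessarily the project's exact build line:

```python
# Illustrative only: time an nvcc compile of train_gpt2.cu.
# Flags are assumed; check the project's Makefile for the real build line.
import subprocess
import time

t0 = time.time()
subprocess.run(
    ["nvcc", "-O3", "--use_fast_math", "train_gpt2.cu",
     "-lcublas", "-lcublasLt", "-o", "train_gpt2cu"],
    check=True,
)
print(f"nvcc compile latency: {time.time() - t0:.1f} s")
```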
"big stones" ongoing work:
major merged improvements last few days:
first notable forks appearing