Description
🐛 Bug
There is a huge host-RAM overhead to using the GPU, even when processing small tensors.
Here's a standalone script:
```python
# test.py
import torch
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('size', type=int)
args = parser.parse_args()

torch.set_grad_enabled(False)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.nn.Conv2d(1, 1, 1).to(device)
x = torch.rand(1, 1, args.size, args.size).to(device)
y = model(x)
```
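As a side note, the peak RSS can also be read from inside the process via the standard `resource` module, which avoids wrapping every run in GNU time; a minimal sketch (assuming Linux, where `ru_maxrss` is reported in kilobytes):

```python
import resource

def peak_rss_kb() -> int:
    """Peak resident set size of this process, in kB.

    Note: ru_maxrss is in kilobytes on Linux but in bytes on macOS,
    so this helper assumes Linux.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(peak_rss_kb())
```

Printing this at the end of test.py gives roughly the same number as the `Maximum resident set size` line from GNU time.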
Recording using GNU time:
$ /usr/bin/time -v python test.py 100
Command being timed: "python test.py 100"
User time (seconds): 0.26
System time (seconds): 0.03
Percent of CPU this job got: 114%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.26
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1904088
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 16238
Voluntary context switches: 40
Involuntary context switches: 19
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The line to pay attention to here is `Maximum resident set size (kbytes): 1904088`: it takes roughly 2 GB of RAM simply to use the GPU to process a 100x100 image. In contrast, doing the same on the CPU:
$ CUDA_VISIBLE_DEVICES='' /usr/bin/time -v python test.py 100
Command being timed: "python test.py 100"
User time (seconds): 0.29
System time (seconds): 0.04
Percent of CPU this job got: 116%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.29
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 149352
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 16432
Voluntary context switches: 39
Involuntary context switches: 19
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
takes only ~150 MB. Using the following script, I constructed a plot of RAM usage vs image size:
```perl
# test.pl
foreach my $device ('', 0) {
    foreach (1..30) {
        $_ *= 100;
        my @outs = split /\n/, `CUDA_VISIBLE_DEVICES=$device /usr/bin/time -v python test.py $_ 2>&1`;
        foreach (@outs) { print $1 / 1024 . ",\n" if m/Maximum resident.*?(\d+)/ }
    }
}
```
Running `perl test.pl` produces 60 lines of output; the first 30 are for the CPU, the second 30 for the GPU. Plotting these yields:
The numbers produced on my machine are as follows:
# CPU
145.5234375,
145.90234375,
145.609375,
145.43359375,
145.56640625,
145.33984375,
145.51171875,
145.3359375,
146.34375,
149.0078125,
150.75,
153.23046875,
156.47265625,
159.55859375,
162.66796875,
166.2734375,
170.31640625,
173.98046875,
178.40234375,
183.2109375,
187.625,
192.75390625,
197.88671875,
202.8828125,
209.078125,
214.2578125,
220.86328125,
226.41796875,
233.5078125,
239.9375,
# GPU
1859.98828125,
1859.20703125,
1859.90234375,
1861.25,
1862.359375,
1861.1171875,
1859.54296875,
1858.77734375,
1858.9765625,
1863.28125,
1862.94921875,
1859.296875,
1860.77734375,
1861.5625,
1862.75390625,
1859.83984375,
1859.99609375,
1860.80078125,
1860.09375,
1862.703125,
1858.71875,
1858.75,
1860.671875,
1859.6875,
1859.0234375,
1858.921875,
1859.98046875,
1860.04296875,
1859.015625,
1858.77734375,
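For what it's worth, the same sweep can also be expressed in Python; a sketch assuming the `test.py` above and GNU time are available (the parsing mirrors the Perl regex):

```python
import os
import re
import subprocess

RSS_RE = re.compile(r"Maximum resident set size \(kbytes\): (\d+)")

def max_rss_mb(time_output: str) -> float:
    """Extract the peak RSS from `/usr/bin/time -v` output, in MB."""
    match = RSS_RE.search(time_output)
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    return int(match.group(1)) / 1024

def sweep(device: str):
    """Yield peak RSS in MB for sizes 100..3000 with the given
    CUDA_VISIBLE_DEVICES value ('' for CPU, '0' for GPU)."""
    for size in range(100, 3001, 100):
        result = subprocess.run(
            ["/usr/bin/time", "-v", "python", "test.py", str(size)],
            env=dict(os.environ, CUDA_VISIBLE_DEVICES=device),
            capture_output=True, text=True,
        )
        # GNU time writes its report to stderr.
        yield max_rss_mb(result.stderr)
```

`list(sweep(''))` then gives the CPU column and `list(sweep('0'))` the GPU column.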
Expected behavior
Memory usage on the GPU side should not be significantly higher than on the CPU side. Luckily, RAM usage does not grow substantially with image size (as indeed it should not), but the high startup cost is concerning, especially since this is just a 1x1 convolution with a single input and output channel.
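To pin down which call pays that startup cost, one option is to snapshot peak RSS after each step; a sketch (the checkpoint labels are made up, and `ru_maxrss` units assume Linux):

```python
import resource

def rss_mb() -> float:
    # ru_maxrss is in kB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

checkpoints = []

def checkpoint(label: str) -> None:
    """Record peak RSS so far under a human-readable label."""
    checkpoints.append((label, rss_mb()))

checkpoint("start")
# In test.py one would interleave e.g.:
#   checkpoint("after import torch")
#   checkpoint("after .to(device)")
#   checkpoint("after forward pass")
for label, mb in checkpoints:
    print(f"{label}: {mb:.1f} MB")
```

Since `ru_maxrss` is a high-water mark, the first checkpoint whose value jumps by ~1.7 GB would identify the call that initializes the CUDA context.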
Environment
PyTorch version: 1.0.0a0+ff608a9
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.12.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
Nvidia driver version: 390.87
cuDNN version: 7.0.3
Versions of relevant libraries:
numpy (1.15.2)
Additional Notes
I've also observed stranger behavior in the CPU curve, where for small images memory consumption grows exponentially up to ~2 GB, then drops and grows linearly. I'm attempting to reproduce this behavior in a small, standalone script like the one above.
cc @ngimel