Description
🐛 Bug
There is a huge host-RAM overhead to using the GPU, even when processing small tensors.
Here's a standalone script:
```python
# test.py
import torch
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('size', type=int)
args = parser.parse_args()

torch.set_grad_enabled(False)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.nn.Conv2d(1, 1, 1).to(device)
x = torch.rand(1, 1, args.size, args.size).to(device)
y = model(x)
```
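As a side note, the peak RSS can also be read from inside the process via the standard `resource` module, which avoids wrapping every run in GNU time; a minimal sketch (assuming Linux, where `ru_maxrss` is reported in kilobytes):

```python
import resource

def peak_rss_kb() -> int:
    """Peak resident set size of this process, in kB.

    Note: ru_maxrss is in kilobytes on Linux but in bytes on macOS,
    so this helper assumes Linux.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(peak_rss_kb())
```

Printing this at the end of test.py gives roughly the same number as the `Maximum resident set size` line from GNU time.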
Recording using GNU time:
$ /usr/bin/time -v python test.py 100
Command being timed: "python test.py 100"
User time (seconds): 0.26
System time (seconds): 0.03
Percent of CPU this job got: 114%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.26
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1904088
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 16238
Voluntary context switches: 40
Involuntary context switches: 19
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The line to pay attention to here is `Maximum resident set size (kbytes): 1904088`: it takes roughly 2 GB of RAM simply to use the GPU to process a 100x100 image. In contrast, doing the same on the CPU:
$ CUDA_VISIBLE_DEVICES='' /usr/bin/time -v python test.py 100
Command being timed: "python test.py 100"
User time (seconds): 0.29
System time (seconds): 0.04
Percent of CPU this job got: 116%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.29
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 149352
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 16432
Voluntary context switches: 39
Involuntary context switches: 19
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
takes only ~150 MB. Using the following script, I constructed a plot of RAM usage vs image size:
```perl
# test.pl
foreach my $device ('', 0) {
    foreach (1..30) {
        $_ *= 100;
        my @outs = split /\n/, `CUDA_VISIBLE_DEVICES=$device /usr/bin/time -v python test.py $_ 2>&1`;
        foreach (@outs) { print $1 / 1024 . ",\n" if m/Maximum resident.*?(\d+)/ }
    }
}
```
Running `perl test.pl` produces 60 lines of output; the first 30 are for the CPU, the second 30 for the GPU. Plotting these yields:
The numbers produced on my machine are as follows:
# CPU
145.5234375,
145.90234375,
145.609375,
145.43359375,
145.56640625,
145.33984375,
145.51171875,
145.3359375,
146.34375,
149.0078125,
150.75,
153.23046875,
156.47265625,
159.55859375,
162.66796875,
166.2734375,
170.31640625,
173.98046875,
178.40234375,
183.2109375,
187.625,
192.75390625,
197.88671875,
202.8828125,
209.078125,
214.2578125,
220.86328125,
226.41796875,
233.5078125,
239.9375,
# GPU
1859.98828125,
1859.20703125,
1859.90234375,
1861.25,
1862.359375,
1861.1171875,
1859.54296875,
1858.77734375,
1858.9765625,
1863.28125,
1862.94921875,
1859.296875,
1860.77734375,
1861.5625,
1862.75390625,
1859.83984375,
1859.99609375,
1860.80078125,
1860.09375,
1862.703125,
1858.71875,
1858.75,
1860.671875,
1859.6875,
1859.0234375,
1858.921875,
1859.98046875,
1860.04296875,
1859.015625,
1858.77734375,
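For what it's worth, the same sweep can also be expressed in Python; a sketch assuming the `test.py` above and GNU time are available (the parsing mirrors the Perl regex):

```python
import os
import re
import subprocess

RSS_RE = re.compile(r"Maximum resident set size \(kbytes\): (\d+)")

def max_rss_mb(time_output: str) -> float:
    """Extract the peak RSS from `/usr/bin/time -v` output, in MB."""
    match = RSS_RE.search(time_output)
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    return int(match.group(1)) / 1024

def sweep(device: str):
    """Yield peak RSS in MB for sizes 100..3000 with the given
    CUDA_VISIBLE_DEVICES value ('' for CPU, '0' for GPU)."""
    for size in range(100, 3001, 100):
        result = subprocess.run(
            ["/usr/bin/time", "-v", "python", "test.py", str(size)],
            env=dict(os.environ, CUDA_VISIBLE_DEVICES=device),
            capture_output=True, text=True,
        )
        # GNU time writes its report to stderr.
        yield max_rss_mb(result.stderr)
```

`list(sweep(''))` then gives the CPU column and `list(sweep('0'))` the GPU column.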
Expected behavior
Memory usage on the GPU side should not be significantly higher than on the CPU side. Luckily, RAM usage does not grow substantially with image size (as indeed it should not), but the high startup cost is concerning, especially since this is just a 1x1 convolution with a single input and output channel.
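To pin down which call pays that startup cost, one option is to snapshot peak RSS after each step; a sketch (the checkpoint labels are made up, and `ru_maxrss` units assume Linux):

```python
import resource

def rss_mb() -> float:
    # ru_maxrss is in kB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

checkpoints = []

def checkpoint(label: str) -> None:
    """Record peak RSS so far under a human-readable label."""
    checkpoints.append((label, rss_mb()))

checkpoint("start")
# In test.py one would interleave e.g.:
#   checkpoint("after import torch")
#   checkpoint("after .to(device)")
#   checkpoint("after forward pass")
for label, mb in checkpoints:
    print(f"{label}: {mb:.1f} MB")
```

Since `ru_maxrss` is a high-water mark, the first checkpoint whose value jumps by ~1.7 GB would identify the call that initializes the CUDA context.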
Environment
PyTorch version: 1.0.0a0+ff608a9
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.12.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
Nvidia driver version: 390.87
cuDNN version: 7.0.3
Versions of relevant libraries:
numpy (1.15.2)
Additional Notes
I've also observed stranger behavior in the CPU curve, where for small images memory consumption grows exponentially up to ~2 GB, then drops and grows linearly. I'm attempting to reproduce this behavior in a small, standalone script like the one above.
cc @ngimel