
Massive initial memory overhead GPU #12873

@davidmascharka

Description

🐛 Bug

There is a huge RAM overhead for using the GPU, even when processing small tensors.

Here's a standalone script:

# test.py
import torch
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('size', type=int)  # image height/width in pixels
args = parser.parse_args()

torch.set_grad_enabled(False)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# a single 1x1 conv applied to one single-channel size x size image
model = torch.nn.Conv2d(1, 1, 1).to(device)
x = torch.rand(1, 1, args.size, args.size).to(device)
y = model(x)
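
As a side note (not part of the measurements below), the same peak-RSS figure can also be read from inside the process using Python's standard resource module; a minimal sketch, keeping in mind that ru_maxrss is reported in kilobytes on Linux:

# check_rss.py -- sketch only: print peak RSS from inside the process
import resource

import torch

torch.set_grad_enabled(False)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.nn.Conv2d(1, 1, 1).to(device)
y = model(torch.rand(1, 1, 100, 100).to(device))

# ru_maxrss is in kilobytes on Linux, matching GNU time's "Maximum resident set size"
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f'Maximum resident set size (kbytes): {peak_kb}')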

Recording memory usage with GNU time:

$ /usr/bin/time -v python test.py 100
        Command being timed: "python test.py 100"
        User time (seconds): 0.26
        System time (seconds): 0.03
        Percent of CPU this job got: 114%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.26
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1904088
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 16238
        Voluntary context switches: 40
        Involuntary context switches: 19
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

The line to pay attention to here is Maximum resident set size (kbytes): 1904088. It takes roughly 2 GB of RAM simply to use the GPU to process a 100x100 image. In contrast, doing the same on the CPU:

$ CUDA_VISIBLE_DEVICES='' /usr/bin/time -v python test.py 100
        Command being timed: "python test.py 100"
        User time (seconds): 0.29
        System time (seconds): 0.04
        Percent of CPU this job got: 116%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.29
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 149352
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 16432
        Voluntary context switches: 39
        Involuntary context switches: 19
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

takes only ~150 MB. Using the following script, I constructed a plot of RAM usage vs. image size:

# test.pl
# For image sizes 100, 200, ..., 3000, run test.py first with all GPUs hidden
# (CUDA_VISIBLE_DEVICES='') and then with GPU 0 visible, and print the peak
# resident set size in MB reported by GNU time for each run.
foreach my $device ('', 0) {
    foreach (1..30) {
        $_ *= 100;
        my @outs = split /\n/, `CUDA_VISIBLE_DEVICES=$device /usr/bin/time -v python test.py $_ 2>&1`;
        foreach (@outs) { print $1 / 1024 . ",\n" if m/Maximum resident.*?(\d+)/ }
    }
}
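
A rough Python equivalent of test.pl, for anyone who would rather not run Perl (a sketch, not the script that produced the numbers below):

# sweep.py -- sketch: same sweep as test.pl, driven through subprocess
import os
import re
import subprocess

for device in ('', '0'):  # '' hides all GPUs (CPU run), '0' exposes GPU 0
    for size in range(100, 3001, 100):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=device)
        result = subprocess.run(
            ['/usr/bin/time', '-v', 'python', 'test.py', str(size)],
            env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        )
        m = re.search(r'Maximum resident set size \(kbytes\): (\d+)',
                      result.stdout.decode())
        print(int(m.group(1)) / 1024)  # peak RSS in MB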

Running perl test.pl produces 60 lines of output; the first 30 are for the CPU, the second 30 for the GPU. Plotting these yields:

[Plot: memory use (maximum resident set size, MB) vs. image size for the CPU and GPU runs]

The numbers produced on my machine are as follows:

# CPU
145.5234375,
145.90234375,
145.609375,
145.43359375,
145.56640625,
145.33984375,
145.51171875,
145.3359375,
146.34375,
149.0078125,
150.75,
153.23046875,
156.47265625,
159.55859375,
162.66796875,
166.2734375,
170.31640625,
173.98046875,
178.40234375,
183.2109375,
187.625,
192.75390625,
197.88671875,
202.8828125,
209.078125,
214.2578125,
220.86328125,
226.41796875,
233.5078125,
239.9375,

# GPU
1859.98828125,
1859.20703125,
1859.90234375,
1861.25,
1862.359375,
1861.1171875,
1859.54296875,
1858.77734375,
1858.9765625,
1863.28125,
1862.94921875,
1859.296875,
1860.77734375,
1861.5625,
1862.75390625,
1859.83984375,
1859.99609375,
1860.80078125,
1860.09375,
1862.703125,
1858.71875,
1858.75,
1860.671875,
1859.6875,
1859.0234375,
1858.921875,
1859.98046875,
1860.04296875,
1859.015625,
1858.77734375,

Expected behavior

Host memory usage when using the GPU should not be significantly higher than when using the CPU. Luckily, RAM usage does not grow substantially with image size (as indeed it should not), but the high startup cost is concerning, especially since this is just a 1x1 conv operating on a single-channel input.
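
One way to confirm that this is a fixed startup cost rather than anything specific to the conv would be to strip the script down to a single tiny GPU allocation; a minimal sketch, assuming the overhead comes from CUDA context creation and kernel image loading rather than from the op itself:

# context_only.py -- sketch: do nothing but force CUDA initialization.
# If /usr/bin/time -v reports a similar ~1.9 GB peak RSS for this script,
# the overhead is the CUDA context/driver setup, not the conv or the tensor.
import torch

x = torch.zeros(1, device='cuda')  # the first CUDA op triggers context creation
torch.cuda.synchronize()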

Environment

PyTorch version: 1.0.0a0+ff608a9
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.12.2

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti

Nvidia driver version: 390.87
cuDNN version: 7.0.3

Versions of relevant libraries:
numpy (1.15.2)

Additional Notes

I've also observed stranger behavior in the CPU curve, where for small images memory consumption grows exponentially up to ~2 GB, then drops and grows linearly. I'm attempting to reproduce this behavior in a small, standalone script like the one above.

cc @ngimel

Labels

module: cuda, module: memory usage, triaged