OOM with a lot of memory untouched  #9578

@jakwisn

The problem

I am training a sentence classification model on a custom dataset, using a transformer pipeline based on the default config. When I start training I get:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 6.00 GiB total capacity; 1.64 GiB already allocated; 0 bytes free; 1.73 GiB reserved in total by PyTorch)

The weird things are:

  • It happens regardless of the batch size (it breaks even with size 1).
  • Sentence length makes no difference.
  • It happens regardless of the machine (I also tried a cluster with a 32GB GPU).
  • My train set has 15k sentences, but if I lower this to 12 it works properly.

Can I explicitly ask spaCy/torch to reserve more memory? There must be something wrong with memory allocation, or something else is draining the memory.
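
To make the figures in the error message above easier to read, here is a minimal sketch that parses the message and works out how much of the GPU is held outside PyTorch's allocator. The helper name `parse_oom_message` and the regular expressions are assumptions written for this one message format; they are not part of spaCy or PyTorch.

```python
import re

# Error text copied from the report above.
MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB "
    "(GPU 0; 6.00 GiB total capacity; 1.64 GiB already allocated; "
    "0 bytes free; 1.73 GiB reserved in total by PyTorch)"
)

# Conversion factors from each unit in the message to MiB.
UNIT_TO_MIB = {"bytes": 1 / 2**20, "KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}
SIZE = r"([\d.]+) (bytes|KiB|MiB|GiB)"


def parse_oom_message(message):
    """Hypothetical helper: return the sizes named in the message, in MiB."""
    patterns = {
        "requested": r"Tried to allocate " + SIZE,
        "total": SIZE + r" total capacity",
        "allocated": SIZE + r" already allocated",
        "free": SIZE + r" free",
        "reserved": SIZE + r" reserved",
    }
    sizes = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find {key!r} in message")
        value, unit = match.groups()
        sizes[key] = float(value) * UNIT_TO_MIB[unit]
    return sizes


sizes = parse_oom_message(MESSAGE)

# Memory PyTorch has reserved in its cache but is not using for tensors.
cached_unused = sizes["reserved"] - sizes["allocated"]

# Memory that is neither free nor reserved by PyTorch.
outside_pytorch = sizes["total"] - sizes["reserved"] - sizes["free"]

print(f"cached but unused by PyTorch: {cached_unused:.2f} MiB")
print(f"held outside PyTorch:         {outside_pytorch / 1024:.2f} GiB")
```

By this arithmetic, roughly 4.27 GiB of the 6 GiB card is neither free nor reserved by PyTorch, so it is held by something else (for example another process, or another library's memory pool in the same process). The message alone cannot say which.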

How to reproduce the behavior

I am running with the DEFT definition dataset, and this is my base config:

# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = "../../data/definition_data/train.spacy"
dev = "../../data/definition_data/dev.spacy"

[system]
gpu_allocator = "tensorflow"

[nlp]
lang = "en"
pipeline = ["transformer","textcat"]
batch_size = 256

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-uncased"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.textcat]
factory = "textcat"

[components.textcat.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256

[initialize]
vectors = ${paths.vectors}
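
For context on the `strided_spans.v1` settings in the config above (`window = 128`, `stride = 96`), here is a simplified sketch of how a document's tokens might be cut into overlapping windows before they reach the transformer. The function `strided_windows` is a hypothetical illustration working on token counts only; it is not the spacy-transformers implementation.

```python
def strided_windows(n_tokens, window=128, stride=96):
    """Return (start, end) token offsets of overlapping windows.

    Each window covers at most `window` tokens and starts `stride`
    tokens after the previous one, so neighbours overlap by
    `window - stride` tokens.
    """
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            # The last window already reaches the end of the document.
            break
        start += stride
    return spans


# A 300-token document is covered by three overlapping windows.
print(strided_windows(300))
# A short sentence fits in a single window.
print(strided_windows(20))
```

Under these settings, any text of up to 128 tokens yields a single window, and longer texts add one window per further 96 tokens or part thereof.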

Your Environment

  • spaCy version: 3.1.0
  • Platform: Linux-5.10.60.1-microsoft-standard-WSL2-x86_64-with-glibc2.29
  • Python version: 3.8.11
  • Pipelines: en_core_web_lg (3.1.0), en_core_web_sm (3.1.0), en_core_web_trf (3.1.0)
