
CUDA OOM when running deepscaler tutorial on 1 A6000 #493

@okuchaiev

Description


Describe the bug

I was running the deepscaler tutorial (https://github.com/NVIDIA/NeMo-RL/blob/main/docs/guides/grpo-deepscaler.md) on a single A6000 with 48GB.

After some time I got:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 560.00 MiB. GPU 0 has a total capacity of 47.41 GiB of which 370.38 MiB is free. Process 21440 has 366.00 MiB memory in use. Including non-PyTorch memory, this process has 46.64 GiB memory in use. Of the allocated memory 45.80 GiB is allocated by PyTorch, and 215.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Steps/Code to reproduce bug

uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-8K.yaml

Expected behavior

No OOM.

If this tutorial does not fit on an A6000, please document the minimal system requirements on which it is expected to work.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider - AWS, Azure, GCP, Collab)]
  • Method of install: [pip install or from source]. Please specify exact commands you used to install.
  • If method of install is [Docker], provide docker pull & docker run commands used

Environment details

  • OS version: Ubuntu 22.04
    Running via uv

Additional context

GPU model: NVIDIA RTX A6000 with 48GB
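
As a possible mitigation to try before a fix lands (unverified on this setup, and it will not help if the workload genuinely needs more than 48GB), the OOM message itself suggests enabling expandable segments to reduce fragmentation. A minimal sketch of setting it before rerunning the repro command:

```shell
# Suggested by the PyTorch OOM message: allow the CUDA caching allocator
# to grow segments, which can reduce "reserved but unallocated" waste.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "PYTORCH_CUDA_ALLOC_CONF=$PYTORCH_CUDA_ALLOC_CONF"

# Then rerun the original repro command in the same shell, e.g.:
#   uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-8K.yaml
```

Note that this only changes allocator behavior; it does not lower peak memory demand, so a smaller batch/sequence configuration may still be required.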
