
OOM when trying to reproduce the grpo-deepscaler run #456

@LeonMalteW


Describe the bug

When attempting to run the second stage of GRPO on DeepScaler with max_total_sequence_length set to 16384, I encounter GPU out-of-memory (OOM) errors. The only way I can get the second stage to run is by drastically reducing the GRPO batch size to 2 and adjusting other settings, which increases training time significantly (an estimated 20x or more).
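For reference, the memory-related settings I ended up changing look roughly like the sketch below. Only max_total_sequence_length and the batch size of 2 are the actual values described above; the key names and layout are illustrative placeholders and may not match the real config schema.

```yaml
# Rough sketch of the overrides needed to avoid OOM on 8x A100-80GB.
# Key names are placeholders; only the two values come from my run.
policy:
  max_total_sequence_length: 16384   # second-stage context length
grpo:
  batch_size: 2                      # drastically reduced from the default to fit in VRAM
```

Running with a batch size this small is what drives the roughly 20x slowdown mentioned above.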

Steps/Code to reproduce bug

Expected behavior

Reproduce training similar to that described in the GRPO on DeepScaler guide.

Environment overview and details

  • Environment location: Determined AI environment
    Linux f300d81ccc7f 4.18.0-513.5.1.el8_9.x86_64 #1 SMP Fri Sep 29 05:21:10 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
    Ubuntu 22.04.4 LTS

    NVIDIA-SMI 555.42.06
    Driver Version: 555.42.06
    CUDA Version: 12.5

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2024 NVIDIA Corporation
    Built on Thu_Mar_28_02:18:24_PDT_2024
    Cuda compilation tools, release 12.4, V12.4.131
    Build cuda_12.4.r12.4/compiler.34097967_0

  • Python 3.12.10

  • Method of install: same as in Prerequisites

Additional context
8x NVIDIA A100-SXM4-80GB (same hardware as described in the GRPO on DeepScaler guide)
