
OutOfMemoryError #1335

@kyre-99

Description

I am running experiments on 2×H100 GPUs. Training itself works fine, but I get an out-of-memory error when the checkpoint is being saved. Why is this happening? I couldn't find the cause or a solution in the official documentation.

Log:
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 10.100.235.24, ID: 9526eac435204c9b603952ec58ec09efd3740c33a4cc14071cc41183) where the task (task ID: 8c35c363359aab51efe39d6b1824c2384d27ee4a01000000, name=main_task, pid=2121529, memory used=0.63GB) was running was 482.39GB / 503.52GB (0.958038), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 88da655e87896cb114534d656618471eb0f186444981f46a9079f647) because it was the most recently scheduled task; to see more information about memory usage on this node, use ray logs raylet.out -ip 10.100.235.24. To see the logs of the worker, use ray logs worker-88da655e87896cb114534d656618471eb0f186444981f46a9079f647*out -ip 10.100.235.24.

Top 10 memory users:
PID      MEM(GB)  COMMAND
2122099  260.26   ray::WorkerDict.critic_save_checkpoint
2122347  191.01   ray::WorkerDict.critic_save_checkpoint
2027324    0.88   /data/.vscode-server/cli/servers/Stable-ddc367ed5c8936efe395cffeec279b04ffd7db78/server/no...
2121529    0.63   ray::main_task
389176     0.57   /data/miniconda3/envs/tools/bin/python -c from multiprocessing.spawn import spawn_main; sp...
2113840    0.55   /data/miniconda3/envs/searchr1/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server...
2027172    0.53   /data/.vscode-server/cli/servers/Stable-ddc367ed5c8936efe395cffeec279b04ffd7db78/server/no...
2026893    0.35   /data/.vscode-server/cli/servers/Stable-ddc367ed5c8936efe395cffeec279b04ffd7db78/server/no...
2113749    0.33   python3 -m verl.trainer.main_ppo data.train_files=data/medqa_search_split/train.parquet data.val_fil...
2026915    0.17   /data/.vscode-server/cli/servers/Stable-ddc367ed5c8936efe395cffeec279b04ffd7db78/server/no...

Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable RAY_memory_usage_threshold when starting Ray. To disable worker killing, set the environment variable RAY_memory_monitor_refresh_ms to zero.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
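
For reference, the tail of the Ray message points at the memory-monitor settings. A minimal sketch of what that would look like with the same launch command is below; the 0.99 threshold is only an illustrative placeholder (not a fix for the checkpoint memory spike itself), and the remaining Hydra overrides from my run are omitted.

```bash
# Sketch only, per the suggestion in the Ray OOM message above.
# 0.99 is an illustrative value; the default kill threshold is 0.95.
export RAY_memory_usage_threshold=0.99
# export RAY_memory_monitor_refresh_ms=0   # or disable worker killing entirely

# Then relaunch with the existing entry point (other Hydra overrides unchanged):
python3 -m verl.trainer.main_ppo data.train_files=data/medqa_search_split/train.parquet
```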
