Description
I am running experiments on 2×H100 GPUs. Training works fine, but I get an out-of-memory error when the checkpoint (weights) is being saved. Why is this happening? I couldn't find a solution or an explanation in the official documentation.
Log:
```
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 10.100.235.24, ID: 9526eac435204c9b603952ec58ec09efd3740c33a4cc14071cc41183) where the task (task ID: 8c35c363359aab51efe39d6b1824c2384d27ee4a01000000, name=main_task, pid=2121529, memory used=0.63GB) was running was 482.39GB / 503.52GB (0.958038), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 88da655e87896cb114534d656618471eb0f186444981f46a9079f647) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.100.235.24`. To see the logs of the worker, use `ray logs worker-88da655e87896cb114534d656618471eb0f186444981f46a9079f647*out -ip 10.100.235.24`.

Top 10 memory users:
PID      MEM(GB)  COMMAND
2122099  260.26   ray::WorkerDict.critic_save_checkpoint
2122347  191.01   ray::WorkerDict.critic_save_checkpoint
2027324  0.88     /data/.vscode-server/cli/servers/Stable-ddc367ed5c8936efe395cffeec279b04ffd7db78/server/no...
2121529  0.63     ray::main_task
389176   0.57     /data/miniconda3/envs/tools/bin/python -c from multiprocessing.spawn import spawn_main; sp...
2113840  0.55     /data/miniconda3/envs/searchr1/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server...
2027172  0.53     /data/.vscode-server/cli/servers/Stable-ddc367ed5c8936efe395cffeec279b04ffd7db78/server/no...
2026893  0.35     /data/.vscode-server/cli/servers/Stable-ddc367ed5c8936efe395cffeec279b04ffd7db78/server/no...
2113749  0.33     python3 -m verl.trainer.main_ppo data.train_files=data/medqa_search_split/train.parquet data.val_fil...
2026915  0.17     /data/.vscode-server/cli/servers/Stable-ddc367ed5c8936efe395cffeec279b04ffd7db78/server/no...

Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```
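For reference, this is how I understand the environment variables mentioned at the end of the log would be applied. A minimal sketch only, assuming Ray is started locally by the driver process (if verl calls `ray.init()` internally, exporting the same variables in the shell before launching `main_ppo` should have the same effect); the threshold value is a placeholder, not a recommendation:

```python
import os

# Must be set before Ray (and its raylet) starts; placeholder values for illustration.
os.environ["RAY_memory_usage_threshold"] = "0.98"  # raise the OOM kill threshold (log shows default 0.95)
os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # 0 disables the memory monitor's worker killing
os.environ["HYDRA_FULL_ERROR"] = "1"               # full Hydra stack trace, as the log suggests

import ray
ray.init()
```

This only changes when Ray kills the task, though; it does not reduce the ~450GB of host RAM the two `critic_save_checkpoint` workers are using, which seems to be the actual cause here.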