Description
I am running experiments on 2×H100 GPUs. Training works fine, but I get an out-of-memory error when the checkpoint (weights) is being saved. Why is this happening? I couldn't find a solution or an explanation in the official documentation.
Log:
```
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 10.100.235.24, ID: 9526eac435204c9b603952ec58ec09efd3740c33a4cc14071cc41183) where the task (task ID: 8c35c363359aab51efe39d6b1824c2384d27ee4a01000000, name=main_task, pid=2121529, memory used=0.63GB) was running was 482.39GB / 503.52GB (0.958038), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 88da655e87896cb114534d656618471eb0f186444981f46a9079f647) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.100.235.24`. To see the logs of the worker, use `ray logs worker-88da655e87896cb114534d656618471eb0f186444981f46a9079f647*out -ip 10.100.235.24`.

Top 10 memory users:
PID      MEM(GB)  COMMAND
2122099  260.26   ray::WorkerDict.critic_save_checkpoint
2122347  191.01   ray::WorkerDict.critic_save_checkpoint
2027324  0.88     /data/.vscode-server/cli/servers/Stable-ddc367ed5c8936efe395cffeec279b04ffd7db78/server/no...
2121529  0.63     ray::main_task
389176   0.57     /data/miniconda3/envs/tools/bin/python -c from multiprocessing.spawn import spawn_main; sp...
2113840  0.55     /data/miniconda3/envs/searchr1/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server...
2027172  0.53     /data/.vscode-server/cli/servers/Stable-ddc367ed5c8936efe395cffeec279b04ffd7db78/server/no...
2026893  0.35     /data/.vscode-server/cli/servers/Stable-ddc367ed5c8936efe395cffeec279b04ffd7db78/server/no...
2113749  0.33     python3 -m verl.trainer.main_ppo data.train_files=data/medqa_search_split/train.parquet data.val_fil...
2026915  0.17     /data/.vscode-server/cli/servers/Stable-ddc367ed5c8936efe395cffeec279b04ffd7db78/server/no...

Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```
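For reference, this is how I understand the environment variables mentioned at the end of the log would be applied. A minimal sketch only, assuming Ray is started locally by the driver process (if verl calls `ray.init()` internally, exporting the same variables in the shell before launching `main_ppo` should have the same effect); the threshold value is a placeholder, not a recommendation:

```python
import os

# Must be set before Ray (and its raylet) starts; placeholder values for illustration.
os.environ["RAY_memory_usage_threshold"] = "0.98"  # raise the OOM kill threshold (log shows default 0.95)
os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # 0 disables the memory monitor's worker killing
os.environ["HYDRA_FULL_ERROR"] = "1"               # full Hydra stack trace, as the log suggests

import ray
ray.init()
```

This only changes when Ray kills the task, though; it does not reduce the ~450GB of host RAM the two `critic_save_checkpoint` workers are using, which seems to be the actual cause here.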