When running PPO with verl, the following problem occurred:
2025-03-04 19:27:33,171 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.0.1:6379...
2025-03-04 19:27:33,188 INFO worker.py:1841 -- Connected to Ray cluster.
(main_task pid=26626) {'actor_rollout_ref': {'actor': {'clip_ratio': 0.2,
(main_task pid=26626) 'entropy_coeff': 0.001,
(main_task pid=26626) 'fsdp_config': {'fsdp_size': -1,
(main_task pid=26626) 'optimizer_offload': False,
(main_task pid=26626) 'param_offload': False,
(main_task pid=26626) 'wrap_policy': {'min_num_params': 0}},
(main_task pid=26626) 'grad_clip': 1.0,
(main_task pid=26626) 'kl_loss_coef': 0.001,
(main_task pid=26626) 'kl_loss_type': 'low_var_kl',
(main_task pid=26626) 'optim': {'lr': 1e-06,
(main_task pid=26626) 'lr_warmup_steps_ratio': 0.0,
(main_task pid=26626) 'min_lr_ratio': None,
(main_task pid=26626) 'total_training_steps': -1,
(main_task pid=26626) 'warmup_style': 'constant'},
(main_task pid=26626) 'ppo_epochs': 1,
(main_task pid=26626) 'ppo_max_token_len_per_gpu': 16384,
(main_task pid=26626) 'ppo_micro_batch_size': None,
(main_task pid=26626) 'ppo_micro_batch_size_per_gpu': 2,
(main_task pid=26626) 'ppo_mini_batch_size': 2,
(main_task pid=26626) 'shuffle': False,
(main_task pid=26626) 'strategy': 'fsdp',
(main_task pid=26626) 'ulysses_sequence_parallel_size': 1,
(main_task pid=26626) 'use_dynamic_bsz': False,
(main_task pid=26626) 'use_kl_loss': True},
(main_task pid=26626) 'hybrid_engine': True,
(main_task pid=26626) 'model': {'enable_gradient_checkpointing': True,
(main_task pid=26626) 'external_lib': None,
(main_task pid=26626) 'override_config': {},
(main_task pid=26626) 'path': '/fs/archive/share/Qwen/Qwen2___5-0___5B-Instruct',
(main_task pid=26626) 'use_remove_padding': True},
(main_task pid=26626) 'ref': {'fsdp_config': {'param_offload': True,
(main_task pid=26626) 'wrap_policy': {'min_num_params': 0}},
(main_task pid=26626) 'log_prob_max_token_len_per_gpu': 16384,
(main_task pid=26626) 'log_prob_micro_batch_size': None,
(main_task pid=26626) 'log_prob_micro_batch_size_per_gpu': 2,
(main_task pid=26626) 'log_prob_use_dynamic_bsz': False,
(main_task pid=26626) 'ulysses_sequence_parallel_size': 1},
(main_task pid=26626) 'rollout': {'disable_log_stats': True,
(main_task pid=26626) 'do_sample': True,
(main_task pid=26626) 'dtype': 'bfloat16',
(main_task pid=26626) 'enable_chunked_prefill': True,
(main_task pid=26626) 'enforce_eager': True,
(main_task pid=26626) 'free_cache_engine': True,
(main_task pid=26626) 'gpu_memory_utilization': 0.6,
(main_task pid=26626) 'ignore_eos': False,
(main_task pid=26626) 'load_format': 'dummy_dtensor',
(main_task pid=26626) 'log_prob_max_token_len_per_gpu': 16384,
(main_task pid=26626) 'log_prob_micro_batch_size': None,
(main_task pid=26626) 'log_prob_micro_batch_size_per_gpu': 2,
(main_task pid=26626) 'log_prob_use_dynamic_bsz': False,
(main_task pid=26626) 'max_model_len': None,
(main_task pid=26626) 'max_num_batched_tokens': 8192,
(main_task pid=26626) 'max_num_seqs': 1024,
(main_task pid=26626) 'n': 5,
(main_task pid=26626) 'name': 'vllm',
(main_task pid=26626) 'prompt_length': 512,
(main_task pid=26626) 'response_length': 1024,
(main_task pid=26626) 'temperature': 1.0,
(main_task pid=26626) 'tensor_model_parallel_size': 1,
(main_task pid=26626) 'top_k': -1,
(main_task pid=26626) 'top_p': 1,
(main_task pid=26626) 'use_fire_sampling': False}},
(main_task pid=26626) 'algorithm': {'adv_estimator': 'grpo',
(main_task pid=26626) 'gamma': 1.0,
(main_task pid=26626) 'kl_ctrl': {'kl_coef': 0.001, 'type': 'fixed'},
(main_task pid=26626) 'kl_penalty': 'kl',
(main_task pid=26626) 'lam': 1.0},
(main_task pid=26626) 'critic': {'cliprange_value': 0.5,
(main_task pid=26626) 'forward_max_token_len_per_gpu': 32768,
(main_task pid=26626) 'forward_micro_batch_size': None,
(main_task pid=26626) 'forward_micro_batch_size_per_gpu': 4,
(main_task pid=26626) 'grad_clip': 1.0,
(main_task pid=26626) 'model': {'enable_gradient_checkpointing': True,
(main_task pid=26626) 'external_lib': None,
(main_task pid=26626) 'fsdp_config': {'fsdp_size': -1,
(main_task pid=26626) 'optimizer_offload': False,
(main_task pid=26626) 'param_offload': False,
(main_task pid=26626) 'wrap_policy': {'min_num_params': 0}},
(main_task pid=26626) 'override_config': {},
(main_task pid=26626) 'path': '/fs/archive/share/Qwen/Qwen2___5-0___5B-Instruct',
(main_task pid=26626) 'tokenizer_path': '/fs/archive/share/Qwen/Qwen2___5-0___5B-Instruct',
(main_task pid=26626) 'use_remove_padding': False},
(main_task pid=26626) 'optim': {'lr': 1e-05,
(main_task pid=26626) 'lr_warmup_steps_ratio': 0.0,
(main_task pid=26626) 'min_lr_ratio': None,
(main_task pid=26626) 'total_training_steps': -1,
(main_task pid=26626) 'warmup_style': 'constant'},
(main_task pid=26626) 'ppo_epochs': 1,
(main_task pid=26626) 'ppo_max_token_len_per_gpu': 32768,
(main_task pid=26626) 'ppo_micro_batch_size': None,
(main_task pid=26626) 'ppo_micro_batch_size_per_gpu': 4,
(WorkerDict pid=27499) [2025-03-04 19:28:29,936 E 27499 27499] logging.cc:113: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
(WorkerDict pid=27499) [2025-03-04 19:28:29,993 E 27499 27499] logging.cc:120: Stack trace:
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13aa8ea) [0x147f239018ea] ray::operator<<()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13ae022) [0x147f23905022] ray::TerminateHandler()
(WorkerDict pid=27499) /fs/archive/share/verl_env/bin/../lib/libstdc++.so.6(+0xb135a) [0x147f221e135a] __cxxabiv1::__terminate()
(WorkerDict pid=27499) /fs/archive/share/verl_env/bin/../lib/libstdc++.so.6(+0xb13c5) [0x147f221e13c5]
(WorkerDict pid=27499) /fs/archive/share/verl_env/bin/../lib/libstdc++.so.6(+0xb1658) [0x147f221e1658]
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x6eb4a0) [0x147f22c424a0] boost::throw_exception<>()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13ba3cb) [0x147f239113cb] boost::asio::detail::do_throw_error()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13badeb) [0x147f23911deb] boost::asio::detail::posix_thread::start_thread()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13bb24c) [0x147f2391224c] boost::asio::thread_pool::thread_pool()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0xccd864) [0x147f23224864] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x147f232248f9] ray::rpc::GetServerCallExecutor()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_15PushTaskRequestENS6_13PushTaskReplyELNS6_8AuthTypeE0EE17HandleRequestImplEbEUlS1_S4_S4_E0_E9_M_invokeERKSt9_Any_dataOS1_OS4_SJ_+0x12b) [0x147f22ea5c4b] std::_Function_handler<>::_M_invoke()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x9976f6) [0x147f22eee6f6] ray::core::TaskReceiver::HandleTask()::{lambda()#1}::operator()()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x99870a) [0x147f22eef70a] std::_Function_handler<>::_M_invoke()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x99faa2) [0x147f22ef6aa2] ray::core::InboundRequest::Accept()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x9bcdfd) [0x147f22f13dfd] ray::core::NormalSchedulingQueue::ScheduleRequests()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0xd0cd38) [0x147f23263d38] EventTracker::RecordExecution()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0xcf116e) [0x147f2324816e] std::_Function_handler<>::_M_invoke()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0xcf15e6) [0x147f232485e6] boost::asio::detail::completion_handler<>::do_complete()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13b7a5b) [0x147f2390ea5b] boost::asio::detail::scheduler::do_run_one()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13b93d9) [0x147f239103d9] boost::asio::detail::scheduler::run()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13b9ae2) [0x147f23910ae2] boost::asio::io_context::run()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0x117) [0x147f22e4d4b7] ray::core::CoreWorker::RunTaskExecutionLoop()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x41) [0x147f22ef3221] ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv+0x1d) [0x147f22ef343d] ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
(WorkerDict pid=27499) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x74ab81) [0x147f22ca1b81] __pyx_pw_3ray_7_raylet_10CoreWorker_5run_task_loop()
(WorkerDict pid=27499) ray::WorkerDict(_PyEval_EvalFrameDefault+0x68f) [0x4e808f] _PyEval_EvalFrameDefault
(WorkerDict pid=27499) ray::WorkerDict() [0x4f81d3] function_code_fastcall
(WorkerDict pid=27499) ray::WorkerDict(_PyEval_EvalFrameDefault+0x68f) [0x4e808f] _PyEval_EvalFrameDefault
(WorkerDict pid=27499) ray::WorkerDict() [0x4e6afa] _PyEval_EvalCode
(WorkerDict pid=27499) ray::WorkerDict(_PyEval_EvalCodeWithName+0x47) [0x4e6787] _PyEval_EvalCodeWithName
(WorkerDict pid=27499) ray::WorkerDict(PyEval_EvalCodeEx+0x39) [0x4e6739] PyEval_EvalCodeEx
(WorkerDict pid=27499) ray::WorkerDict(PyEval_EvalCode+0x1b) [0x5942bb] PyEval_EvalCode
(WorkerDict pid=27499) ray::WorkerDict() [0x5c1777] run_eval_code_obj
(WorkerDict pid=27499) ray::WorkerDict() [0x5bd780] run_mod
(WorkerDict pid=27499) ray::WorkerDict() [0x456695] pyrun_file.cold
(WorkerDict pid=27499) ray::WorkerDict(PyRun_SimpleFileExFlags+0x1a2) [0x5b7462] PyRun_SimpleFileExFlags
(WorkerDict pid=27499) ray::WorkerDict(Py_RunMain+0x37e) [0x5b49de] Py_RunMain
(WorkerDict pid=27499) ray::WorkerDict(Py_BytesMain+0x39) [0x588369] Py_BytesMain
(WorkerDict pid=27499) /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x147f244b2555] __libc_start_main
(WorkerDict pid=27499) ray::WorkerDict() [0x58821e]
(WorkerDict pid=27499)
(WorkerDict pid=27499) *** SIGABRT received at time=1741087709 on cpu 41 ***
(WorkerDict pid=27499) PC: @ 0x147f244c6387 (unknown) raise
(WorkerDict pid=27499) @ 0x147f24f76630 (unknown) (unknown)
(WorkerDict pid=27499) @ 0x147f221e135a (unknown) __cxxabiv1::__terminate()
(WorkerDict pid=27499) @ 0x147f221e1580 (unknown) (unknown)
(WorkerDict pid=27499) [2025-03-04 19:28:29,993 E 27499 27499] logging.cc:484: *** SIGABRT received at time=1741087709 on cpu 41 ***
(WorkerDict pid=27499) [2025-03-04 19:28:29,993 E 27499 27499] logging.cc:484: PC: @ 0x147f244c6387 (unknown) raise
(WorkerDict pid=27499) [2025-03-04 19:28:29,994 E 27499 27499] logging.cc:484: @ 0x147f24f76630 (unknown) (unknown)
(WorkerDict pid=27499) [2025-03-04 19:28:29,994 E 27499 27499] logging.cc:484: @ 0x147f221e135a (unknown) __cxxabiv1::__terminate()
(WorkerDict pid=27499) [2025-03-04 19:28:29,994 E 27499 27499] logging.cc:484: @ 0x147f221e1580 (unknown) (unknown)
(WorkerDict pid=27499) Fatal Python error: Aborted
(WorkerDict pid=27499)
(WorkerDict pid=27499) Stack (most recent call first):
(WorkerDict pid=27499) File "/fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_private/worker.py", line 935 in main_loop
(WorkerDict pid=27499) File "/fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_private/workers/default_worker.py", line 297 in <module>
(WorkerDict pid=27497) /fs/archive/share/verl_env/bin/../lib/libstdc++.so.6(+0xb135a) [0x14dd449c935a] __cxxabiv1::__terminate()
(WorkerDict pid=27497) /fs/archive/share/verl_env/bin/../lib/libstdc++.so.6(+0xb13c5) [0x14dd449c93c5]
(WorkerDict pid=27497) /fs/archive/share/verl_env/bin/../lib/libstdc++.so.6(+0xb1658) [0x14dd449c9658]
(WorkerDict pid=27497) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x997f29) [0x14dd456d6f29] ray::core::TaskReceiver::HandleTask()::{lambda()#1}::operator()()
(WorkerDict pid=27497) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x74ab81) [0x14dd45489b81] __pyx_pw_3ray_7_raylet_10CoreWorker_5run_task_loop()
(WorkerDict pid=27497)
(WorkerDict pid=27497)
(WorkerDict pid=27110) /fs/archive/share/verl_env/bin/../lib/libstdc++.so.6(+0xb135a) [0x14e6918f435a] __cxxabiv1::__terminate()
(WorkerDict pid=27110) /fs/archive/share/verl_env/bin/../lib/libstdc++.so.6(+0xb13c5) [0x14e6918f43c5]
(WorkerDict pid=27110) /fs/archive/share/verl_env/bin/../lib/libstdc++.so.6(+0xb1658) [0x14e6918f4658]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x8e8cab) [0x14e692552cab] ray::core::CoreWorker::HandleWaitForActorRefDeleted()::{lambda()#1}::operator()()
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray4core16ReferenceCounter14EraseReferenceEN4absl12lts_2023080218container_internal12raw_hash_setINS4_17FlatHashMapPolicyINS_8ObjectIDENS1_9ReferenceEEENS3_13hash_internal4HashIS7_EESt8equal_toIS7_ESaISt4pairIKS7_S8_EEE8iteratorE+0x333) [0x14e69267d983] ray::core::ReferenceCounter::EraseReference()
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray4core16ReferenceCounter23DeleteReferenceInternalEN4absl12lts_2023080218container_internal12raw_hash_setINS4_17FlatHashMapPolicyINS_8ObjectIDENS1_9ReferenceEEENS3_13hash_internal4HashIS7_EESt8equal_toIS7_ESaISt4pairIKS7_S8_EEE8iteratorEPSt6vectorIS7_SaIS7_EE+0x526) [0x14e69267e726] ray::core::ReferenceCounter::DeleteReferenceInternal()
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0xa155e1) [0x14e69267f5e1] ray::core::ReferenceCounter::ReleaseAllLocalReferences()
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker4ExitENS_3rpc14WorkerExitTypeERKSsRKSt10shared_ptrINS_17LocalMemoryBufferEE+0x232) [0x14e692562d22] ray::core::CoreWorker::Exit()
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker11ExecuteTaskERKNS_17TaskSpecificationESt8optionalISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS_8ObjectIDESt10shared_ptrINS_9RayObjectEEESaISP_EESS_PS7_IS8_ISL_bESaIST_EEPN6google8protobuf16RepeatedPtrFieldINS_3rpc20ObjectReferenceCountEEEPbPSs+0x1c71) [0x14e6925f9441] ray::core::CoreWorker::ExecuteTask()
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x997278) [0x14e692601278] ray::core::TaskReceiver::HandleTask()::{lambda()#1}::operator()()
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x74ab81) [0x14e6923b4b81] __pyx_pw_3ray_7_raylet_10CoreWorker_5run_task_loop()
(WorkerDict pid=27110)
(WorkerDict pid=27110)
(main_task pid=26626) 'ppo_mini_batch_size': 2,
(main_task pid=26626) 'shuffle': False,
(main_task pid=26626) 'strategy': 'fsdp',
(main_task pid=26626) 'ulysses_sequence_parallel_size': 1,
(main_task pid=26626) 'use_dynamic_bsz': False},
(main_task pid=26626) 'data': {'image_key': 'images',
(main_task pid=26626) 'max_prompt_length': 512,
(main_task pid=26626) 'max_response_length': 1024,
(main_task pid=26626) 'prompt_key': 'question',
(main_task pid=26626) 'return_raw_chat': False,
(main_task pid=26626) 'return_raw_input_ids': False,
(main_task pid=26626) 'shuffle': True,
(main_task pid=26626) 'tokenizer': None,
(main_task pid=26626) 'train_batch_size': 4,
(main_task pid=26626) 'train_files': '/home/u2024001049/Reasoning/data/gsm8k/train-00000-of-00001.parquet',
(main_task pid=26626) 'val_batch_size': 4,
(main_task pid=26626) 'val_files': '/home/u2024001049/Reasoning/data/gsm8k/test-00000-of-00001.parquet'},
(main_task pid=26626) 'reward_model': {'enable': False,
(main_task pid=26626) 'forward_max_token_len_per_gpu': 32768,
(main_task pid=26626) 'max_length': None,
(main_task pid=26626) 'micro_batch_size': None,
(main_task pid=26626) 'micro_batch_size_per_gpu': None,
(main_task pid=26626) 'model': {'external_lib': None,
(main_task pid=26626) 'fsdp_config': {'fsdp_size': -1,
(main_task pid=26626) 'min_num_params': 0,
(main_task pid=26626) 'param_offload': False},
(main_task pid=26626) 'input_tokenizer': '/fs/archive/share/Qwen/Qwen2___5-0___5B-Instruct',
(main_task pid=26626) 'path': '/fs/archive/share/Qwen/Qwen2___5-0___5B-Instruct',
(main_task pid=26626) 'use_remove_padding': False},
(main_task pid=26626) 'reward_manager': 'naive',
(main_task pid=26626) 'strategy': 'fsdp',
(main_task pid=26626) 'ulysses_sequence_parallel_size': 1,
(main_task pid=26626) 'use_dynamic_bsz': False},
(main_task pid=26626) 'trainer': {'critic_warmup': 0,
(main_task pid=26626) 'default_hdfs_dir': None,
(main_task pid=26626) 'default_local_dir': '/home/u2024001049/Reasoning/model/stage2/qwen2.5-0.5b-instruct',
(main_task pid=26626) 'del_local_ckpt_after_load': False,
(main_task pid=26626) 'experiment_name': 'qwen2_0.5b_function_rm',
(main_task pid=26626) 'logger': ['wandb'],
(main_task pid=26626) 'n_gpus_per_node': 4,
(main_task pid=26626) 'nnodes': 1,
(main_task pid=26626) 'project_name': 'verl_grpo_example_gsm8k',
(main_task pid=26626) 'remove_previous_ckpt_in_save': False,
(main_task pid=26626) 'resume_from_path': False,
(main_task pid=26626) 'resume_mode': 'auto',
(main_task pid=26626) 'save_freq': -1,
(main_task pid=26626) 'test_freq': 5,
(main_task pid=26626) 'total_epochs': 1,
(main_task pid=26626) 'total_training_steps': None,
(main_task pid=26626) 'val_generations_to_log_to_wandb': 0}}
(main_task pid=26626) WARNING: val_batch_size is deprecated. Validation datasets are sent to inference engines as a whole batch, which will schedule the memory themselves.
(main_task pid=26626) [validate_config] All configuration checks passed successfully!
(main_task pid=26626) original dataset len: 7473
(main_task pid=26626) filter dataset len: 7473
(main_task pid=26626) original dataset len: 1319
(main_task pid=26626) filter dataset len: 1319
(main_task pid=26626) Size of train dataloader: 1868
(main_task pid=26626) Total training steps: 1868
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff308b51161c6df986f5e8ce0502000000 Worker ID: ecb9585c3a722865f339a86d7d9ae9abb131b502d1452edadee7b0e3 Node ID: ed9580448251cc916430246358f4b61106db8cf2a06ff38a04a51eb1 Worker IP address: 10.0.0.1 Worker port: 10074 Worker PID: 27499 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
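The underlying worker error above, "thread: Resource temporarily unavailable [system:11]", is EAGAIN from `pthread_create` (the crash happens inside `boost::asio::thread_pool` construction), which usually points to a per-user process/thread limit or memory pressure rather than a verl bug. A minimal diagnostics sketch, assuming a Linux host; these are generic system commands, not part of verl or Ray:

```shell
# Check the limits that commonly cause EAGAIN on thread creation:
ulimit -u                          # max user processes (each thread counts)
cat /proc/sys/kernel/threads-max   # system-wide thread ceiling
cat /proc/sys/kernel/pid_max       # PID space shared by all threads

# Check whether the OOM killer terminated the worker (root cause (1) above):
dmesg 2>/dev/null | grep -iE 'out of memory|killed process' | tail -n 5 || true
```

If `ulimit -u` is small relative to the number of Ray workers times their thread pools, raising it in the shell (or via `/etc/security/limits.conf`) before launching may resolve the crash.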
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/home/u2024001049/Reasoning/data/gsm8k/train-00000-of-00001.parquet', 'data.val_files=/home/u2024001049/Reasoning/data/gsm8k/test-00000-of-00001.parquet', 'data.train_batch_size=4', 'data.val_batch_size=4', 'data.max_prompt_length=512', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=/fs/archive/share/Qwen/Qwen2___5-0___5B-Instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=2', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2', 'actor_rollout_ref.rollout.tensor_model_parallel_size=1', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.rollout.n=5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[wandb]', 'trainer.project_name=verl_grpo_example_gsm8k', 'trainer.experiment_name=qwen2_0.5b_function_rm', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=5', 'trainer.total_epochs=1']
Traceback (most recent call last):
File "/home/u2024001049/verl/verl/trainer/main_ppo.py", line 25, in main
run_ppo(config)
File "/home/u2024001049/verl/verl/trainer/main_ppo.py", line 33, in run_ppo
ray.get(main_task.remote(config, compute_score))
File "/fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_private/worker.py", line 2771, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_private/worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=26626, ip=10.0.0.1)
File "/home/u2024001049/verl/verl/trainer/main_ppo.py", line 127, in main_task
trainer.init_workers()
File "/home/u2024001049/verl/verl/trainer/ppo/ray_trainer.py", line 749, in init_workers
self.ref_policy_wg.init_model()
File "/home/u2024001049/verl/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: create_colocated_worker_cls.<locals>.WorkerDict
actor_id: 308b51161c6df986f5e8ce0502000000
pid: 27499
name: 31cVuRWorkerDict_0:3
namespace: 339a5d5d-4623-4811-8206-b1e9bdfd8cb0
ip: 10.0.0.1
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
The actor never ran - it was cancelled before it started running.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=27497) [2025-03-04 19:28:30,014 E 27497 27497] logging.cc:113: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11] [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(WorkerDict pid=27110) [2025-03-04 19:28:30,101 E 27110 27110] logging.cc:120: Stack trace: [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13aa8ea) [0x14e6930148ea] ray::operator<<() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13ae022) [0x14e693018022] ray::TerminateHandler() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x6eb4a0) [0x14e6923554a0] boost::throw_exception<>() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13ba3cb) [0x14e6930243cb] boost::asio::detail::do_throw_error() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13badeb) [0x14e693024deb] boost::asio::detail::posix_thread::start_thread() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13bb24c) [0x14e69302524c] boost::asio::thread_pool::thread_pool() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0xccd864) [0x14e692937864] ray::rpc::(anonymous namespace)::_GetServerCallExecutor() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x14e6929378f9] ray::rpc::GetServerCallExecutor() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0xcf116e) [0x14e69295b16e] std::_Function_handler<>::_M_invoke() [repeated 7x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x99faa2) [0x14e692609aa2] ray::core::InboundRequest::Accept() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x9bcdfd) [0x14e692626dfd] ray::core::NormalSchedulingQueue::ScheduleRequests() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0xd0cd38) [0x14e692976d38] EventTracker::RecordExecution() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0xcf15e6) [0x14e69295b5e6] boost::asio::detail::completion_handler<>::do_complete() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13b7a5b) [0x14e693021a5b] boost::asio::detail::scheduler::do_run_one() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13b93d9) [0x14e6930233d9] boost::asio::detail::scheduler::run() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(+0x13b9ae2) [0x14e693023ae2] boost::asio::io_context::run() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0x117) [0x14e6925604b7] ray::core::CoreWorker::RunTaskExecutionLoop() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x41) [0x14e692606221] ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop() [repeated 2x across cluster]
(WorkerDict pid=27110) /fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv+0x1d) [0x14e69260643d] ray::core::CoreWorkerProcess::RunTaskExecutionLoop() [repeated 2x across cluster]
(WorkerDict pid=27110) ray::WorkerDict(_PyEval_EvalFrameDefault+0x68f) [0x4e808f] _PyEval_EvalFrameDefault [repeated 4x across cluster]
(WorkerDict pid=27110) ray::WorkerDict() [0x4f81d3] function_code_fastcall [repeated 2x across cluster]
(WorkerDict pid=27110) ray::WorkerDict() [0x4e6afa] _PyEval_EvalCode [repeated 2x across cluster]
(WorkerDict pid=27110) ray::WorkerDict(_PyEval_EvalCodeWithName+0x47) [0x4e6787] _PyEval_EvalCodeWithName [repeated 2x across cluster]
(WorkerDict pid=27110) ray::WorkerDict(PyEval_EvalCodeEx+0x39) [0x4e6739] PyEval_EvalCodeEx [repeated 2x across cluster]
(WorkerDict pid=27110) ray::WorkerDict(PyEval_EvalCode+0x1b) [0x5942bb] PyEval_EvalCode [repeated 2x across cluster]
(WorkerDict pid=27110) ray::WorkerDict() [0x5c1777] run_eval_code_obj [repeated 2x across cluster]
(WorkerDict pid=27110) ray::WorkerDict() [0x5bd780] run_mod [repeated 2x across cluster]
(WorkerDict pid=27110) ray::WorkerDict() [0x456695] pyrun_file.cold [repeated 2x across cluster]
(WorkerDict pid=27110) ray::WorkerDict(PyRun_SimpleFileExFlags+0x1a2) [0x5b7462] PyRun_SimpleFileExFlags [repeated 2x across cluster]
(WorkerDict pid=27110) ray::WorkerDict(Py_RunMain+0x37e) [0x5b49de] Py_RunMain [repeated 2x across cluster]
(WorkerDict pid=27110) ray::WorkerDict(Py_BytesMain+0x39) [0x588369] Py_BytesMain [repeated 2x across cluster]
(WorkerDict pid=27110) /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x14e693bc5555] __libc_start_main [repeated 2x across cluster]
(WorkerDict pid=27110) ray::WorkerDict() [0x58821e] [repeated 2x across cluster]
(WorkerDict pid=27110) *** SIGABRT received at time=1741087710 on cpu 40 *** [repeated 2x across cluster]
(WorkerDict pid=27110) PC: @ 0x14e693bd9387 (unknown) raise [repeated 2x across cluster]
(WorkerDict pid=27110) @ 0x14e6918f4580 (unknown) (unknown) [repeated 4x across cluster]
(WorkerDict pid=27110) @ 0x14e6918f435a (unknown) __cxxabiv1::__terminate() [repeated 2x across cluster]
(WorkerDict pid=27110) [2025-03-04 19:28:30,102 E 27110 27110] logging.cc:484: *** SIGABRT received at time=1741087710 on cpu 40 *** [repeated 2x across cluster]
(WorkerDict pid=27110) [2025-03-04 19:28:30,102 E 27110 27110] logging.cc:484: PC: @ 0x14e693bd9387 (unknown) raise [repeated 2x across cluster]
(WorkerDict pid=27110) [2025-03-04 19:28:30,102 E 27110 27110] logging.cc:484: @ 0x14e6918f4580 (unknown) (unknown) [repeated 4x across cluster]
(WorkerDict pid=27110) [2025-03-04 19:28:30,102 E 27110 27110] logging.cc:484: @ 0x14e6918f435a (unknown) __cxxabiv1::__terminate() [repeated 2x across cluster]
(WorkerDict pid=27110) Fatal Python error: Aborted [repeated 2x across cluster]
(WorkerDict pid=27110) Stack (most recent call first): [repeated 2x across cluster]
(WorkerDict pid=27110) File "/fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_private/worker.py", line 935 in main_loop [repeated 2x across cluster]
(WorkerDict pid=27110) File "/fs/archive/share/verl_env/lib/python3.9/site-packages/ray/_private/workers/default_worker.py", line 297 in <module> [repeated 2x across cluster]
And my run script is as follows:
ray stop
pkill -9 ray
module load cuda/12.1.1
module load gcc/9.5.0
set -x
export VLLM_ATTENTION_BACKEND=XFORMERS
export WANDB_API_KEY=xxxx
# export HOME=""
export CUDA_VISIBLE_DEVICES=1,2,3,4
ray start --head --num-gpus=4
ray status
python3 -c "import ray; ray.init(address='auto'); print(ray.available_resources())"
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=/home/Reasoning/data/gsm8k/train-00000-of-00001.parquet \
data.val_files=/home/Reasoning/data/gsm8k/test-00000-of-00001.parquet \
data.train_batch_size=4 \
data.val_batch_size=4 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=/fs/archive/share/Qwen/Qwen2___5-0___5B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=2 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_0.5b_function_rm' \
trainer.n_gpus_per_node=4 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=1 2>&1 | tee log.txt
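For reference, here is a rough GPU-memory budget for the config above. The numbers are assumptions (a hypothetical 40 GB GPU), not taken from the logs; the only value from the script is gpu_memory_utilization=0.6. With hybrid_engine=True, the FSDP actor/ref and the vLLM rollout engine share the same GPUs, so the vLLM reservation directly shrinks the room left for training state and activations:

```python
# Rough headroom sketch. GPU_MEM_GB is a hypothetical value for
# illustration; only GPU_MEM_UTIL comes from the run script above.
GPU_MEM_GB = 40.0                       # assumed per-GPU memory (e.g. A100-40GB)
GPU_MEM_UTIL = 0.6                      # actor_rollout_ref.rollout.gpu_memory_utilization

# vLLM pre-reserves this fraction of each GPU for weights + KV cache.
vllm_reserved = GPU_MEM_UTIL * GPU_MEM_GB

# With hybrid_engine=True, whatever is left must hold the FSDP actor/ref
# weights, gradients, optimizer states, and training activations.
headroom = GPU_MEM_GB - vllm_reserved

print(f"vLLM reserves ~{vllm_reserved:.0f} GB per GPU, "
      f"leaving ~{headroom:.0f} GB for FSDP training state")
```

If the headroom is too small for the training side, the worker can be killed by the OOM killer, which matches root cause (1) in the error below; lowering gpu_memory_utilization is one knob to try.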
It seems these are the problems:
A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff308b51161c6df986f5e8ce0502000000 Worker ID: ecb9585c3a722865f339a86d7d9ae9abb131b502d1452edadee7b0e3 Node ID: ed9580448251cc916430246358f4b61106db8cf2a06ff38a04a51eb1 Worker IP address: 10.0.0.1 Worker port: 10074 Worker PID: 27499 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Error executing job with overrides: ['algorithm.adv_estimator=grpo......trainer.total_epochs=1']
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
The actor never ran - it was cancelled before it started running.
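In case it helps with debugging: the error lists the kernel OOM killer as root cause (1), and one way to check that is to grep the kernel log for the dead worker's PID (27499 here). This is just a sketch; the log path (/var/log/syslog vs. journalctl/dmesg) depends on the node's setup:

```shell
# Sketch: check whether the kernel OOM killer terminated a given PID.
# Usage: check_oom <pid> <logfile>
check_oom() {
  # Kernel OOM-killer entries look like "Out of memory: Killed process <pid> ..."
  grep -iE "killed process $1|out of memory" "$2"
}

# On the training node one would typically inspect the live kernel log, e.g.:
#   dmesg -T | grep -iE 'killed process 27499|out of memory'
# or, on systemd machines:
#   journalctl -k | grep -iE 'killed process 27499|out of memory'
```

A hit for the worker PID would confirm the OOM hypothesis; no hit points at the SIGABRT/SIGSEGV path instead.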
Hope someone could help me with this plz:-)