When I trained DS-1.5B on a single node with 8×80 GB GPUs, the program did not report any errors. However, the Ray process hung after 300+ steps: no new workers were allocated, and most Python processes on the server were in the S (interruptible sleep) state. Has anyone encountered this issue before?
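In case it helps anyone trying to reproduce this, here is a rough sketch of how the process states could be counted on the node. It is not part of the training code. It only reads the Linux `/proc/<pid>/stat` files with the standard library. The function names and the name prefixes (`python`, `ray`) are my own assumptions for illustration; adjust them to whatever the worker processes are actually called on your machine.

```python
import os
from collections import Counter


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def python_process_states(proc_root="/proc", prefixes=("python", "ray")):
    """Count scheduler states of processes whose name starts with a prefix.

    The prefixes are an assumption about how the workers are named.
    """
    counts = Counter()
    if not os.path.isdir(proc_root):
        return counts  # not on Linux, or /proc is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                _, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            continue  # process exited or is unreadable
        if comm.startswith(tuple(prefixes)):
            counts[state] += 1
    return counts


if __name__ == "__main__":
    # Prints a mapping of state letter (R, S, D, ...) to process count.
    print(dict(python_process_states()))
```

Note that S only means a process is sleeping while it waits for something, so a large S count is consistent with a hang but does not by itself show which call the workers are blocked on.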