Currently, FSDP2 runs well for non-MoE models in general, but not for MoE models (e.g., Qwen3-30B-A3B, DeepSeek-V2-Lite):
- For Qwen3-30B-A3B, it is noticeably slower than Qwen3-32B, especially during the refit process or when using hf-tp-plan with dtensor tp > 1.
- For DeepSeek-V2-Lite, it fails on `model.layers.0.self_attn.rotary_emb.cos_cached` with a shape mismatch: `v.shape=torch.Size([2048, 64])` while `self.reference_model_buffers[k].shape=torch.Size([163840, 64])`:
File "/workspace/nemo_rl/models/policy/dtensor_policy_worker.py", line 649, in get_reference_policy_logprobs
with self.use_reference_model():
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yukih/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 137, in __enter__
return next(self.gen)
^^^^^^^^^^^^^^
File "/workspacenemo_rl/models/policy/dtensor_policy_worker.py", line 626, in use_reference_model
val.copy_(self.reference_model_buffers[k])
RuntimeError: The size of tensor a (2048) must match the size of tensor b (163840) at non-singleton dimension 0
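
The mismatch happens because rotary-embedding caches like `cos_cached` are sized to runtime state: the live model holds a 2048-row cache while the snapshot holds a 163840-row table (presumably the model's full `max_position_embeddings`), so blindly calling `copy_()` on every buffer cannot work. Below is a minimal sketch of a more tolerant restore loop; `restore_reference_buffers` and the reassign-on-mismatch behavior are assumptions for illustration only, not the actual logic in `dtensor_policy_worker.py`:

```python
import torch
import torch.nn as nn


def restore_reference_buffers(
    model: nn.Module, reference_model_buffers: dict[str, torch.Tensor]
) -> None:
    """Restore snapshotted buffers, tolerating shape drift in dynamic caches.

    Hypothetical workaround sketch: names mirror the traceback above, but the
    shape-mismatch handling is an assumption, not the project's fix.
    """
    for name, buf in model.named_buffers():
        cached = reference_model_buffers.get(name)
        if cached is None:
            continue  # buffer did not exist when the snapshot was taken
        if buf.shape == cached.shape:
            buf.copy_(cached)  # fast path: in-place copy, as the current code does
        else:
            # Shape drifted (e.g. cos_cached at 2048 rows vs. the snapshot's
            # 163840-row table), so copy_() would raise the RuntimeError above.
            # Rebind the buffer object instead of copying in place; assigning a
            # tensor to a registered-buffer attribute updates the buffer dict.
            module_name, _, attr = name.rpartition(".")
            owner = model.get_submodule(module_name) if module_name else model
            setattr(owner, attr, cached.to(device=buf.device, dtype=buf.dtype))
```

An alternative, possibly simpler, approach would be to skip buffers whose shapes no longer match, since rotary cos/sin caches are deterministic and are regrown on the next forward pass anyway.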