We noticed that the load-balancing loss (aux_loss) implemented in Mixtral's MoE modeling code (modeling_mixtral.py#L1244) is not added to the loss computed by the DPO/KTO trainers.
Isn't this term needed to keep the load balanced across experts after fine-tuning, or does the KL penalty sufficiently prevent the router weights from drifting?
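
For concreteness, a minimal sketch of what adding the term back might look like, assuming a Mixtral-style policy model; `dpo_loss_fn` is a hypothetical stand-in for the trainer's preference-loss computation, not the trainers' actual code:

```python
import torch
from transformers.models.mixtral.modeling_mixtral import load_balancing_loss_func

def dpo_loss_with_aux(policy_model, batch, dpo_loss_fn, router_aux_loss_coef=0.02):
    # Ask the model to expose per-layer gate logits.
    outputs = policy_model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        output_router_logits=True,
    )
    # `dpo_loss_fn` stands in for the trainer's preference loss.
    loss = dpo_loss_fn(outputs, batch)

    # Same load-balancing term the modeling code adds during pretraining/SFT.
    aux_loss = load_balancing_loss_func(
        outputs.router_logits,
        policy_model.config.num_local_experts,
        policy_model.config.num_experts_per_tok,
        batch["attention_mask"],
    )
    return loss + router_aux_loss_coef * aux_loss
```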