We noticed that the load-balancing loss (aux_loss) implemented in Mixtral's MoE modeling code (modeling_mixtral.py#L1244) is not added to the loss computed by the DPO/KTO trainers.
Isn't this term needed to keep the load balanced across experts after fine-tuning, or does the KL penalty sufficiently prevent the router weights from drifting?
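
For concreteness, a minimal sketch of what adding the term back might look like, assuming a Mixtral-style policy model; `dpo_loss_fn` is a hypothetical stand-in for the trainer's preference-loss computation, not the trainers' actual code:

```python
import torch
from transformers.models.mixtral.modeling_mixtral import load_balancing_loss_func

def dpo_loss_with_aux(policy_model, batch, dpo_loss_fn, router_aux_loss_coef=0.02):
    # Ask the model to expose per-layer gate logits.
    outputs = policy_model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        output_router_logits=True,
    )
    # `dpo_loss_fn` stands in for the trainer's preference loss.
    loss = dpo_loss_fn(outputs, batch)

    # Same load-balancing term the modeling code adds during pretraining/SFT.
    aux_loss = load_balancing_loss_func(
        outputs.router_logits,
        policy_model.config.num_local_experts,
        policy_model.config.num_experts_per_tok,
        batch["attention_mask"],
    )
    return loss + router_aux_loss_coef * aux_loss
```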