Krish0909
Contributor

Fixes #39392

This change improves loss calculation correctness for multi-GPU training by enabling proper token averaging across devices by default.

What does this PR do?

Changes the default value of average_tokens_across_devices from False to True in TrainingArguments. With the flag enabled, the token count used to normalize the loss is synchronized across devices, so in multi-GPU training the loss is averaged over all tokens in the global batch rather than separately on each device.

As noted in the original issue, this feature provides reproducibility and correctness benefits with no downsides, so there's no reason to keep it disabled by default.
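
For illustration, here is a minimal pure-Python sketch of why the normalization matters when devices hold different numbers of tokens. It is not the Trainer implementation: the per-token loss values and the two helper functions are hypothetical, and the mapping to the flag is an approximation (per-device averaging roughly corresponds to the old default, global token averaging to the new one).

```python
# Hypothetical per-token losses on two devices with unequal token counts.
# These numbers are illustrative only, not taken from a real run.
device_losses = [
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # device 0: 6 non-padding tokens
    [0.5, 0.5],                      # device 1: 2 non-padding tokens
]


def mean_of_per_device_means(losses_per_device):
    """Each device averages over its own tokens, then devices are averaged."""
    per_device = [sum(tokens) / len(tokens) for tokens in losses_per_device]
    return sum(per_device) / len(per_device)


def global_token_mean(losses_per_device):
    """Sum every token loss and divide by the token count across all devices."""
    total_loss = sum(sum(tokens) for tokens in losses_per_device)
    total_tokens = sum(len(tokens) for tokens in losses_per_device)
    return total_loss / total_tokens


per_device_result = mean_of_per_device_means(device_losses)
global_result = global_token_mean(device_losses)
print(per_device_result)  # 1.25
print(global_result)      # 1.625
```

In this sketch the two tokens on device 1 carry as much weight as the six tokens on device 0 under per-device averaging, while the global mean weights every token equally. When all devices hold the same number of tokens, the two results coincide.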

Before submitting

Who can review?

@zach-huggingface @SunMarc @qgallouedec (trainer maintainers)

@SunMarc
Member

SunMarc commented Jul 16, 2025

@qgallouedec, did you face any issues after changing this default in trl? Happy to do it in transformers otherwise.

@Krish0909
Contributor Author

@SunMarc Hey! Happy to help test this on the trl side as well if needed. Let me know if you’d like me to run any specific checks!

@qgallouedec
Member

No. No issue that I'm aware of

Member

@SunMarc SunMarc left a comment

LGTM

@SunMarc SunMarc changed the title from "Enable average_tokens_across_devices by default in TrainingArguments" to "🚨🚨🚨 [Trainer] Enable average_tokens_across_devices by default in TrainingArguments" on Jul 18, 2025
@SunMarc SunMarc enabled auto-merge (squash) July 18, 2025 15:16
@Krish0909
Contributor Author

Hey @SunMarc , I noticed the tests are failing on ci/circleci: tests_torch and run_tests, while the rest of the checks seem fine.
Given the scope of this PR is limited to changing the default argument with no functional impact beyond configuration, should I investigate and fix these failures or are these known unrelated issues and safe to merge as-is?
Also, the workflows are awaiting approval — could you please approve them if a full CI run is required?
Thanks!

@SunMarc
Member

SunMarc commented Jul 21, 2025

No, it's fine, I'll take care of merging this PR.

@Krish0909
Contributor Author

Thanks!

@SunMarc SunMarc merged commit fdc0566 into huggingface:main Jul 21, 2025
25 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jul 22, 2025
…rainingArguments` (huggingface#39395)

Enable average_tokens_across_devices by default in TrainingArguments

Fixes huggingface#39392

This change improves loss calculation correctness for multi-GPU training by enabling proper token averaging across devices by default.

Co-authored-by: Krishnan Vignesh <krishnanvignesh@Krishnans-MacBook-Air.local>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Enabling average_tokens_across_devices by default in Trainer