🚨🚨🚨 [Trainer] Enable `average_tokens_across_devices` by default in `TrainingArguments` #39395

Krish0909 · 2025-07-14T06:52:43Z

This change improves loss calculation correctness for multi-GPU training by enabling proper token averaging across devices by default.

What does this PR do?

Changes the default value of `average_tokens_across_devices` from `False` to `True` in `TrainingArguments`. This ensures more accurate loss calculation in multi-GPU training scenarios by properly averaging tokens across devices.

As noted in the original issue, this feature provides reproducibility and correctness benefits with no downsides, so there's no reason to keep it disabled by default.
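To make the difference concrete, here is a rough numeric sketch of the two ways a loss can be normalized when devices see different numbers of tokens. It is plain-Python arithmetic, not the `Trainer` implementation. The names `device_batches`, `mean_of_device_means`, and `global_token_mean` are hypothetical and chosen only for illustration. The sketch assumes that with the option disabled each device normalizes by its own token count, and that with it enabled the loss is normalized by the token count summed over all devices.

```python
# Illustrative only: hypothetical numbers, not the Trainer implementation.
# Each "device" holds a summed token-level loss and a count of non-padding tokens.
device_batches = [
    {"loss_sum": 6.0, "num_tokens": 2},  # device 0: short sequences, mean loss 3.0
    {"loss_sum": 8.0, "num_tokens": 8},  # device 1: long sequences, mean loss 1.0
]


def mean_of_device_means(batches):
    """Normalize on each device by its own token count, then average the results."""
    per_device = [b["loss_sum"] / b["num_tokens"] for b in batches]
    return sum(per_device) / len(per_device)


def global_token_mean(batches):
    """Normalize the total loss by the total token count across all devices."""
    total_loss = sum(b["loss_sum"] for b in batches)
    total_tokens = sum(b["num_tokens"] for b in batches)
    return total_loss / total_tokens


local_avg = mean_of_device_means(device_batches)  # (3.0 + 1.0) / 2
global_avg = global_token_mean(device_batches)    # 14.0 / 10
print(local_avg, global_avg)
```

In this toy case the per-device average gives the two tokens on device 0 the same total weight as the eight tokens on device 1, while the global average weights every token equally. Users who want the previous behavior can presumably still pass `average_tokens_across_devices=False` to `TrainingArguments` explicitly.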

Before submitting

Was this discussed/approved via a GitHub issue? Yes - Enabling `average_tokens_across_devices` by default in Trainer #39392
Did you make sure to update the documentation with your changes? Yes - updated the docstring

Who can review?

@zach-huggingface @SunMarc @qgallouedec (trainer maintainers)

Fixes huggingface#39392

SunMarc · 2025-07-16T12:27:12Z

@qgallouedec, did you face any issues after changing this default in trl? Happy to do it in transformers otherwise.

Krish0909 · 2025-07-17T04:37:58Z

@SunMarc Hey! Happy to help test this on the trl side as well if needed. Let me know if you’d like me to run any specific checks!

qgallouedec · 2025-07-17T15:08:31Z

No. No issue that I'm aware of

SunMarc

LGTM

Krish0909 · 2025-07-20T09:42:56Z

Hey @SunMarc, I noticed the tests are failing on ci/circleci: tests_torch and run_tests, while the rest of the checks seem fine.
Given the scope of this PR is limited to changing the default argument with no functional impact beyond configuration, should I investigate and fix these failures or are these known unrelated issues and safe to merge as-is?
Also, the workflows are awaiting approval — could you please approve them if a full CI run is required?
Thanks!

SunMarc · 2025-07-21T11:59:41Z

> Given the scope of this PR is limited to changing the default argument with no functional impact beyond configuration, should I investigate and fix these failures or are these known unrelated issues and safe to merge as-is?
> Also, the workflows are awaiting approval — could you please approve them if a full CI run is required?
> Thanks!

No, it's fine. I'll take care of merging this PR.

Krish0909 · 2025-07-21T12:00:52Z

> > Given the scope of this PR is limited to changing the default argument with no functional impact beyond configuration, should I investigate and fix these failures or are these known unrelated issues and safe to merge as-is?
> >
> > Also, the workflows are awaiting approval — could you please approve them if a full CI run is required?
> >
> > Thanks!
>
> No it's fine, i'll take care of merging this PR

Thanks!

HuggingFaceDocBuilderDev · 2025-07-21T12:11:41Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…rainingArguments` (huggingface#39395) Enable average_tokens_across_devices by default in TrainingArguments Fixes huggingface#39392 This change improves loss calculation correctness for multi-GPU training by enabling proper token averaging across devices by default. Co-authored-by: Krishnan Vignesh <krishnanvignesh@Krishnans-MacBook-Air.local> Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Enable average_tokens_across_devices by default in TrainingArguments

6e96018

Fixes huggingface#39392 This change improves loss calculation correctness for multi-GPU training by enabling proper token averaging across devices by default.

Merge branch 'main' into fix-average-tokens-default

d2a2060

SunMarc and others added 3 commits July 17, 2025 17:12

Merge branch 'main' into fix-average-tokens-default

3dd8cc4

Merge branch 'main' into fix-average-tokens-default

adee0bf

Merge branch 'main' into fix-average-tokens-default

c2cf8a3

SunMarc approved these changes Jul 18, 2025

SunMarc changed the title ~~Enable average_tokens_across_devices by default in TrainingArguments~~ 🚨🚨🚨 [Trainer] Enable average_tokens_across_devices by default in TrainingArguments Jul 18, 2025

SunMarc enabled auto-merge (squash) July 18, 2025 15:16

Merge branch 'main' into fix-average-tokens-default

dda3b1f

qgallouedec approved these changes Jul 18, 2025

qgallouedec mentioned this pull request Jul 18, 2025

Add comment for average_tokens_across_devices huggingface/trl#3746

Merged

Merge branch 'main' into fix-average-tokens-default

69bd3d8

Merge branch 'main' into fix-average-tokens-default

71171e3

SunMarc merged commit fdc0566 into huggingface:main Jul 21, 2025
25 checks passed

winglian mentioned this pull request Jul 24, 2025

Use DP+FSDP device mesh dimensions for scaling loss with default value of average_tokens_across_devices: True #39648

Open

4 tasks
