
@RayenTian (Contributor) commented Jul 21, 2025

What does this PR do?

This pull request removes the tie weight check in the DTensor worker, lifting a previous limitation that prevented training models that use tie_word_embeddings when tensor parallel size (tp_size) > 1.
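For context, the removed guard behaved roughly like the minimal sketch below. The function name and error message are illustrative, not the exact NeMo RL source; NRL_SKIP_TIED_WEIGHT_CHECK is the escape-hatch environment variable this PR also deletes.

```python
import os

# Hypothetical reconstruction of the removed check in the DTensor worker:
# tied embeddings plus tp_size > 1 used to be rejected unless the
# NRL_SKIP_TIED_WEIGHT_CHECK escape hatch was set.
def assert_weights_not_tied(model_config, tp_size: int) -> None:
    if (
        tp_size > 1
        and getattr(model_config, "tie_word_embeddings", False)
        and os.environ.get("NRL_SKIP_TIED_WEIGHT_CHECK", "0") != "1"
    ):
        raise ValueError(
            "Models with tie_word_embeddings are not supported when "
            "tensor_parallel_size > 1. Set NRL_SKIP_TIED_WEIGHT_CHECK=1 to bypass."
        )
```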

Issues

closes #684

Test Result

Removing the NRL_SKIP_TIED_WEIGHT_CHECK env variable

Experiments were run with meta-llama/Llama-3.2-1B, Qwen/Qwen2.5-1.5B-Instruct, and google/gemma-2-2b-it. With the truncation rate within a reasonable range, both tp_size=1 and tp_size=2 produced comparable rewards, indicating that training proceeds as expected and that tied word embeddings are handled correctly across tensor parallel configurations.

Main setup:

  • policy.generation.temperature=1.0
  • policy.max_total_sequence_length=2048
  • cluster.gpus_per_node=8

[Reward plots were attached for Llama-3.2-1B, Qwen2.5-1.5B-Instruct, and google/gemma-2-2b-it]

Tested with:

  • policy.dynamic_batching.enabled=true
  • policy.sequence_packing.enabled=false
  • policy.train_global_batch_size=128

[Plots: gemma reward, gemma truncation rate, gemma token_mult_prob_error]

Removing model.config.tie_word_embeddings in nemo_rl/models/dtensor/parallelize.py

Experiments were conducted with Qwen/Qwen2.5-1.5B-Instruct and Qwen/Qwen2.5-7B-Instruct, representing the originally tied and untied weight configurations, respectively. I disabled Qwen's optimized parallel plan in PARALLIZE_FUNCTIONS, forcing both models to use the generic HF plan for testing.

  • The results indicate that, after removing the model.config.tie_word_embeddings check, the HF plan and the optimized plan produce identical outcomes (sketched below).
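For reference, a minimal sketch of what the ungated HF TP plan looks like, assuming Llama-style module paths and the public DTensor parallel styles in recent PyTorch; the real plan in nemo_rl/models/dtensor/parallelize.py has more entries and may differ in layout choices.

```python
from torch.distributed.tensor import Replicate
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

def build_hf_tp_plan(config) -> dict:
    """Module-path -> ParallelStyle mapping for parallelize_module."""
    plan = {
        # attention / MLP entries elided for brevity
    }
    # Previously these two entries were gated on
    # `not config.tie_word_embeddings`; with the guard removed, tied and
    # untied models share the same plan.
    plan["model.embed_tokens"] = RowwiseParallel(input_layouts=Replicate())
    plan["lm_head"] = ColwiseParallel(output_layouts=Replicate())
    return plan
```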

Qwen/Qwen2.5-1.5B-Instruct

  • This model uses tied word embeddings by default.
  • Current results are from a short test of 10 steps.


Qwen/Qwen2.5-7B-Instruct

  • This model uses untied word embeddings by default.
  • Current results are from a short test of 10 steps.


More Custom Tests

Some additional tests were also conducted here.
For models like Qwen/Qwen2.5-1.5B-Instruct, which use tied word embeddings by default, we forcibly disabled tie_word_embeddings. The results showed significant anomalies in both token_mult_prob_error and reward. We suspect this is because vLLM still keeps the embeddings tied on its side, so calling update_weights cannot truly untie them: in practice vLLM ends up using either the embed or the lm_head weight rather than both, unlike training, which makes token_mult_prob_error behave abnormally.
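A toy PyTorch sketch of the suspected mechanism (toy sizes; the vLLM-side behavior is our hypothesis, not confirmed internals): with tied embeddings, lm_head.weight and embed_tokens.weight are one tensor, so an engine that stays tied can only keep one of the two diverged weights it receives.

```python
import torch
import torch.nn as nn

vocab, hidden = 8, 4
embed = nn.Embedding(vocab, hidden)
lm_head = nn.Linear(hidden, vocab, bias=False)
lm_head.weight = embed.weight  # tying: a single shared parameter

# Training-side "untying" produces two diverging tensors...
trained_embed = embed.weight.detach().clone() + 0.1
trained_lm_head = embed.weight.detach().clone() - 0.1

# ...but an engine that keeps the tie can only hold one of them:
with torch.no_grad():
    embed.weight.copy_(trained_embed)  # lm_head.weight changes along with it

# The served lm_head no longer matches the trained one, so logprobs drift.
print(torch.allclose(lm_head.weight, trained_lm_head))  # False
```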

[Plots: Qwen2.5-1.5B reward and token_mult_prob_error]

For models like Qwen/Qwen2.5-7B-Instruct, which do not use tied embeddings by default, we forcibly enabled tie_word_embeddings. The results showed large differences in reward, while token_mult_prob_error remained stable. This is likely because the embed and lm_head weights originally differed; forcing a tie discards the original lm_head weights. When update_weights is called, vLLM's lm_head is likewise overwritten with the embed weights, so token_mult_prob_error stays normal; however, since the original weights were replaced, the reward becomes abnormal.

[Plots: Qwen2.5-7B reward and token_mult_prob_error]

Additional Notes on Gemma Model Testing

  • gemma-3-1b-it: This model sets kv_head=1, which is currently incompatible with tensor parallelism at TP=2, so it could not be tested in that configuration (see the sketch below).
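The constraint is the standard TP head-sharding rule: the number of KV heads must divide evenly by the TP degree. A minimal sketch of the check (the function itself is illustrative, not NeMo RL source):

```python
def check_tp_compatible(num_key_value_heads: int, tp_size: int) -> None:
    # TP shards attention heads across ranks, so KV heads must divide evenly.
    if num_key_value_heads % tp_size != 0:
        raise ValueError(
            f"num_key_value_heads={num_key_value_heads} is not divisible "
            f"by tensor_parallel_size={tp_size}"
        )

check_tp_compatible(num_key_value_heads=1, tp_size=2)  # raises for gemma-3-1b-it
```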

@RayenTian added the CI:L1 Run doctests, unit tests, and functional tests label Jul 21, 2025
@RayenTian added CI:L0 Run doctests and unit tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 21, 2025
@RayenTian requested review from yuki-97 and joyang-nv July 21, 2025 09:17
@RayenTian added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Jul 22, 2025
@RayenTian force-pushed the ruit/remove_tie_weight_check branch from e610b0e to a4dbcec on July 22, 2025 04:32
@RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Jul 22, 2025
@yuki-97 (Contributor) commented Jul 22, 2025

Thanks @RayenTian for verifying and removing these things!

Can you also check whether we can remove model.config.tie_word_embeddings from the HF TP plan in nemo_rl/models/dtensor/parallelize.py?

@yuki-97 (Contributor) commented Jul 22, 2025

One more thing: can you also search for issues/227 in the repo and remove (or update) those references?

@RayenTian force-pushed the ruit/remove_tie_weight_check branch from a4dbcec to b1cd9b6 on July 23, 2025 01:50
@SahilJain314 (Contributor) commented:

Can you please attach the token_mult_prob_error plots to the description? (We can't always see the errors when we just look at convergence plots.)

@RayenTian force-pushed the ruit/remove_tie_weight_check branch from b1cd9b6 to fea99a2 on July 24, 2025 03:21
@github-actions bot added the documentation Improvements or additions to documentation label Jul 24, 2025
@RayenTian (Contributor, Author) commented:

> Can you please attach the token_mult_prob_error plots to the description? (We can't always see the errors when we just look at convergence plots.)

Thank you for the suggestion. I have added the plots as requested.

@RayenTian force-pushed the ruit/remove_tie_weight_check branch from c6386b5 to d741557 on July 29, 2025 03:26
@RayenTian added the CI:L0 Run doctests and unit tests label Aug 6, 2025
@RayenTian force-pushed the ruit/remove_tie_weight_check branch from c9f41b3 to 292cc83 on August 6, 2025 07:15
@RayenTian added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Aug 6, 2025
@terrykong previously approved these changes Aug 8, 2025
@SahilJain314 previously approved these changes Aug 8, 2025
@SahilJain314 (Contributor) left a comment:

Thanks for rebasing, LGTM now.

@terrykong dismissed stale reviews from SahilJain314, yuki-97, and themself via a5d70d8 August 8, 2025 22:40
@terrykong force-pushed the ruit/remove_tie_weight_check branch from 292cc83 to a5d70d8 on August 8, 2025 22:40
@terrykong enabled auto-merge August 8, 2025 22:40
@terrykong added this pull request to the merge queue Aug 8, 2025
Merged via the queue into main with commit fecf71e Aug 9, 2025 (23 checks passed)
@terrykong deleted the ruit/remove_tie_weight_check branch August 9, 2025 01:18
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025
youngeunkwon0405 added a commit to youngeunkwon0405/RL that referenced this pull request Aug 25, 2025
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
Labels: CI:L0 Run doctests and unit tests · documentation Improvements or additions to documentation
Development

Successfully merging this pull request may close these issues.

Remove tie weights check in DTensor worker
5 participants