feat: Fix and enhances for Nsight system profiling #865

guyueh1 · 2025-08-07T22:26:06Z

What does this PR do ?

Fixes and enhancements for Nsight Systems profiling.

Fix profiling of the vLLM executor by passing ray_workers_use_nsight to llm_kwargs
Add NVTX ranges to the DTensor, Megatron, and vLLM backends
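To illustrate the NVTX part of the change, here is a minimal sketch of wrapping a backend method in a named range so it appears on the Nsight Systems timeline. This is not the PR's code: the decorator name `nvtx_range` and the label `"demo/train_step"` are hypothetical, and the sketch assumes the optional `nvtx` Python package, falling back to a no-op when it is not installed.

```python
import contextlib
import functools

try:
    import nvtx  # NVIDIA's `nvtx` Python package (assumed optional here)
except ImportError:
    nvtx = None


def nvtx_range(name):
    """Hypothetical decorator: run the wrapped function inside an NVTX range."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Use a real NVTX range when the package is available,
            # otherwise a context manager that does nothing.
            if nvtx is not None:
                ctx = nvtx.annotate(name)
            else:
                ctx = contextlib.nullcontext()
            with ctx:
                return fn(*args, **kwargs)

        return wrapper

    return decorator


@nvtx_range("demo/train_step")  # label is illustrative only
def train_step(a, b):
    return a + b
```

Ranges like this cost almost nothing when no profiler is attached, which is why they can stay in the code path permanently.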

terrykong

generally lgtm. thanks!

Does this mean vLLM profiling works now? If so, could you add those instructions back to the nsys-profiling page? I had removed them since it didn't work at the time.

nemo_rl/models/generation/vllm.py

guyueh1 · 2025-08-07T23:06:51Z

> generally lgtm. thanks!
>
> Does this mean vLLM profiling works now? If so, could you add those instructions back to the nsys-profiling page? I had removed them since it didn't work at the time.

It doesn't work quite as expected. The vLLM worker still generates profiles because we still pass the nsight kwargs to VllmGenerationWorker, but those files contain zero CUDA events. Meanwhile, a new set of profiles named worker_process_%procid.nsys-rep is generated, containing the CUDA events for the vLLM engine. We can't control the file name or other nsight arguments because they are hardcoded in vLLM here.

Should we just remove the profiling wrapped around VllmGenerationWorker because that is basically an empty profile?

TODO: I'll update the doc.
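Since the engine profiles land in separately named files, a small helper can separate them from the (possibly empty) wrapper profiles after a run. This is only a sketch: the function name `split_nsys_reports` is hypothetical, and it assumes all reports sit in one directory and that engine reports can be recognized by the `worker_process_` prefix mentioned above.

```python
from pathlib import Path


def split_nsys_reports(report_dir):
    """Split .nsys-rep files into vLLM engine reports and all others.

    Assumes engine reports are the ones whose file name starts with
    "worker_process_", as described in this thread.
    """
    engine_reports, other_reports = [], []
    for path in sorted(Path(report_dir).glob("*.nsys-rep")):
        if path.name.startswith("worker_process_"):
            engine_reports.append(path)
        else:
            other_reports.append(path)
    return engine_reports, other_reports
```

Calling `split_nsys_reports("/path/to/reports")` returns two lists of `Path` objects, so the engine reports can be opened first when looking for CUDA activity.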

terrykong · 2025-08-08T00:31:13Z

> Should we just remove the profiling wrapped around VllmGenerationWorker because that is basically an empty profile?

+1 to removing, if it's not giving anything

guyueh1 · 2025-08-08T18:14:30Z

@terrykong I ended up keeping both the original wrapper on VllmGeneration and the new one, because we need the former for vLLM's MP=1 case and the latter for the MP>1 case. I added an explanation of the generated files to the doc.
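One way to read the MP=1 versus MP>1 split is sketched below. The helper name `build_llm_kwargs` and its parameters are hypothetical, and gating the flag on the model-parallel size is an assumption drawn from this thread, not the merged implementation. Only the `ray_workers_use_nsight` key itself comes from the PR description.

```python
def build_llm_kwargs(base_kwargs, model_parallel_size, nsight_enabled):
    """Hypothetical helper: decide whether to ask vLLM to profile its Ray workers.

    Assumption: with model_parallel_size == 1 the engine runs inside the
    already-profiled worker process, so the extra flag is only needed when
    vLLM spawns separate worker processes (model_parallel_size > 1).
    """
    llm_kwargs = dict(base_kwargs)  # copy so the caller's dict is not mutated
    if nsight_enabled and model_parallel_size > 1:
        llm_kwargs["ray_workers_use_nsight"] = True
    return llm_kwargs
```

Under this reading, the outer wrapper produces the useful profile when MP=1, and the `worker_process_*` files produce it when MP>1.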

terrykong · 2025-08-09T01:02:08Z

lgtm. @guyueh1 could you resolve the DCO job?

guyueh1 · 2025-08-11T16:39:38Z

The DCO check was broken by a previous commit brought in by a merge; I had to force-push over the branch, but it's resolved now.

guyueh1 marked this pull request as ready for review August 7, 2025 22:35

guyueh1 requested a review from terrykong August 7, 2025 22:56

terrykong reviewed Aug 7, 2025

nemo_rl/models/generation/vllm.py (outdated, resolved)

guyueh1 changed the title ~~Fix and enhances for Nsight system profiling~~ feat: Fix and enhances for Nsight system profiling Aug 8, 2025

github-actions bot added the documentation Improvements or additions to documentation label Aug 8, 2025

terrykong previously approved these changes Aug 9, 2025

guyueh1 force-pushed the guyueh/fix_and_enhances_for_nsys branch from c07a200 to bafe1e8 August 11, 2025 16:28

guyueh1 dismissed terrykong’s stale review via 410eda6 August 11, 2025 16:32

squash multiple commits for DCO

787da0f

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 force-pushed the guyueh/fix_and_enhances_for_nsys branch from 410eda6 to 787da0f August 11, 2025 16:38

terrykong approved these changes Aug 11, 2025

terrykong enabled auto-merge August 11, 2025 16:50

terrykong added this pull request to the merge queue Aug 11, 2025

Merged via the queue into main with commit 8fd8c96 Aug 11, 2025
19 checks passed

terrykong deleted the guyueh/fix_and_enhances_for_nsys branch August 11, 2025 20:39

guyueh1 mentioned this pull request Aug 11, 2025

Holistic Profiling Tool in RL Workflow #569

Open

soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025

feat: Fix and enhances for Nsight system profiling (NVIDIA-NeMo#865)

aba93dd

Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>

guyueh1 mentioned this pull request Aug 20, 2025

Add nsys markers #51

Closed

terrykong linked an issue Aug 20, 2025 that may be closed by this pull request

Add nsys markers #51

Closed

jveronvialard pushed a commit that referenced this pull request Aug 27, 2025

feat: Fix and enhances for Nsight system profiling (#865)

42657b1

Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
