Update exp_manager to better handle parallelism #12211

paarthneekhara · 2025-02-17T07:53:55Z

Update exp_manager.py to handle two things:

Avoid creating multiple TensorBoard logs for the same experiment when training resumes in Slurm jobs.
Avoid validation loss spikes when training restarts. Sometimes the second condition is False in PyTorch Lightning, which causes the loss spikes.
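
For readers unfamiliar with the first item, the general idea of reusing one log directory on resume can be sketched as below. This is only an illustration under stated assumptions, not the change made in this PR: the helper name `pick_version_dir`, the `version_N` directory layout, and the `resume` flag are all hypothetical and are not taken from exp_manager.py.

```python
from pathlib import Path


def pick_version_dir(exp_dir: Path, resume: bool) -> Path:
    """Return the directory a logger should write to (hypothetical helper).

    Assumes runs are stored as ``exp_dir/version_N``. When resuming, the
    newest existing directory is reused so a restarted job does not start
    a second set of logs; otherwise the next unused index is allocated.
    """
    existing = sorted(
        int(p.name.split("_", 1)[1])
        for p in exp_dir.glob("version_*")
        if p.is_dir() and p.name.split("_", 1)[1].isdigit()
    )
    if resume and existing:
        # Resume: keep writing into the most recent run directory.
        return exp_dir / f"version_{existing[-1]}"
    # Fresh run (or nothing to resume from): allocate a new index.
    next_idx = existing[-1] + 1 if existing else 0
    return exp_dir / f"version_{next_idx}"
```

In a multi-process job, a helper like this would typically be called on one rank only and the result shared with the others, so that all ranks agree on a single directory.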

Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>

github-actions · 2025-03-04T02:01:01Z

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions · 2025-03-11T02:01:57Z

This PR was closed because it has been inactive for 7 days since being marked as stale.

Signed-off-by: Jason <jasoli@nvidia.com>

Signed-off-by: Paarth Neekhara <pneekhara@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Signed-off-by: Yuanzhe Dong <yudong@nvidia.com>

* Update exp_manager to better handle parallelism (#12211)

Signed-off-by: Paarth Neekhara <pneekhara@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Signed-off-by: Yuanzhe Dong <yudong@nvidia.com>

* Add Neva support for VLM inference (#12531)

* Add Neva support for vlm inference

* Apply isort and black reformatting

Signed-off-by: meatybobby <meatybobby@users.noreply.github.com>

* keep legacy generate

---------

Signed-off-by: meatybobby <meatybobby@users.noreply.github.com>
Signed-off-by: meatybobby <bobchen@nvidia.com>
Co-authored-by: meatybobby <meatybobby@users.noreply.github.com>
Signed-off-by: Yuanzhe Dong <yudong@nvidia.com>

* fix automodle benchmark script

Signed-off-by: Yuanzhe Dong <yudong@nvidia.com>

* use hf mock dataset

Signed-off-by: Yuanzhe Dong <yudong@nvidia.com>

---------

Signed-off-by: Paarth Neekhara <pneekhara@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Yuanzhe Dong <yudong@nvidia.com>
Signed-off-by: meatybobby <meatybobby@users.noreply.github.com>
Signed-off-by: meatybobby <bobchen@nvidia.com>
Co-authored-by: Paarth Neekhara <paarth.n@gmail.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: meatybobby <bobchen@nvidia.com>
Co-authored-by: meatybobby <meatybobby@users.noreply.github.com>

exp manager updates

12614da

Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>

github-actions bot added the stale label Mar 4, 2025

github-actions bot closed this Mar 11, 2025

blisc changed the title ~~exp manager updates~~ Update exp_manager to better handle parallelism Mar 12, 2025

blisc reopened this Mar 12, 2025

blisc previously approved these changes Mar 12, 2025

blisc marked this pull request as ready for review March 12, 2025 14:27

merge with main

f9f7718

Signed-off-by: Jason <jasoli@nvidia.com>

blisc dismissed their stale review via f9f7718 March 12, 2025 14:59

blisc approved these changes Mar 12, 2025

github-actions bot removed the stale label Mar 13, 2025

blisc added the Run CICD label Mar 13, 2025

blisc enabled auto-merge (squash) March 13, 2025 17:11

blisc disabled auto-merge March 13, 2025 17:11

blisc enabled auto-merge (squash) March 13, 2025 17:11

blisc merged commit 19ba856 into NVIDIA:main Mar 13, 2025
194 checks passed
