paarthneekhara
Collaborator

Update exp_manager.py to handle two things:

  1. Avoid creating multiple TensorBoard logs for the same experiment when resuming training in Slurm jobs.
  2. Avoid validation loss spikes when training restarts. Sometimes the second condition is False in PyTorch Lightning, which causes the loss spikes.
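
As a rough illustration of the first point, the sketch below shows one way a resumed job could keep writing to the log directory of the earlier run instead of opening a new one. It is not the code changed in this PR: `resolve_log_dir`, its parameters, and the timestamp-style version names are all hypothetical and assumed only for the example.

```python
from pathlib import Path


def resolve_log_dir(base_dir: Path, name: str, resume: bool, new_version: str) -> Path:
    """Pick the directory that TensorBoard event files should be written to.

    Hypothetical helper for illustration; it is not part of exp_manager.py.
    """
    exp_dir = base_dir / name
    if resume and exp_dir.is_dir():
        # Reuse the most recent existing version directory so a restarted job
        # appends to the same run instead of starting a second one.
        # Assumes version names sort chronologically (e.g. timestamps).
        versions = sorted(p for p in exp_dir.iterdir() if p.is_dir())
        if versions:
            return versions[-1]
    # First launch, or nothing to resume: create a fresh version directory.
    log_dir = exp_dir / new_version
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir
```

With a helper like this, a Slurm job that is requeued with `resume=True` would get back the directory created by its first launch, so all event files end up under one run.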

Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
Contributor

github-actions bot commented Mar 4, 2025

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Mar 4, 2025
Contributor

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Mar 11, 2025
@blisc blisc changed the title from "exp manager updates" to "Update exp_manager to better handle parallelism" Mar 12, 2025
@blisc blisc reopened this Mar 12, 2025
blisc previously approved these changes Mar 12, 2025
@blisc blisc marked this pull request as ready for review March 12, 2025 14:27
Signed-off-by: Jason <jasoli@nvidia.com>
@github-actions github-actions bot removed the stale label Mar 13, 2025
@blisc blisc added the Run CICD label Mar 13, 2025
@blisc blisc enabled auto-merge (squash) March 13, 2025 17:11
@blisc blisc disabled auto-merge March 13, 2025 17:11
@blisc blisc enabled auto-merge (squash) March 13, 2025 17:11
@blisc blisc merged commit 19ba856 into NVIDIA:main Mar 13, 2025
194 checks passed
yuanzhedong pushed a commit that referenced this pull request Mar 18, 2025
Signed-off-by: Paarth Neekhara <pneekhara@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Signed-off-by: Yuanzhe Dong <yudong@nvidia.com>
yuanzhedong added a commit that referenced this pull request Mar 23, 2025
* Update exp_manager to better handle parallelism (#12211)

Signed-off-by: Paarth Neekhara <pneekhara@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Signed-off-by: Yuanzhe Dong <yudong@nvidia.com>

* Add Neva support for VLM inference (#12531)

* Add Neva support for vlm inference

* Apply isort and black reformatting

Signed-off-by: meatybobby <meatybobby@users.noreply.github.com>

* keep legacy generate

---------

Signed-off-by: meatybobby <meatybobby@users.noreply.github.com>
Signed-off-by: meatybobby <bobchen@nvidia.com>
Co-authored-by: meatybobby <meatybobby@users.noreply.github.com>
Signed-off-by: Yuanzhe Dong <yudong@nvidia.com>

* fix automodel benchmark script

Signed-off-by: Yuanzhe Dong <yudong@nvidia.com>

* use hf mock dataset

Signed-off-by: Yuanzhe Dong <yudong@nvidia.com>

---------

Signed-off-by: Paarth Neekhara <pneekhara@nvidia.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Yuanzhe Dong <yudong@nvidia.com>
Signed-off-by: meatybobby <meatybobby@users.noreply.github.com>
Signed-off-by: meatybobby <bobchen@nvidia.com>
Co-authored-by: Paarth Neekhara <paarth.n@gmail.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: meatybobby <bobchen@nvidia.com>
Co-authored-by: meatybobby <meatybobby@users.noreply.github.com>
cspades pushed a commit that referenced this pull request Mar 24, 2025
2 participants