feat: introduce megatron checkpoint dir precedence #665
Signed-off-by: Terry Kong <terryk@nvidia.com>
Completed the tech pubs review of docs/design-docs/training-backends.md and provided some copyedits for grammar and punctuation. Also revised the Configuration heading levels to fix "stacked" headings (one heading placed directly beneath another without any intervening body text), which tend to break the flow.
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Jialei Chen <jialeic@google.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>
When running multinode, the job will fail unless the user knows to mount /opt/checkpoints on every node. This change introduces a directory precedence for the Megatron checkpoint dir. Most users will specify HF_HOME, so this will work seamlessly for those training large models on multiple nodes.
Also, default to the cache dir otherwise, since /opt is usually reserved for "optional packages" as opposed to data mounts (which usually reside in /mnt).
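
For reviewers who want a concrete picture, here is a rough sketch of how this kind of directory precedence could be resolved. It is not the code in this PR. The override variable name (EXAMPLE_MEGATRON_CKPT_DIR), the subdirectory name (example_megatron_ckpts), and the function itself are hypothetical placeholders. The only pieces taken from the description above are that HF_HOME is honored when set and that a cache dir is the fallback, assumed here to be ~/.cache/huggingface, the default location HF_HOME points to when unset.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_checkpoint_dir(
    env: Optional[Mapping[str, str]] = None,
    override_var: str = "EXAMPLE_MEGATRON_CKPT_DIR",  # hypothetical name
    subdir: str = "example_megatron_ckpts",  # hypothetical name
) -> Path:
    """Pick a checkpoint directory using a simple precedence order."""
    env = os.environ if env is None else env

    # 1. An explicit override wins if the user set one (assumed step).
    explicit = env.get(override_var)
    if explicit:
        return Path(explicit)

    # 2. Otherwise reuse HF_HOME, which most multinode users already
    #    point at storage that is mounted on every node.
    hf_home = env.get("HF_HOME")
    if hf_home:
        return Path(hf_home) / subdir

    # 3. Fall back to the user cache dir rather than /opt.
    return Path.home() / ".cache" / "huggingface" / subdir
```

With HF_HOME=/mnt/hf and no override set, this sketch resolves to /mnt/hf/example_megatron_ckpts, so every node that mounts /mnt/hf sees the same checkpoint location without an extra mount.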