Description
Describe the bug
When using the Megatron backend, the distributed optimizer combined with overlap_param_gather causes training divergence for some algorithms. Because of this, we've blocked users from running with overlap_param_gather + the distributed optimizer. We need to fix this bug to get the maximum performance from Megatron.
Steps/Code to reproduce bug
Please list the minimal steps or a code snippet that allows us to reproduce the bug.
A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
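As a starting point, here is a minimal, hypothetical reproduction sketch: train the same model twice with identical seeds and data order, changing only overlap_param_gather, and compare the loss curves. The two flag names (use_distributed_optimizer, overlap_param_gather) follow Megatron-LM's naming, but the dict-based config and the run_training helper are illustrative placeholders, not the repository's actual API.

```python
# Hypothetical repro sketch (placeholders, not the repository's actual API):
# run the same training job twice, toggling only overlap_param_gather,
# and compare per-step losses to isolate the effect of the overlapped
# parameter all-gather.

baseline_cfg = {
    "use_distributed_optimizer": True,
    "overlap_param_gather": False,  # known-good baseline
}

buggy_cfg = {
    "use_distributed_optimizer": True,
    "overlap_param_gather": True,   # combination reported to diverge
}

def run_training(cfg: dict) -> list[float]:
    """Placeholder: run a short training job and return its loss curve."""
    raise NotImplementedError("wire this up to the Megatron-backend trainer")

# A growing gap between the two curves indicates the divergence:
# baseline_losses = run_training(baseline_cfg)
# buggy_losses = run_training(buggy_cfg)
```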
Expected behavior
Training with the distributed optimizer and overlap_param_gather enabled should converge the same way as training with overlap_param_gather disabled, so the combination can be unblocked.
Environment overview (please complete the following information)
- Environment location: [Bare-metal, Docker, Cloud (specify cloud provider - AWS, Azure, GCP, Colab)]
- Method of install: [pip install or from source]. Please specify exact commands you used to install.
- If method of install is [Docker], provide the docker pull and docker run commands used
Environment details
If an NVIDIA docker image is used, you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version
Additional context
Add any other context about the problem here.
Example: GPU model