[mcore] moonlight (small model with deepseekv3 arch) #1284
Conversation
Thank you for the great work. Could you provide a dpsk-v3 script in the examples for testing?
Just added a script.
@ISEEKYAN I ran dpsk-v3 with this patch, but it failed. Does this patch support training for dpsk-v3?
What is the error info? It is supposed to support dpsk-v3 except for the MTP part for now. I will update the PR to support the full version.
Thank you for your reply.
Python Dependencies and Versions:
Error Log:
@ISEEKYAN could you please take a look at this error and help me fix it?
The mcore version was a pre-release 0.12 when I initially made this PR, and there was a known bug at
It might be due to a TE version mismatch; I used
I updated the PR on top of the latest version of verl to use the EP/avoid_pad_logits optimizations, which reduce GPU memory consumption considerably. Once merged with #1638, it will be possible to train Moonlight-16B-A3B with 1 node/8 GPUs. To train dpsk-v3 671B, a few TODOs remain:
@jinqinn @duomicoding your participation would be very helpful.
Update: my last run with 2 nodes achieved 87 on gsm8k, and training was noticeably more stable than in my experiments last month; training moonlight is now ready.
@ISEEKYAN
Please check the latest curve at url; the ppo_kl curve is now much better than the April version.
OOM error:
@ISEEKYAN Thank you for your reply. Regarding the rising ppo_kl curve and the declining reward curve, have you made any new findings or found any bugs?
With the latest updates, ppo_kl increases very slowly and the reward no longer declines.
```python
dtype=dtype,
use_cpu_initialization=False,
add_bias_linear=False,
attention_backend=AttnBackend.fused,
```
Is `AttnBackend.fused` specific to the deepseek v3 model? Is `AttnBackend.auto` enough here?
When fed `AttnBackend.auto`, TE would use flash, but flash attention is not implemented for MLA. The error is:
`ValueError: No dot product attention backend is available for the provided inputs. Please run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for disabling all backends.`
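For reference, here is a minimal sketch of pinning the fused backend in a Megatron-core `TransformerConfig`. This assumes the mcore 0.12-era API (where `AttnBackend` lives in `megatron.core.transformer.enums`); the model dimensions are illustrative, not Moonlight's real config:

```python
# Minimal sketch (mcore 0.12-era API assumed): pin the attention backend so
# TE does not auto-select flash attention, which has no MLA implementation.
from megatron.core.transformer import TransformerConfig
from megatron.core.transformer.enums import AttnBackend

config = TransformerConfig(
    num_layers=4,                         # illustrative dimensions only
    hidden_size=1024,
    num_attention_heads=8,
    add_bias_linear=False,
    use_cpu_initialization=False,
    attention_backend=AttnBackend.fused,  # .auto would pick flash -> ValueError for MLA
)
```

Pinning the backend explicitly keeps the choice visible in the config instead of depending on TE's auto-selection heuristics.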
```bash
data.max_response_length=512 \
data.filter_overlong_prompts=True \
data.truncation='error' \
+data.trust_remote_code=True \
```
Should `trust_remote_code` be set in the `model` section per ppo_megatron_trainer.yaml?
Oh, data preprocessing might need this as well. Please ignore this if I misunderstand.
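To make this concrete, a rough sketch of the data-preprocessing side, where the Hugging Face tokenizer also needs the flag. The model name below is only an example of a repo that ships custom tokenizer code, and is not tied to this PR's scripts:

```python
# Illustrative only: repos that ship custom code need trust_remote_code=True
# when loading the tokenizer, independent of the trainer-side config flag.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "moonshotai/Moonlight-16B-A3B",  # example model repo with custom code
    trust_remote_code=True,
)
dataset = load_dataset("openai/gsm8k", "main")  # gsm8k itself ships no custom code
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": dataset["train"][0]["question"]}],
    tokenize=False,
    add_generation_prompt=True,
)
```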
Is this another topic beyond supporting moonlight? Would it be better to submit another small PR for the config file modification?
Agree. We should track any change in the config and keep it consistent.
```python
# limitations under the License.

# there are bugs in mcore 0.12, so we need to patch them:
# 1. `get_query_key_value_tensors` in `multi_latent_attention.py` behaves incorrectly when packed_seq_params is not None
```
Just curious: is there an issue open in the Megatron-LM repo so that we can track the patch fix accordingly?
It is supposed to be fixed in 0.13.
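Until 0.13 is available, the workaround pattern can be sketched as a version-gated monkey-patch. Everything below is schematic: `patched_get_query_key_value_tensors` is a placeholder that just delegates to the original, the real corrected body lives in this PR's patch module, and the class/module names assume mcore's MLA layout:

```python
# Schematic version-gated monkey-patch (placeholder body): swap the buggy
# method on mcore 0.12 until the upstream fix in 0.13 ships.
from importlib.metadata import version

from megatron.core.transformer.multi_latent_attention import MLASelfAttention

_orig_get_qkv = MLASelfAttention.get_query_key_value_tensors

def patched_get_query_key_value_tensors(self, *args, **kwargs):
    """Placeholder: the actual packed_seq_params fix is in this PR's patch module."""
    return _orig_get_qkv(self, *args, **kwargs)

if version("megatron-core").startswith("0.12"):
    MLASelfAttention.get_query_key_value_tensors = patched_get_query_key_value_tensors
```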
LGTM!!
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

Support training with deepseekv3 671B; support MTP on top of #1284. It is now functionally ready for 671B, but still lacks practical validation.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this
```

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.
Achieved 74.3 on gsm8k, while Moonlight is reported at 77.4; still WIP on the performance diff.