Skip to content

Conversation

therealnaveenkamal
Copy link
Contributor

@therealnaveenkamal therealnaveenkamal commented Jun 21, 2025

What does this PR do ?

Add MLflow integration for experiment tracking and model management in NeMo RL

Issues

Closes #514 - Support for MLflow
This PR addresses the community request for MLflow support alongside existing TensorBoard and Weights & Biases logging options.

Usage

# Install MLflow
uv pip install mlflow

# Run SFT training with MLflow logging
uv run python examples/run_sft.py --config examples/configs/sft_mlflow.yaml

# Start MLflow UI to view results
mlflow ui --host 0.0.0.0 --port 5000

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

This PR adds comprehensive MLflow integration including:

  • MLflowLogger class with support for metrics, hyperparameters, and plot logging
  • Graceful fallbacks for missing configuration keys
  • Example configuration (sft_mlflow.yaml) for SFT training with MLflow
  • Updated README with installation and usage instructions
  • Fixed test coverage for MLflow plot logging functionality

The integration follows the existing logger architecture and provides a seamless experience for users who want to use MLflow for experiment tracking alongside or instead of TensorBoard and Weights & Biases.

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
@therealnaveenkamal therealnaveenkamal changed the title Feat: MLFlow Integration for experiment tracking MLFlow Integration for experiment tracking Jun 21, 2025
@therealnaveenkamal therealnaveenkamal changed the title MLFlow Integration for experiment tracking feat: MLFlow Integration for experiment tracking Jun 21, 2025
Copy link
Contributor

@SahilJain314 SahilJain314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing! @terrykong to comment on the feature broadly

Copy link
Contributor

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @therealnaveenkamal for the contribution! appreciate your help making nemo-rl better!

left some comments below

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jun 27, 2025
Copy link
Contributor

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple other comments

@therealnaveenkamal
Copy link
Contributor Author

Sure @terrykong , working on your comments. I'll ping you here once I'm ready. Thanks for your support.

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
@therealnaveenkamal
Copy link
Contributor Author

@terrykong I think I've addressed all your comments. please review and let me know if further edits are required.

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Jul 2, 2025
@therealnaveenkamal
Copy link
Contributor Author

@terrykong / @SahilJain314 not sure why the workflows have failed. do I have to do anything from my side? the lint & the copyright checks passed when I run the workflows locally.

@terrykong terrykong added CI:docs Run doctest and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 6, 2025
@terrykong terrykong self-requested a review July 6, 2025 04:43
@terrykong
Copy link
Contributor

@therealnaveenkamal I believe the copyright one is orthogonal to your change so you can ignore that (it doesn't gate our CI, but it should CC: @chtruong814)

The linter one fails. can you install the precommit hooks and run it on your branch?

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
@github-actions github-actions bot added the CI Relating to CI label Jul 7, 2025
@github-actions github-actions bot removed the CI Relating to CI label Jul 7, 2025
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
@therealnaveenkamal
Copy link
Contributor Author

@terrykong thank you. resolved the lint errors.

@terrykong terrykong removed the CI:docs Run doctest label Jul 7, 2025
@terrykong terrykong added the CI:docs Run doctest label Jul 7, 2025
@terrykong
Copy link
Contributor

thanks @therealnaveenkamal. The CI still says there's some unresolved errors:

image

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
@therealnaveenkamal
Copy link
Contributor Author

image

@terrykong sorry my bad, I didn't do a lint check after resolving merge conflicts. i think everything should be good now. thanks.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong
Copy link
Contributor

the linter is failing due to an error that usually indicates the uv.lock needs to be re-locked. i made a PR against your fork to update:

therealnaveenkamal#1

update uv.lock to account for the new dependency
@therealnaveenkamal
Copy link
Contributor Author

Thanks @terrykong, merged.

@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:docs Run doctest labels Jul 14, 2025
@terrykong
Copy link
Contributor

Hi @therealnaveenkamal . I think the CI is having some issues with your PR (CC @chtruong814 ). I'll open a new PR, resolve the issues, and make sure to credit your contribution

@therealnaveenkamal
Copy link
Contributor Author

Thanks for the support @terrykong !! Looking forward to contributing more.

@terrykong
Copy link
Contributor

Appreciate doing most of the heavy lifting @therealnaveenkamal ! I just pushed it over the finish.

I'll close this since #697 has landed

@terrykong terrykong closed this Jul 21, 2025
@tpoisonooo tpoisonooo mentioned this pull request Jul 23, 2025
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI:L1 Run doctests, unit tests, and functional tests documentation Improvements or additions to documentation external
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for MLflow
4 participants