
Conversation

mahdikhashan
Member

ref: #3951


Hi @mahdikhashan. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mahdikhashan
Member Author

mahdikhashan commented Jan 7, 2025

Hi @andreyvelich, shall I keep it under user-guides/hp-tuning/?

@andreyvelich
Member

Sure, I think we can create a new page for this feature.
FYI, please follow the contribution guide to sign the commits: https://www.kubeflow.org/docs/about/contributing/#getting-started
cc @helenxie-bit

@andreyvelich
Member

Part of: kubeflow/katib#2339

Member

@Arhell Arhell left a comment


/ok-to-test

or the [Kubeflow Katib GitHub](https://github.com/kubeflow/katib/issues).
{{% /alert %}}

This page describes Large Language Models hyperparameter (HP) optimization Python API that Katib supports and how to configure
Contributor


describes how to implement Hyperparameter optimization (HPO) using Python API ...

Member Author


done. thank you.

+++

{{% alert title="Warning" color="warning" %}}
This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
Contributor


Each web page has a feedback button at the bottom for users to add their feedback and create an issue if needed.
cc @andreyvelich

Member


We explicitly added this warning for this guide, since this feature might be unstable, and we want to hear user feedback.

@@ -0,0 +1,351 @@
+++
title = "How to Optimize Hyperparameters of LLMs with Kubeflow"
Contributor


suggestion:

**How to implement Hyperparameter optimization (HPO)**

@andreyvelich to add comments on this.

Member


Should we keep this name:

How to Optimize Hyperparameters for LLMs Fine-Tuning with Kubeflow

Member Author


done.

- [Optimizing Hyperparameters of Large Language Models](#optimizing-hyperparameters-of-large-language-models)
- [Example: Optimizing Hyperparameters of Llama-3.2 for Binary Classification on IMDB Dataset](#example-optimizing-hyperparameters-of-llama-32-for-binary-classification-on-imdb-dataset)

## Prerequisites
Contributor


Thanks for including the prerequisites. I'm wondering whether these prerequisites should apply to all of the docs under docs/components/katib/user-guides/hp-tuning/ and, in that case, should be listed on this page.

Member Author


I'm not sure - I checked some of the other similar docs under Katib, and I'd say it may not make sense for them.

Member


Usually, we don't need it, since these prerequisites are explained in the Getting Started guide.

@@ -0,0 +1,351 @@
+++
title = "How to Optimize Hyperparameters of LLMs with Kubeflow"
description = "API description"
Contributor


The description could include more information about this page.
Additionally, it would be great to have a short paragraph explaining more about this topic: what we are trying to achieve and why. It should also include a reference to this topic so the audience can learn more about it.

Member Author


Yes, you are right - I'll extend it. Thanks for reminding me of this.

Member Author


done.

| `parallel_trial_count` | Number of trials to run in parallel, set to `2`. |
| `resources_per_trial` | Resources allocated for each trial: 2 GPUs, 4 CPUs, 10GB memory. |

```python
Contributor


@mahdikhashan if you haven't tested the code yet, we should mark this PR as hold. Please let us know. Thank you.

@mahdikhashan
Member Author

@varodrig thanks for your time and help with reviewing it - I'll address your requested changes.
Regarding the code, we have a notebook example that @helenxie-bit and I are collaborating on.

Notebook example issue: kubeflow/katib#2480

There is an in-progress PR related to this (regarding e2e tests; it's not specifically about this one, but I have put it on hold to incorporate the latest possible changes).

@google-oss-prow google-oss-prow bot removed the lgtm label Feb 12, 2025
@andreyvelich andreyvelich changed the title [USERGUIDE] LLM Hyperparameter Optimization API katib: [USERGUIDE] LLM Hyperparameter Optimization API Feb 13, 2025
Member

@andreyvelich andreyvelich left a comment


Thank you for this effort @mahdikhashan!
I left a few comments.

@@ -0,0 +1,351 @@
+++
Member


I would keep this guide under /user-guides/llm-hp-optimization.md for now for more visibility.
WDYT @mahdikhashan @helenxie-bit @Electronic-Waste?

Member Author


agreed. done.

Comment on lines 13 to 14
This page describes how to implement Hyperparameter Optimization (HPO) using Python API that Katib supports and how to configure
it.
Member


Modify this message to say that this page describes how to optimize HPs in the process of LLMs Fine-Tuning.

Member Author


done.

This page describes how to implement Hyperparameter Optimization (HPO) using Python API that Katib supports and how to configure
it.

## Sections
Member


We can remove this Sections list, since the website has an outline in the right panel.

Member Author


done.

)
```

#### HuggingFaceModelParams
Member


Can we move these sections to the Training Operator doc and cross-reference it from this doc?
https://www.kubeflow.org/docs/components/trainer/legacy-v1/user-guides/fine-tuning/
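
For readers following the thread, here is a minimal sketch of what the HuggingFaceModelParams section covers. The import path and field names below are assumptions based on the Training Operator SDK, so treat them as illustrative rather than authoritative:

```python
import transformers
from kubeflow.storage_initializer.hugging_face import HuggingFaceModelParams

# Assumed field names (model_uri, transformer_type); see the Training Operator
# fine-tuning guide linked above for the authoritative definition.
model_params = HuggingFaceModelParams(
    model_uri="hf://google-bert/bert-base-cased",
    transformer_type=transformers.AutoModelForSequenceClassification,
)
```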


### Key Parameters for LLM Hyperparameter Tuning

| **Parameter** | **Description** | **Required** |
Member


Not all of these parameters should be used for LLMs.
Please exclude the ones that can't be used with the LLM Trainer (e.g. objective).

Contributor


That makes sense. Then I guess these three parameters (objective, base_image, and parameters) should be removed.

secret_key="YOUR_SECRET_KEY"
)
```
## Optimizing Hyperparameters of Large Language Models
Member


We should clearly say that right now the user can tune parameters from training_parameters and lora_config.
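
To make that concrete, here is a hedged sketch of tuning only fields inside training_parameters and lora_config with the Katib search API; the specific hyperparameters and ranges are illustrative assumptions, not recommendations:

```python
import kubeflow.katib as katib
import transformers
from peft import LoraConfig
from kubeflow.storage_initializer.hugging_face import HuggingFaceTrainerParams

# Only fields inside training_parameters and lora_config are marked as tunable;
# the values below are placeholders chosen for illustration.
trainer_parameters = HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="results",
        save_strategy="no",
        learning_rate=katib.search.double(min=1e-05, max=5e-05),
        num_train_epochs=3,
    ),
    lora_config=LoraConfig(
        r=katib.search.int(min=8, max=32),
        lora_alpha=8,
        lora_dropout=0.1,
        bias="none",
    ),
)
```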

algorithm_name = "random",
max_trial_count = 10,
parallel_trial_count = 2,
resources_per_trial={
Member


I guess we should use TrainerResource here, shouldn't we, @mahdikhashan @helenxie-bit?

Contributor


Thank you for pointing this out! Yes, we should use something like this:

resources_per_trial=katib.TrainerResources(
    num_workers=2,
    num_procs_per_worker=2,
    resources_per_worker={"gpu": 2, "cpu": 4, "memory": "10G"},
),
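
For context, here is a sketch of where TrainerResources could sit inside the full tune() call; the argument names, model, and dataset below are assumptions for illustration and should be checked against the final guide:

```python
import kubeflow.katib as katib
import transformers
from peft import LoraConfig
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)

cl = katib.KatibClient(namespace="kubeflow")

# Illustrative values only; the point is where resources_per_trial plugs in.
cl.tune(
    name="llm-hp-optimization",
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(repo_id="imdb"),
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="results",
            learning_rate=katib.search.double(min=1e-05, max=5e-05),
        ),
        lora_config=LoraConfig(r=katib.search.int(min=8, max=32)),
    ),
    objective_metric_name="train_loss",
    objective_type="minimize",
    algorithm_name="random",
    max_trial_count=10,
    parallel_trial_count=2,
    resources_per_trial=katib.TrainerResources(
        num_workers=2,
        num_procs_per_worker=2,
        resources_per_worker={"gpu": 2, "cpu": 4, "memory": "10G"},
    ),
)
```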

cl.wait_for_experiment_condition(name=exp_name)

# Get the best hyperparameters.
print(cl.get_optimal_hyperparameters(exp_name))
Member


We need to show output for the Experiment here.

Contributor


@andreyvelich Do you mean adding screenshots of the results of this example at the end?

Contributor


Got it! I will work on the e2e test to ensure this example works, then we can add the output here.

Member

@andreyvelich andreyvelich left a comment


Hi @mahdikhashan, did you get a chance to address the remaining comments, so we can merge this PR?

@mahdikhashan
Member Author

Hi @mahdikhashan, did you get a chance to address the remaining comments, so we can merge this PR?

Thanks for reminding me of this. I'll do it ASAP.

Member

@andreyvelich andreyvelich left a comment


I think, we should be good to merge it.
Let's address the follow-up changes in the next PR.
Thank you @mahdikhashan @helenxie-bit!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit de75668 into kubeflow:master Mar 25, 2025
6 checks passed
jaiakash pushed a commit to jaiakash/website that referenced this pull request Apr 10, 2025
* add base md
* update title and description
* add draft code
* add prerequisites
* add huggingface api details,s3 api, update example
* remove redundant text
* add HuggingFaceTrainerParams description
* update prerequisites
* update code example
* add sections
* replace langauge models with large language models
* improve prerequisites
* algorithm_name is optional
* objective_type is optional
* objective_metric_name is optional
* remove redundant example
* change tune args to optional
* add search api
* update link title
* add two scenarios for tune function with custom objective or loading model and parameters from hugging face
* add link for custom objective function example
* improve tune section
* improve title
* fix failing ci
* add warning of alpha api
* improve links
* improve python code consistency
* define search space for r in LoraConfig
* remove redundant line
* make sure imports are all consistent in snippets
* improve link
* improve fine-tune section
* improve links in prerequisites
* improve structure of integrations section
* add missing import
* replace local address instead of hardcoded link to website
* fix import
* use hyperparameter optimization instead of fine-tune
* fix header levels
* replace code import
* replace name
* replace definition of distributed training
* decrease header level
* decrease header level
* update configuration for `resource_per_trial`
* update header title
* update training operator control plane
* improve prerequisites
* update prerequisites
* update page into description
* update page description
* update page title and adjust description letter
* move the doc to the parent folder
* remove section
* modify message of the page
* fix typo
* fix typo
* mention Training Operator

---------

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>