katib: [USERGUIDE] LLM Hyperparameter Optimization API #3952
Conversation
Hi @mahdikhashan. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
hi @andreyvelich, shall I keep it under the existing hyperparameter tuning docs, or create a new page?
Sure, I think we can create a new page for this feature.
Part of: kubeflow/katib#2339
/ok-to-test
mahdikhashan force-pushed the branch from a120413 to aa3b2be
or the [Kubeflow Katib GitHub](https://github.com/kubeflow/katib/issues).
{{% /alert %}}

This page describes Large Language Models hyperparameter (HP) optimization Python API that Katib supports and how to configure
describes how to implement Hyperparameter optimization (HPO) using Python API ...
done. thank you.
+++

{{% alert title="Warning" color="warning" %}}
This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
each web page has a feedback button at the bottom that lets users add their feedback and create an issue if needed.
cc @andreyvelich
We explicitly added this warning for this guide, since this feature might be unstable, and we want to hear user feedback.
@@ -0,0 +1,351 @@
+++
title = "How to Optimize Hyperparameters of LLMs with Kubeflow"
suggestion:
**How to implement Hyperparameter Optimization (HPO)**
@andreyvelich to add comments on this.
Should we keep this name:
How to Optimize Hyperparameters for LLMs Fine-Tuning with Kubeflow
done.
- [Optimizing Hyperparameters of Large Language Models](#optimizing-hyperparameters-of-large-language-models)
- [Example: Optimizing Hyperparameters of Llama-3.2 for Binary Classification on IMDB Dataset](#example-optimizing-hyperparameters-of-llama-32-for-binary-classification-on-imdb-dataset)

## Prerequisites
thanks for including the prerequisites. I'm wondering whether these prerequisites apply to all of docs/components/katib/user-guides/hp-tuning/, and in that case whether they should be listed on this page.
I'm not sure. I checked some of the other similar docs under Katib, and I'd say for them it may not make sense.
Usually we don't need it, since these prerequisites are explained in the Getting Started guide.
@@ -0,0 +1,351 @@
+++
title = "How to Optimize Hyperparameters of LLMs with Kubeflow"
description = "API description"
The description could include more information about this page.
Additionally, it would be great to have a short paragraph explaining more about this topic, what we are trying to achieve and why, and to include a reference so the audience can learn more about it.
yes, you are right - I'll extend it. thanks for reminding me of this.
done.
| `parallel_trial_count` | Number of trials to run in parallel, set to `2`. |
| `resources_per_trial` | Resources allocated for each trial: 2 GPUs, 4 CPUs, 10GB memory. |
@mahdikhashan if you haven't tested the code yet, we should mark this PR as `/hold`. Please let us know, thank you.
@varodrig thanks for your time and help with reviewing it. I'll address your requested changes. NB: example issue kubeflow/katib#2480. There is an in-progress PR related to e2e tests; it's not specifically about this, but I have held off to incorporate the latest possible changes.
Thank you for this effort @mahdikhashan!
I left a few comments.
@@ -0,0 +1,351 @@
+++
I would keep this guide under /user-guides/llm-hp-optimization.md for now for more visibility.
WDYT @mahdikhashan @helenxie-bit @Electronic-Waste?
agreed. done.
This page describes how to implement Hyperparameter Optimization (HPO) using Python API that Katib supports and how to configure
it.
Modify this message to say that this page describes how to optimize HPs in the process of LLMs Fine-Tuning.
done.
This page describes how to implement Hyperparameter Optimization (HPO) using Python API that Katib supports and how to configure
it.

## Sections
We can remove this Sections heading, since the website has an outline in the right panel.
done.
```python
)
```

#### HuggingFaceModelParams
Can we move these sections to the Training Operator doc and cross-reference it from this doc?
https://www.kubeflow.org/docs/components/trainer/legacy-v1/user-guides/fine-tuning/
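For quick reference while that cross-reference is pending, here is a minimal sketch of how these provider parameter classes are instantiated. The import path and field names are assumed from the Training Operator SDK linked above; the model and dataset ids are illustrative placeholders, not values confirmed by this guide:

```python
import transformers
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
)

# Which pretrained model the storage initializer downloads, and the
# transformers class used to load it (illustrative model id).
model_params = HuggingFaceModelParams(
    model_uri="hf://meta-llama/Llama-3.2-1B",
    transformer_type=transformers.AutoModelForSequenceClassification,
)

# Which Hugging Face dataset each Trial fine-tunes on (illustrative id).
dataset_params = HuggingFaceDatasetParams(
    repo_id="stanfordnlp/imdb",
    split="train[:1000]",
)
```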
### Key Parameters for LLM Hyperparameter Tuning

| **Parameter** | **Description** | **Required** |
Not all of these parameters should be used for LLMs.
Please exclude the ones that can't be used with the LLM Trainer (e.g. `objective`).
That makes sense. Then I guess these three parameters, `objective`, `base_image`, and `parameters`, should be removed.
```python
    secret_key="YOUR_SECRET_KEY"
)
```
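The `secret_key` in the excerpt above belongs to the S3 dataset configuration. As context, a hedged sketch of what the full object might look like, assuming the `S3DatasetParams` class and field names from the Training Operator storage initializer (all values are placeholders):

```python
from kubeflow.storage_initializer.s3 import S3DatasetParams

# Placeholder credentials; in practice these would come from a secret store.
s3_params = S3DatasetParams(
    endpoint_url="https://s3.amazonaws.com",
    bucket_name="your-bucket",
    file_key="your-dataset-file",
    region_name="us-east-1",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
)
```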
## Optimizing Hyperparameters of Large Language Models
We should clearly say that right now the user can tune parameters from `training_parameters` and `lora_config`.
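To make that concrete, a minimal sketch of how the two tunable groups are declared, assuming the `katib.search` helpers and `HuggingFaceTrainerParams` from the Katib and Training Operator SDKs (the specific ranges are illustrative):

```python
import kubeflow.katib as katib
import transformers
from kubeflow.storage_initializer.hugging_face import HuggingFaceTrainerParams
from peft import LoraConfig

trainer_parameters = HuggingFaceTrainerParams(
    # Search space over HuggingFace TrainingArguments fields.
    training_parameters=transformers.TrainingArguments(
        output_dir="results",
        learning_rate=katib.search.double(min=1e-5, max=5e-5),
        num_train_epochs=1,
    ),
    # Search space over LoRA fields, e.g. the adapter rank r.
    lora_config=LoraConfig(
        r=katib.search.int(min=8, max=32),
        lora_alpha=8,
        lora_dropout=0.1,
    ),
)
```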
```python
algorithm_name = "random",
max_trial_count = 10,
parallel_trial_count = 2,
resources_per_trial={
```
I guess we should use `TrainerResources` here, shouldn't we @mahdikhashan @helenxie-bit?
Thank you for pointing this out! Yes, we should use something like this:

```python
resources_per_trial=katib.TrainerResources(
    num_workers=2,
    num_procs_per_worker=2,
    resources_per_worker={"gpu": 2, "cpu": 4, "memory": "10G"},
),
```
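If I read the SDK correctly, `num_workers` here maps to the number of PyTorchJob worker replicas per Trial and `resources_per_worker` is requested per replica, so this configuration would presumably request 2 x 2 GPUs per Trial in total; worth double-checking against the Training Operator docs.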
```python
cl.wait_for_experiment_condition(name=exp_name)

# Get the best hyperparameters.
print(cl.get_optimal_hyperparameters(exp_name))
```
We need to show output for the Experiment here.
@andreyvelich Do you mean adding screenshots of the results of this example at the end?
If we add the console output, we should be good.
Like here: https://www.kubeflow.org/docs/components/katib/getting-started/#:~:text=You%20should%20get%20similar%20output%20for%20the%20most%20optimal%20Trial%2C%20hyperparameters%2C%20and%20observation%20metrics%3A
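Until the real console output is captured, a runnable sketch of how the same information can be pulled programmatically; the client methods are the ones already used in this example, and the Experiment name is a hypothetical stand-in:

```python
from kubeflow.katib import KatibClient

cl = KatibClient(namespace="kubeflow")
exp_name = "llama-binary-classification"  # hypothetical Experiment name

# Block until the Experiment reaches a terminal condition.
cl.wait_for_experiment_condition(name=exp_name)

# Print the most optimal Trial's hyperparameters and observation metrics.
print(cl.get_optimal_hyperparameters(exp_name))
```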
Got it! I will work on the e2e test to ensure this example works, then we can add the output here.
Hi @mahdikhashan, did you get a chance to address the remaining comments, so we can merge this PR?
thanks for reminding me of this. I'll do it asap.
I think we should be good to merge it.
Let's address the follow-up changes in the next PR.
Thank you @mahdikhashan @helenxie-bit!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
* add base md
* update title and description
* add draft code
* add prerequisites
* add huggingface api details, s3 api, update example
* remove redundant text
* add HuggingFaceTrainerParams description
* update prerequisites
* update code example
* add sections
* replace langauge models with large language models
* improve prerequisites
* algorithm_name is optional
* objective_type is optional
* objective_metric_name is optional
* remove redundant example
* change tune args to optional
* add search api
* update link title
* add two scenarios for tune function with custom objective or loading model and parameters from hugging face
* add link for custom objective function example
* improve tune section
* improve title
* fix failing ci
* add warning of alpha api
* improve links
* improve python code consistency
* define search space for r in LoraConfig
* remove redundant line
* make sure imports are all consistent in snippets
* improve link
* improve fine-tune section
* improve links in prerequisites
* improve structure of integrations section
* add missing import
* replace local address instead of hardcoded link to website
* fix import
* use hyperparameter optimization instead of fine-tune
* fix header levels
* replace code import
* replace name
* replace definition of distributed training
* decrease header level
* decrease header level
* update configuration for `resource_per_trial`
* update header title
* update training operator control plane
* improve prerequisites
* update prerequisites
* update page into description
* update page description
* update page title and adjust description letter
* move the doc to the parent folder
* remove section
* modify message of the page
* fix typo
* fix typo
* mention Training Operator

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
ref: #3951