
Conversation

mahdikhashan
Member

ref: #3951


Hi @mahdikhashan. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mahdikhashan
Member Author

mahdikhashan commented Jan 7, 2025

Hi @andreyvelich, shall I keep it under user-guides/hp-tuning/?

@andreyvelich
Member

Sure, I think we can create a new page for this feature.
FYI, please follow the contribution guide to sign the commits: https://www.kubeflow.org/docs/about/contributing/#getting-started
cc @helenxie-bit

@andreyvelich
Member

Part of: kubeflow/katib#2339

Member

@Arhell Arhell left a comment


/ok-to-test

or the [Kubeflow Katib GitHub](https://github.com/kubeflow/katib/issues).
{{% /alert %}}

This page describes Large Language Models hyperparameter (HP) optimization Python API that Katib supports and how to configure
Contributor


describes how to implement Hyperparameter optimization (HPO) using Python API ...

Member Author


done. thank you.

+++

{{% alert title="Warning" color="warning" %}}
This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
Contributor


Each web page has a feedback button at the bottom for users to add their feedback and create an issue if needed.
cc @andreyvelich

Member


We explicitly added this warning for this guide, since this feature might be unstable, and we want to hear user feedback.

@@ -0,0 +1,351 @@
+++
title = "How to Optimize Hyperparameters of LLMs with Kubeflow"
Contributor


suggestion:

**How to implement Hyperparameter optimization (HPO)**

@andreyvelich to add comments on this.

Member


Should we keep this name:

How to Optimize Hyperparameters for LLMs Fine-Tuning with Kubeflow

Member Author


done.

- [Optimizing Hyperparameters of Large Language Models](#optimizing-hyperparameters-of-large-language-models)
- [Example: Optimizing Hyperparameters of Llama-3.2 for Binary Classification on IMDB Dataset](#example-optimizing-hyperparameters-of-llama-32-for-binary-classification-on-imdb-dataset)

## Prerequisites
Contributor


Thanks for including the prerequisites. I'm wondering whether these prerequisites should apply to all of the docs under docs/components/katib/user-guides/hp-tuning/ and, in that case, should be listed on this page.

Member Author


I'm not sure - I checked some of the other similar docs under Katib, and I'd say it may not make sense for them.

Member


Usually, we don't need it, since these prerequisites are explained in the Getting Started guide.

@@ -0,0 +1,351 @@
+++
title = "How to Optimize Hyperparameters of LLMs with Kubeflow"
description = "API description"
Contributor


The description could include more information about this page.
Additionally, it would be great to have a short paragraph explaining more about this topic: what we are trying to achieve and why. It should also include a reference to this topic so the audience can learn more about it.

Member Author


Yes, you are right - I'll extend it. Thanks for reminding me of this.

Member Author


done.

| `parallel_trial_count` | Number of trials to run in parallel, set to `2`. |
| `resources_per_trial` | Resources allocated for each trial: 2 GPUs, 4 CPUs, 10GB memory. |

```python
Contributor


@mahdikhashan if you haven't tested the code yet, we should mark this PR as hold. Please let us know. Thank you.

@mahdikhashan
Member Author

@varodrig thanks for your time and help with reviewing it - I'll address your requested changes.
Regarding the code, we have a notebook example that @helenxie-bit and I are collaborating on.

Notebook example issue: kubeflow/katib#2480

There is an in-progress PR related to this (regarding e2e tests; it's not specifically about this one, but I have put it on hold to incorporate the latest possible changes).

@google-oss-prow google-oss-prow bot removed the lgtm label Feb 12, 2025
@andreyvelich andreyvelich changed the title [USERGUIDE] LLM Hyperparameter Optimization API katib: [USERGUIDE] LLM Hyperparameter Optimization API Feb 13, 2025
Member

@andreyvelich andreyvelich left a comment


Thank you for this effort @mahdikhashan!
I left a few comments.

@@ -0,0 +1,351 @@
+++
Member


I would keep this guide under /user-guides/llm-hp-optimization.md for now for more visibility.
WDYT @mahdikhashan @helenxie-bit @Electronic-Waste?

Member Author


agreed. done.

Comment on lines 13 to 14
This page describes how to implement Hyperparameter Optimization (HPO) using Python API that Katib supports and how to configure
it.
Member


Modify this message to say that this page describes how to optimize HPs in the process of LLMs Fine-Tuning.

Member Author


done.

This page describes how to implement Hyperparameter Optimization (HPO) using Python API that Katib supports and how to configure
it.

## Sections
Member


We can remove this Sections list, since the website has an outline in the right panel.

Member Author


done.

)
```

#### HuggingFaceModelParams
Member


Can we move these sections to the Training Operator doc and cross-reference it from this doc?
https://www.kubeflow.org/docs/components/trainer/legacy-v1/user-guides/fine-tuning/
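
For readers following the thread, here is a minimal sketch of what the HuggingFaceModelParams section covers. The import path and field names below are assumptions based on the Training Operator SDK, so treat them as illustrative rather than authoritative:

```python
import transformers
from kubeflow.storage_initializer.hugging_face import HuggingFaceModelParams

# Assumed field names (model_uri, transformer_type); see the Training Operator
# fine-tuning guide linked above for the authoritative definition.
model_params = HuggingFaceModelParams(
    model_uri="hf://google-bert/bert-base-cased",
    transformer_type=transformers.AutoModelForSequenceClassification,
)
```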


### Key Parameters for LLM Hyperparameter Tuning

| **Parameter** | **Description** | **Required** |
Member


Not all of these parameters should be used for LLMs.
Please exclude the ones that can't be used with the LLM Trainer (e.g. objective).

Contributor


That makes sense. Then I guess these three parameters (objective, base_image, and parameters) should be removed.

secret_key="YOUR_SECRET_KEY"
)
```
## Optimizing Hyperparameters of Large Language Models
Member


We should clearly say that right now the user can tune parameters from training_parameters and lora_config.
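
To make that concrete, here is a hedged sketch of tuning only fields inside training_parameters and lora_config with the Katib search API; the specific hyperparameters and ranges are illustrative assumptions, not recommendations:

```python
import kubeflow.katib as katib
import transformers
from peft import LoraConfig
from kubeflow.storage_initializer.hugging_face import HuggingFaceTrainerParams

# Only fields inside training_parameters and lora_config are marked as tunable;
# the values below are placeholders chosen for illustration.
trainer_parameters = HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="results",
        save_strategy="no",
        learning_rate=katib.search.double(min=1e-05, max=5e-05),
        num_train_epochs=3,
    ),
    lora_config=LoraConfig(
        r=katib.search.int(min=8, max=32),
        lora_alpha=8,
        lora_dropout=0.1,
        bias="none",
    ),
)
```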

algorithm_name = "random",
max_trial_count = 10,
parallel_trial_count = 2,
resources_per_trial={
Member


I guess we should use TrainerResource here, shouldn't we, @mahdikhashan @helenxie-bit?

Contributor


Thank you for pointing this out! Yes, we should use something like this:

resources_per_trial=katib.TrainerResources(
    num_workers=2,
    num_procs_per_worker=2,
    resources_per_worker={"gpu": 2, "cpu": 4, "memory": "10G"},
),
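
For context, here is a sketch of where TrainerResources could sit inside the full tune() call; the argument names, model, and dataset below are assumptions for illustration and should be checked against the final guide:

```python
import kubeflow.katib as katib
import transformers
from peft import LoraConfig
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)

cl = katib.KatibClient(namespace="kubeflow")

# Illustrative values only; the point is where resources_per_trial plugs in.
cl.tune(
    name="llm-hp-optimization",
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(repo_id="imdb"),
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="results",
            learning_rate=katib.search.double(min=1e-05, max=5e-05),
        ),
        lora_config=LoraConfig(r=katib.search.int(min=8, max=32)),
    ),
    objective_metric_name="train_loss",
    objective_type="minimize",
    algorithm_name="random",
    max_trial_count=10,
    parallel_trial_count=2,
    resources_per_trial=katib.TrainerResources(
        num_workers=2,
        num_procs_per_worker=2,
        resources_per_worker={"gpu": 2, "cpu": 4, "memory": "10G"},
    ),
)
```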

cl.wait_for_experiment_condition(name=exp_name)

# Get the best hyperparameters.
print(cl.get_optimal_hyperparameters(exp_name))
Member


We need to show output for the Experiment here.

Contributor


@andreyvelich Do you mean adding screenshots of the results of this example at the end?

Contributor


Got it! I will work on the e2e test to ensure this example works, then we can add the output here.

Member

@andreyvelich andreyvelich left a comment


Hi @mahdikhashan, did you get a chance to address the remaining comments, so we can merge this PR?

@mahdikhashan
Member Author

Hi @mahdikhashan, did you get a chance to address the remaining comments, so we can merge this PR?

Thanks for reminding me of this. I'll do it ASAP.

Member

@andreyvelich andreyvelich left a comment


I think, we should be good to merge it.
Let's address the follow-up changes in the next PR.
Thank you @mahdikhashan @helenxie-bit!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit de75668 into kubeflow:master Mar 25, 2025
6 checks passed
jaiakash pushed a commit to jaiakash/website that referenced this pull request Apr 10, 2025
* add base md
* update title and description
* add draft code
* add prerequisites
* add huggingface api details,s3 api, update example
* remove redundant text
* add HuggingFaceTrainerParams description
* update prerequisites
* update code example
* add sections
* replace langauge models with large language models
* improve prerequisites
* algorithm_name is optional
* objective_type is optional
* objective_metric_name is optional
* remove redundant example
* change tune args to optional
* add search api
* update link title
* add two scenarios for tune function with custom objective or loading model and parameters from hugging face
* add link for custom objective function example
* improve tune section
* improve title
* fix failing ci
* add warning of alpha api
* improve links
* improve python code consistency
* define search space for r in LoraConfig
* remove redundant line
* make sure imports are all consistent in snippets
* improve link
* improve fine-tune section
* improve links in prerequisites
* improve structure of integrations section
* add missing import
* replace local address instead of hardcoded link to website
* fix import
* use hyperparameter optimization instead of fine-tune
* fix header levels
* replace code import
* replace name
* replace definition of distributed training
* decrease header level
* decrease header level
* update configuration for `resource_per_trial`
* update header title
* update training operator control plane
* improve prerequisites
* update prerequisites
* update page into description
* update page description
* update page title and adjust description letter
* move the doc to the parent folder
* remove section
* modify message of the page
* fix typo
* fix typo
* mention Training Operator

---------

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>