
Conversation

helenxie-bit
Contributor

What this PR does / why we need it:
This PR adds an e2e test for the tune API, specifically for the scenario of importing external models and datasets for LLM hyperparameter optimization.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing
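For orientation, the shape of the scenario under test can be sketched as below. Everything here (the function name, parameters, and the toy objective) is a hypothetical placeholder rather than the Katib SDK's real tune API: an external model and dataset are imported, and a hyperparameter (here, learning rate) is tuned against an objective.

```python
# Hypothetical sketch of the scenario this e2e test covers. Names below are
# illustrative placeholders, NOT the Katib SDK's real API.
def tune_llm(model_uri, dataset_repo, search_space, max_trials):
    """Toy driver: evaluate candidates and keep the lowest eval loss."""
    trials = [
        {"learning_rate": lr, "eval_loss": abs(lr - 3e-5) * 1e4}  # toy objective
        for lr in search_space["learning_rate"][:max_trials]
    ]
    return min(trials, key=lambda t: t["eval_loss"])

best = tune_llm(
    model_uri="hf://<external-model>",    # stands in for the imported model
    dataset_repo="<external-dataset>",    # stands in for the imported dataset
    search_space={"learning_rate": [1e-5, 3e-5, 5e-5]},
    max_trials=3,
)
print(best["learning_rate"])  # → 3e-05
```

The real test drives a Katib Experiment on a cluster rather than a local loop; the sketch only illustrates the inputs and the selection criterion.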

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
@helenxie-bit
Contributor Author

/area gsoc

@helenxie-bit
Contributor Author

Ref: #2339

@helenxie-bit helenxie-bit changed the title [GSoC] Add e2e test for tune api with LLM hyperparameter optimization [WIP] Add e2e test for tune api with LLM hyperparameter optimization Sep 3, 2024
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Sep 3, 2024
@helenxie-bit
Contributor Author

helenxie-bit commented Mar 18, 2025


> Hi @helenxie-bit, we are planning to cut the Katib release this week. Do you think you can finish this PR?

@andreyvelich Thank you for catching up! I'm working on this. But the e2e test failed due to some problem inside the trainer. Here is the error message:

I0318 22:47:28.491627     308 main.go:396] Trial Name: tune-example-llm-optimization-mkfm67k9
I0318 22:47:33.946979     308 main.go:139] 2025-03-18T22:47:33Z INFO     Starting HuggingFace LLM Trainer
I0318 22:47:33.950305     308 main.go:139] /usr/local/lib/python3.10/dist-packages/accelerate/state.py:313: UserWarning: OMP_NUM_THREADS/MKL_NUM_THREADS unset, we set it at 8 to improve oob performance.
I0318 22:47:33.950324     308 main.go:139]   warnings.warn(
I0318 22:47:33.952095     308 main.go:139] /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1317: UserWarning: For MPI backend, world_size (1) and rank (0) are ignored since they are assigned by the MPI runtime.
I0318 22:47:33.952106     308 main.go:139]   warnings.warn(
I0318 22:47:34.003708     308 main.go:139] /usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1815: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
I0318 22:47:34.003725     308 main.go:139]   warnings.warn(
I0318 22:47:34.005569     308 main.go:139] 2025-03-18T22:47:34Z INFO     Setup model and tokenizer
I0318 22:47:34.006007     308 main.go:139] /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
I0318 22:47:34.006018     308 main.go:139]   warnings.warn(
I0318 22:47:35.597752     308 main.go:139] [rank0]: Traceback (most recent call last):
I0318 22:47:35.597801     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 398, in cached_file
I0318 22:47:35.597818     308 main.go:139] [rank0]:     resolved_file = hf_hub_download(
I0318 22:47:35.597822     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
I0318 22:47:35.597834     308 main.go:139] [rank0]:     return fn(*args, **kwargs)
I0318 22:47:35.597842     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 862, in hf_hub_download
I0318 22:47:35.597856     308 main.go:139] [rank0]:     return _hf_hub_download_to_cache_dir(
I0318 22:47:35.597863     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 969, in _hf_hub_download_to_cache_dir
I0318 22:47:35.597875     308 main.go:139] [rank0]:     _raise_on_head_call_error(head_call_error, force_download, local_files_only)
I0318 22:47:35.597882     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1477, in _raise_on_head_call_error
I0318 22:47:35.597893     308 main.go:139] [rank0]:     raise LocalEntryNotFoundError(
I0318 22:47:35.597898     308 main.go:139] [rank0]: huggingface_hub.errors.LocalEntryNotFoundError: Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable hf.co look-ups and downloads online, set 'local_files_only' to False.

I checked the logs of the master pod, and it only has two containers: pytorch and metrics-logger-and-collector. It seems the storage-initializer container was not created.

command: kubectl logs tune-example-llm-optimization-mkfm67k9-master-0 -n default

Defaulted container "pytorch" out of: pytorch, metrics-logger-and-collector

I'm not sure if it has something to do with the update of training operator. Do you have any ideas?
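For context, when an external model is imported, the master pod is expected to carry the download step before training starts. A heavily abbreviated, hypothetical sketch of what that pod spec would look like (container names and the image are assumptions, not taken from the operator's actual manifests):

```yaml
# Hypothetical, abbreviated master pod spec; names and image are assumptions.
spec:
  initContainers:
    - name: storage-initializer        # downloads model/dataset before training
      image: <storage-initializer-image>
      volumeMounts:
        - name: model-volume
          mountPath: /workspace
  containers:
    - name: pytorch                    # trainer reads the files from /workspace
      volumeMounts:
        - name: model-volume
          mountPath: /workspace
```

If the init container is missing from the generated spec, the trainer will try to fetch the model itself, which fails when outgoing traffic is disabled, matching the LocalEntryNotFoundError above.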

By the way, I've installed the Training Operator control plane v1.8.1. I tried to install the latest Training Operator control plane by running kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=master", but it shows the following error. I'm not sure if it has something to do with the storage-initializer error:

error: evalsymlink failure on '/private/var/folders/l3/jvrwplzx77z55jbbtyh6nxbw0000gn/T/kustomize-2661365325/manifests/overlays/standalone' : lstat /private/var/folders/l3/jvrwplzx77z55jbbtyh6nxbw0000gn/T/kustomize-2661365325/manifests/overlays/standalone: no such file or directory

Update (2025-03-27):
I identified the cause: the order of worker_pod_template_spec and master_pod_template_spec needs to be reversed to match the implementation in this util function. I've fixed it in this PR.

# Generate the Trial template using the PyTorchJob.
trial_template = utils.get_trial_template_with_pytorchjob(
    retain_trials,
    trial_parameters,
    resources_per_trial,
    worker_pod_template_spec,
    master_pod_template_spec,
)
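To make the ordering bug concrete: with positional arguments, swapping the two pod template specs raises no error, it just silently builds the wrong Trial template. Below is a simplified, hypothetical stand-in for the util function (the real one lives in the Katib test utilities):

```python
# Hypothetical, simplified stand-in for the Katib util function. Positional
# arguments make the worker/master order easy to swap without any error.
def get_trial_template_with_pytorchjob(
    retain_trials, trial_parameters, resources_per_trial,
    worker_pod_template_spec, master_pod_template_spec,
):
    # The function trusts the caller's argument order.
    return {
        "retain": retain_trials,
        "parameters": trial_parameters,
        "resources": resources_per_trial,
        "worker": worker_pod_template_spec,
        "master": master_pod_template_spec,
    }

worker_spec = {"containers": ["pytorch"]}
master_spec = {"initContainers": ["storage-initializer"], "containers": ["pytorch"]}

# Passing (master, worker) instead of (worker, master) puts the storage
# initializer on the workers and leaves the master without it.
wrong = get_trial_template_with_pytorchjob(True, {}, {}, master_spec, worker_spec)
right = get_trial_template_with_pytorchjob(True, {}, {}, worker_spec, master_spec)
print("initContainers" in wrong["master"])  # → False
print("initContainers" in right["master"])  # → True
```

This matches the symptom above: the master pod comes up with only the pytorch and metrics containers, and no storage-initializer.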

@helenxie-bit
Contributor Author

> TypeError: Object of type LoraRuntimeConfig is not JSON serializable
>
> it seems that the reason for test failure on my machine is
>
> TypeError: Object of type LoraRuntimeConfig is not JSON serializable
>
> my python version is Python 3.12.7.

@mahdikhashan Hmmm, that's strange. It seems the problem is that the type should be LoraConfig instead of LoraRuntimeConfig. Can you check the versions of peft and transformers on your machine? The correct versions should be 0.3.0 and 4.38.0, respectively.

@mahdikhashan
Member

> @mahdikhashan Hmmm, that's strange. It seems the problem is that the type should be LoraConfig instead of LoraRuntimeConfig. Can you check the version of peft and transformers in your device?

Yes, I'll do so and share the full testing environment, so we can work on it.
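Independent of the version mismatch, this class of error can be sidestepped generically when serializing trainer parameters, assuming the config object exposes a to_dict() method (peft's config classes do): fall back to it in a custom JSON encoder. ConfigLike below is a stand-in class, not the real peft type.

```python
import json

class ConfigLike:
    """Stand-in for a config object such as peft's LoraConfig (hypothetical)."""
    def __init__(self, r, lora_alpha):
        self.r = r
        self.lora_alpha = lora_alpha
    def to_dict(self):
        return {"r": self.r, "lora_alpha": self.lora_alpha}

class ConfigEncoder(json.JSONEncoder):
    def default(self, obj):
        # Fall back to to_dict() for objects json doesn't know how to serialize.
        if hasattr(obj, "to_dict"):
            return obj.to_dict()
        return super().default(obj)

payload = json.dumps({"lora_config": ConfigLike(r=8, lora_alpha=16)}, cls=ConfigEncoder)
print(payload)  # → {"lora_config": {"r": 8, "lora_alpha": 16}}
```

With the pinned versions, the plain json.dumps path works because the expected type serializes cleanly; the encoder is only a defensive sketch.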

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
@helenxie-bit
Contributor Author

helenxie-bit commented Mar 27, 2025

@andreyvelich @mahdikhashan Thank you for the review! I've incorporated your suggestions, and this PR is now ready for review.

Note: I'm also currently testing the example provided in this user guide, but I've encountered an issue related to downloading the model in the storage-initializer. Here's the original error message:

2025-03-27T21:53:28Z INFO     Downloading model
2025-03-27T21:53:28Z INFO     ----------------------------------------
/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/storage_initializer/storage.py", line 50, in <module>
    model_factory(args.model_provider, args.model_provider_parameters)
  File "/app/storage_initializer/storage.py", line 12, in model_factory
    hf.download_model_and_tokenizer()
  File "/app/storage_initializer/hugging_face.py", line 68, in download_model_and_tokenizer
    transformer_type_class.from_pretrained(
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 521, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1135, in from_pretrained
    return config_class.from_dict(config_dict, **unused_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/configuration_utils.py", line 763, in from_dict
    config = cls(**config_dict)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/configuration_llama.py", line 160, in __init__
    self._rope_scaling_validation()
  File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/configuration_llama.py", line 180, in _rope_scaling_validation
    raise ValueError(
ValueError: `rope_scaling` must be a dictionary with with two fields, `type` and `factor`, got {'factor': 32.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

I suspect this error is due to a package version incompatibility. Updating transformers from 4.38.0 to 4.50.2 resolved it. However, after upgrading, a new tokenizer loading issue appeared:

2025-03-27T21:58:37Z INFO     Downloading model
2025-03-27T21:58:37Z INFO     ----------------------------------------
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/storage_initializer/storage.py", line 50, in <module>
    model_factory(args.model_provider, args.model_provider_parameters)
  File "/app/storage_initializer/storage.py", line 12, in model_factory
    hf.download_model_and_tokenizer()
  File "/app/storage_initializer/hugging_face.py", line 74, in download_model_and_tokenizer
    transformers.AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 916, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2255, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'meta-llama/Llama-3.2-1B'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'meta-llama/Llama-3.2-1B' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

I'm actively working on fixing this new issue, but it may take some additional time. How about we proceed to review and merge this PR first and handle the example issue separately in this follow-up issue? Please let me know what you think.

Update (2025-03-28): To fix the above errors, I created a PR here. Please review when you have time, @andreyvelich @mahdikhashan. Thanks!
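Since both failures above traced back to package versions, a small pin check in the test setup can fail fast on mismatches. Here is a sketch using only the standard library; the pinned version below is the one that worked in this thread and is an assumption that will drift over time:

```python
# Pin taken from the discussion above; adjust as the trainer images are updated.
EXPECTED = {"transformers": "4.50.2"}

def check_pins(expected, installed):
    """Return a list of human-readable mismatches between pins and installs."""
    problems = []
    for pkg, want in expected.items():
        have = installed.get(pkg)
        if have is None:
            problems.append(f"{pkg}: not installed (want {want})")
        elif have != want:
            problems.append(f"{pkg}: have {have}, want {want}")
    return problems

# Example with a fake "installed" map so the sketch runs anywhere; in a real
# environment, build it with importlib.metadata.version(pkg) for each pin.
print(check_pins(EXPECTED, {"transformers": "4.38.0"}))
```

Running such a check at the start of the e2e test would surface the rope_scaling-style incompatibilities before a Trial is even created.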

@mahdikhashan
Member

I think once kubeflow/trainer#2576 is merged, we can review and merge this one.

@andreyvelich
Member

@mahdikhashan @helenxie-bit Are we ready to merge this ?

@mahdikhashan
Member

> @mahdikhashan @helenxie-bit Are we ready to merge this ?

/lgtm

@andreyvelich
Member

/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 73b8c5c into kubeflow:master Jun 26, 2025
66 checks passed