
Conversation

helenxie-bit
Contributor

What this PR does / why we need it:
This PR adds an e2e test for the tune API, specifically for the scenario of importing external models and datasets for LLM hyperparameter optimization.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing
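For orientation, the shape of the scenario under test can be sketched as below. Everything here (the function name, parameters, and the toy objective) is a hypothetical placeholder rather than the Katib SDK's real tune API: an external model and dataset are imported, and a hyperparameter (here, learning rate) is tuned against an objective.

```python
# Hypothetical sketch of the scenario this e2e test covers. Names below are
# illustrative placeholders, NOT the Katib SDK's real API.
def tune_llm(model_uri, dataset_repo, search_space, max_trials):
    """Toy driver: evaluate candidates and keep the lowest eval loss."""
    trials = [
        {"learning_rate": lr, "eval_loss": abs(lr - 3e-5) * 1e4}  # toy objective
        for lr in search_space["learning_rate"][:max_trials]
    ]
    return min(trials, key=lambda t: t["eval_loss"])

best = tune_llm(
    model_uri="hf://<external-model>",    # stands in for the imported model
    dataset_repo="<external-dataset>",    # stands in for the imported dataset
    search_space={"learning_rate": [1e-5, 3e-5, 5e-5]},
    max_trials=3,
)
print(best["learning_rate"])  # → 3e-05
```

The real test drives a Katib Experiment on a cluster rather than a local loop; the sketch only illustrates the inputs and the selection criterion.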

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
@helenxie-bit
Contributor Author

/area gsoc

@helenxie-bit
Contributor Author

Ref: #2339

@helenxie-bit helenxie-bit changed the title [GSoC] Add e2e test for tune api with LLM hyperparameter optimization [WIP] Add e2e test for tune api with LLM hyperparameter optimization Sep 3, 2024
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Sep 3, 2024
@helenxie-bit
Contributor Author

helenxie-bit commented Mar 18, 2025


> Hi @helenxie-bit, we are planning to cut the Katib release this week. Do you think you can finish this PR?

@andreyvelich Thank you for catching up! I'm working on this. But the e2e test failed due to some problem inside the trainer. Here is the error message:

I0318 22:47:28.491627     308 main.go:396] Trial Name: tune-example-llm-optimization-mkfm67k9
I0318 22:47:33.946979     308 main.go:139] 2025-03-18T22:47:33Z INFO     Starting HuggingFace LLM Trainer
I0318 22:47:33.950305     308 main.go:139] /usr/local/lib/python3.10/dist-packages/accelerate/state.py:313: UserWarning: OMP_NUM_THREADS/MKL_NUM_THREADS unset, we set it at 8 to improve oob performance.
I0318 22:47:33.950324     308 main.go:139]   warnings.warn(
I0318 22:47:33.952095     308 main.go:139] /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:1317: UserWarning: For MPI backend, world_size (1) and rank (0) are ignored since they are assigned by the MPI runtime.
I0318 22:47:33.952106     308 main.go:139]   warnings.warn(
I0318 22:47:34.003708     308 main.go:139] /usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1815: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
I0318 22:47:34.003725     308 main.go:139]   warnings.warn(
I0318 22:47:34.005569     308 main.go:139] 2025-03-18T22:47:34Z INFO     Setup model and tokenizer
I0318 22:47:34.006007     308 main.go:139] /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
I0318 22:47:34.006018     308 main.go:139]   warnings.warn(
I0318 22:47:35.597752     308 main.go:139] [rank0]: Traceback (most recent call last):
I0318 22:47:35.597801     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 398, in cached_file
I0318 22:47:35.597818     308 main.go:139] [rank0]:     resolved_file = hf_hub_download(
I0318 22:47:35.597822     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
I0318 22:47:35.597834     308 main.go:139] [rank0]:     return fn(*args, **kwargs)
I0318 22:47:35.597842     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 862, in hf_hub_download
I0318 22:47:35.597856     308 main.go:139] [rank0]:     return _hf_hub_download_to_cache_dir(
I0318 22:47:35.597863     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 969, in _hf_hub_download_to_cache_dir
I0318 22:47:35.597875     308 main.go:139] [rank0]:     _raise_on_head_call_error(head_call_error, force_download, local_files_only)
I0318 22:47:35.597882     308 main.go:139] [rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1477, in _raise_on_head_call_error
I0318 22:47:35.597893     308 main.go:139] [rank0]:     raise LocalEntryNotFoundError(
I0318 22:47:35.597898     308 main.go:139] [rank0]: huggingface_hub.errors.LocalEntryNotFoundError: Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable hf.co look-ups and downloads online, set 'local_files_only' to False.

I checked the logs of the master pod, and it only has two containers: pytorch and metrics-logger-and-collector. It seems the storage-initializer container was not created.

command: kubectl logs tune-example-llm-optimization-mkfm67k9-master-0 -n default

Defaulted container "pytorch" out of: pytorch, metrics-logger-and-collector

I'm not sure if it has something to do with the update of training operator. Do you have any ideas?
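For context, when an external model is imported, the master pod is expected to carry the download step before training starts. A heavily abbreviated, hypothetical sketch of what that pod spec would look like (container names and the image are assumptions, not taken from the operator's actual manifests):

```yaml
# Hypothetical, abbreviated master pod spec; names and image are assumptions.
spec:
  initContainers:
    - name: storage-initializer        # downloads model/dataset before training
      image: <storage-initializer-image>
      volumeMounts:
        - name: model-volume
          mountPath: /workspace
  containers:
    - name: pytorch                    # trainer reads the files from /workspace
      volumeMounts:
        - name: model-volume
          mountPath: /workspace
```

If the init container is missing from the generated spec, the trainer will try to fetch the model itself, which fails when outgoing traffic is disabled, matching the LocalEntryNotFoundError above.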

By the way, I've installed the Training Operator control plane v1.8.1. I tried to install the latest Training Operator control plane by running kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=master", but it shows the following error. I'm not sure if it has something to do with the storage-initializer error:

error: evalsymlink failure on '/private/var/folders/l3/jvrwplzx77z55jbbtyh6nxbw0000gn/T/kustomize-2661365325/manifests/overlays/standalone' : lstat /private/var/folders/l3/jvrwplzx77z55jbbtyh6nxbw0000gn/T/kustomize-2661365325/manifests/overlays/standalone: no such file or directory

Update (2025-03-27):
I identified the cause: the order of worker_pod_template_spec and master_pod_template_spec needs to be reversed to match the implementation in this util function. I've fixed it in this PR.

# Generate the Trial template using the PyTorchJob.
trial_template = utils.get_trial_template_with_pytorchjob(
    retain_trials,
    trial_parameters,
    resources_per_trial,
    worker_pod_template_spec,
    master_pod_template_spec,
)
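To make the ordering bug concrete: with positional arguments, swapping the two pod template specs raises no error, it just silently builds the wrong Trial template. Below is a simplified, hypothetical stand-in for the util function (the real one lives in the Katib test utilities):

```python
# Hypothetical, simplified stand-in for the Katib util function. Positional
# arguments make the worker/master order easy to swap without any error.
def get_trial_template_with_pytorchjob(
    retain_trials, trial_parameters, resources_per_trial,
    worker_pod_template_spec, master_pod_template_spec,
):
    # The function trusts the caller's argument order.
    return {
        "retain": retain_trials,
        "parameters": trial_parameters,
        "resources": resources_per_trial,
        "worker": worker_pod_template_spec,
        "master": master_pod_template_spec,
    }

worker_spec = {"containers": ["pytorch"]}
master_spec = {"initContainers": ["storage-initializer"], "containers": ["pytorch"]}

# Passing (master, worker) instead of (worker, master) puts the storage
# initializer on the workers and leaves the master without it.
wrong = get_trial_template_with_pytorchjob(True, {}, {}, master_spec, worker_spec)
right = get_trial_template_with_pytorchjob(True, {}, {}, worker_spec, master_spec)
print("initContainers" in wrong["master"])  # → False
print("initContainers" in right["master"])  # → True
```

This matches the symptom above: the master pod comes up with only the pytorch and metrics containers, and no storage-initializer.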

@helenxie-bit
Contributor Author

> TypeError: Object of type LoraRuntimeConfig is not JSON serializable
>
> it seems that the reason for test failure on my machine is
>
> TypeError: Object of type LoraRuntimeConfig is not JSON serializable
>
> my python version is Python 3.12.7.

@mahdikhashan Hmmm, that's strange. It seems the problem is that the type should be LoraConfig instead of LoraRuntimeConfig. Can you check the versions of peft and transformers on your machine? The correct versions should be 0.3.0 and 4.38.0, respectively.

@mahdikhashan
Member

> @mahdikhashan Hmmm, that's strange. It seems the problem is that the type should be LoraConfig instead of LoraRuntimeConfig. Can you check the version of peft and transformers in your device?

Yes, I'll do so and share the full testing environment, so we can work on it.
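Independent of the version mismatch, this class of error can be sidestepped generically when serializing trainer parameters, assuming the config object exposes a to_dict() method (peft's config classes do): fall back to it in a custom JSON encoder. ConfigLike below is a stand-in class, not the real peft type.

```python
import json

class ConfigLike:
    """Stand-in for a config object such as peft's LoraConfig (hypothetical)."""
    def __init__(self, r, lora_alpha):
        self.r = r
        self.lora_alpha = lora_alpha
    def to_dict(self):
        return {"r": self.r, "lora_alpha": self.lora_alpha}

class ConfigEncoder(json.JSONEncoder):
    def default(self, obj):
        # Fall back to to_dict() for objects json doesn't know how to serialize.
        if hasattr(obj, "to_dict"):
            return obj.to_dict()
        return super().default(obj)

payload = json.dumps({"lora_config": ConfigLike(r=8, lora_alpha=16)}, cls=ConfigEncoder)
print(payload)  # → {"lora_config": {"r": 8, "lora_alpha": 16}}
```

With the pinned versions, the plain json.dumps path works because the expected type serializes cleanly; the encoder is only a defensive sketch.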

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
@helenxie-bit
Contributor Author

helenxie-bit commented Mar 27, 2025

@andreyvelich @mahdikhashan Thank you for the review! I've incorporated your suggestions, and this PR is now ready for review.

Note: I'm also currently testing the example provided in this user guide, but I've encountered an issue related to downloading the model in the storage-initializer. Here's the original error message:

2025-03-27T21:53:28Z INFO     Downloading model
2025-03-27T21:53:28Z INFO     ----------------------------------------
/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/storage_initializer/storage.py", line 50, in <module>
    model_factory(args.model_provider, args.model_provider_parameters)
  File "/app/storage_initializer/storage.py", line 12, in model_factory
    hf.download_model_and_tokenizer()
  File "/app/storage_initializer/hugging_face.py", line 68, in download_model_and_tokenizer
    transformer_type_class.from_pretrained(
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 521, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1135, in from_pretrained
    return config_class.from_dict(config_dict, **unused_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/configuration_utils.py", line 763, in from_dict
    config = cls(**config_dict)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/configuration_llama.py", line 160, in __init__
    self._rope_scaling_validation()
  File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/configuration_llama.py", line 180, in _rope_scaling_validation
    raise ValueError(
ValueError: `rope_scaling` must be a dictionary with with two fields, `type` and `factor`, got {'factor': 32.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

I suspect this error is due to a package version incompatibility. Updating transformers from 4.38.0 to 4.50.2 resolved it. However, after upgrading, a new tokenizer loading issue appeared:

2025-03-27T21:58:37Z INFO     Downloading model
2025-03-27T21:58:37Z INFO     ----------------------------------------
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/storage_initializer/storage.py", line 50, in <module>
    model_factory(args.model_provider, args.model_provider_parameters)
  File "/app/storage_initializer/storage.py", line 12, in model_factory
    hf.download_model_and_tokenizer()
  File "/app/storage_initializer/hugging_face.py", line 74, in download_model_and_tokenizer
    transformers.AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 916, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2255, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'meta-llama/Llama-3.2-1B'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'meta-llama/Llama-3.2-1B' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

I'm actively working on fixing this new issue, but it may take some additional time. How about we proceed to review and merge this PR first and handle the example issue separately in this follow-up issue? Please let me know what you think.

Update (2025-03-28): To fix the above errors, I created a PR here. Please review when you have time, @andreyvelich @mahdikhashan. Thanks!
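Since both failures above traced back to package versions, a small pin check in the test setup can fail fast on mismatches. Here is a sketch using only the standard library; the pinned version below is the one that worked in this thread and is an assumption that will drift over time:

```python
# Pin taken from the discussion above; adjust as the trainer images are updated.
EXPECTED = {"transformers": "4.50.2"}

def check_pins(expected, installed):
    """Return a list of human-readable mismatches between pins and installs."""
    problems = []
    for pkg, want in expected.items():
        have = installed.get(pkg)
        if have is None:
            problems.append(f"{pkg}: not installed (want {want})")
        elif have != want:
            problems.append(f"{pkg}: have {have}, want {want}")
    return problems

# Example with a fake "installed" map so the sketch runs anywhere; in a real
# environment, build it with importlib.metadata.version(pkg) for each pin.
print(check_pins(EXPECTED, {"transformers": "4.38.0"}))
```

Running such a check at the start of the e2e test would surface the rope_scaling-style incompatibilities before a Trial is even created.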

@mahdikhashan
Member

I think once kubeflow/trainer#2576 is merged, we can review and merge this one.

@andreyvelich
Member

@mahdikhashan @helenxie-bit Are we ready to merge this ?

@mahdikhashan
Member

> @mahdikhashan @helenxie-bit Are we ready to merge this ?

/lgtm

@andreyvelich
Member

/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 73b8c5c into kubeflow:master Jun 26, 2025
66 checks passed