Skip to content

Conversation

cdoern
Copy link
Contributor

@cdoern cdoern commented Apr 29, 2024

when downloading from HF, check check if the repo has .safetensor files. If it does, and --filename is default, download the entire repo and put it into a <repo_name> folder inside of models

Signed-off-by: Charlie Doern cdoern@redhat.com

@cdoern cdoern force-pushed the model-agnostic-download branch 2 times, most recently from 349e033 to b7140b1 Compare April 29, 2024 19:05
@cdoern cdoern force-pushed the model-agnostic-download branch from b7140b1 to 19a6b3e Compare April 29, 2024 19:12
when downloading from HF, check check if the repo has .safetensor files. If it does, and `--filename` is default, download the entire repo and put it into a <repo_name> folder inside of models
Signed-off-by: Charlie Doern <cdoern@redhat.com>
@cdoern cdoern force-pushed the model-agnostic-download branch from 19a6b3e to 6af9db5 Compare April 29, 2024 19:23
@russellb russellb added the hold In-progress PR. Tag should be removed before merge. label Apr 29, 2024
@russellb russellb requested a review from alimaredia April 29, 2024 19:24
@russellb russellb removed the hold In-progress PR. Tag should be removed before merge. label Apr 29, 2024
Copy link
Member

@russellb russellb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! I know @alimaredia is testing this, so I requested his review. It should block merging until he approves as well.

@@ -712,15 +712,36 @@ def chat(
def download(ctx, repository, release, filename, model_dir):
"""Download the model(s) to train"""
click.echo(f"Downloading model from {repository}@{release} to {model_dir}...")
if (
"HF_TOKEN" not in os.environ
and repository != "instructlab/merlinite-7b-lab-GGUF"
Copy link
Member

@jaideepr97 jaideepr97 Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this check? I think it might be best not to hardcode merlinite for such checks - it would also be good practice to always have an HF token exposed and would make the code easier and cleaner to maintain (even if technically you dont need a token for the instructlab org)

just my 0.02
especially considering this would fail if we try to pull the granite model even though it exists in the same instructlab HF org

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this was so at least the default config didn't require it, which is nice.

An even better UX would determine whether the token is required for a given model before then erroring out because it's not set, but that could be a future improvement ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right - then for now should we at least only check if "instructlab" is part of the repo string, and we allow the download without token if true? since we would want to at least have uniform behavior for all the models in the instructlab HF org

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about you use clicks' feature to read values from env vars?
@click.option('--hf-token', envvar='HF_TOKEN')

Copy link
Contributor Author

@cdoern cdoern Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merl is the default. This is my way of checking If NO HF TOKEN and THE USER SET REPO

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tiran yes! I was just looking at that yesterday when looking at #1022

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised by it asking me for a HF_TOKEN when I did ilab download --repository instructlab/merlinite-7b-lab as that repository does not need one. In my case, I specifically wanted it to download the non-gguf version of this model, but perhaps the logic around HF_TOKEN usage should be a bit broader?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was improved further in #1035

filename=filename,
local_dir=model_dir,
)
files = list_repo_files(repo_id=repository, token=os.getenv("HF_TOKEN"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdoern why not download the entire repository via snapshot_download() instead of going file by file? https://huggingface.co/docs/huggingface_hub/en/guides/download#download-an-entire-repository

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good idea to me +1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies @alimaredia -- once you posted this review, that made mergify happy that you looked at it. If you had done "request changes" it would have blocked it, just FYI.

I probably should just set a hold label and let you remove it

@mergify mergify bot merged commit 9a70e6b into instructlab:main Apr 29, 2024
@cdoern
Copy link
Contributor Author

cdoern commented Apr 29, 2024

uh oh, seems mergify has gone rogue!!!! I will open another PR to address folks' comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants