Tokenizer patch #537

Merged (4 commits) on Apr 9, 2024

@AkshitaB (Contributor) commented on Apr 9, 2024

If you just want to fix the tokenizer files, without running the full conversion:

fix_bad_tokenizer(path)  # assumes native OLMo compatibility, i.e., the original config.yaml must be present
write_tokenizer(path)
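
Since the snippet above depends on the original config.yaml being present, a small guard can make that precondition explicit before the fix is attempted. This is a minimal sketch, not part of the PR: `require_original_config` is a hypothetical helper name, and it assumes `path` is the checkpoint directory that contains config.yaml.

```python
import os


def require_original_config(checkpoint_dir: str) -> str:
    """Return the path to config.yaml inside checkpoint_dir, or raise if missing.

    Hypothetical helper for illustration; it is not defined in the PR.
    """
    config_path = os.path.join(checkpoint_dir, "config.yaml")
    if not os.path.isfile(config_path):
        # Fail early with a clear message instead of failing inside the fix step.
        raise FileNotFoundError(
            f"Expected the original OLMo config at {config_path}"
        )
    return config_path
```

Calling `require_original_config(path)` before `fix_bad_tokenizer(path)` and `write_tokenizer(path)` would surface a missing config as a clear error up front.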

@OyvindTafjord (Contributor) left a comment

Thanks, LGTM! At some point, to future-proof this for new tokenizers, maybe add a check that token 50279 is what we expect it to be.
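
The check suggested above could look roughly like the following. This is only a sketch: `check_token_id` is a hypothetical helper, and the expected token string `<|endoftext|>` is an assumption, since the thread does not say which token id 50279 should map to. It works on a plain token-to-id mapping so it does not depend on any particular tokenizer library.

```python
from typing import Mapping


def check_token_id(vocab: Mapping[str, int], token_id: int, expected_token: str) -> None:
    """Raise ValueError unless token_id maps to exactly expected_token in vocab.

    vocab maps token strings to integer ids. Hypothetical helper, not from the PR.
    """
    # Collect every token string that is assigned the id we care about.
    matches = [token for token, idx in vocab.items() if idx == token_id]
    if matches != [expected_token]:
        raise ValueError(
            f"Token id {token_id} maps to {matches!r}, expected {expected_token!r}"
        )


# Toy vocabulary for illustration only; the expected token is an assumption.
toy_vocab = {"hello": 0, "world": 1, "<|endoftext|>": 50279}
check_token_id(toy_vocab, 50279, "<|endoftext|>")  # passes silently
```

With a Hugging Face tokenizer, the mapping could come from `tokenizer.get_vocab()`, so the check could run right after the tokenizer files are written.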

def fix_bad_tokenizer(checkpoint_dir: str):
    path = os.path.join(checkpoint_dir, "config.yaml")
    conf = om.load(path)
    conf["tokenizer"]["identifier"] = "allenai/gpt-neox-olmo-dolma-v1_5"

Collaborator

Just checking, was the tokenizer also wrong for the OLMo models we released, or just for the 1.7 runs? I just want to make sure that this script doesn't break old checkpoints, or that we warn that there is such a risk.

Contributor Author

I ran this explicitly for the v1 models before releasing.

@AkshitaB merged commit 9d40898 into main on Apr 9, 2024
@AkshitaB deleted the AkshitaB-tokenizer-patch branch on April 9, 2024 23:20