Skip to content

Conversation

2015aroras
Copy link
Collaborator

Correction of #610, tested on a 1.7 7B checkpoint.

@2015aroras 2015aroras requested a review from epwalsh June 10, 2024 19:11
Comment on lines +206 to +213
config_path = Path(checkpoint_dir) / "config.yaml"
tokenizer_config = yaml.safe_load(config_path.read_text())["tokenizer"]

# Initialize tokenizer and validate vocab size.
if Path(tokenizer_config["identifier"]).is_file():
base_tokenizer = Tokenizer.from_file(tokenizer_config["identifier"])
else:
base_tokenizer = Tokenizer.from_pretrained(tokenizer_config["identifier"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to do this:

Suggested change
config_path = Path(checkpoint_dir) / "config.yaml"
tokenizer_config = yaml.safe_load(config_path.read_text())["tokenizer"]
# Initialize tokenizer and validate vocab size.
if Path(tokenizer_config["identifier"]).is_file():
base_tokenizer = Tokenizer.from_file(tokenizer_config["identifier"])
else:
base_tokenizer = Tokenizer.from_pretrained(tokenizer_config["identifier"])
base_tokenizer = Tokenizer.from_checkpoint(checkpoint_dir)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is Tokenizer from the tokenizers library, not the one from the OLMo code base. I made my first PR not remembering this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok

@2015aroras 2015aroras merged commit 578234d into main Jun 10, 2024
@2015aroras 2015aroras deleted the shanea/hf-get-tokenizer-from-config-2 branch June 10, 2024 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants