Closed
Description
System Info
- `transformers` version: 4.30.1
- Platform: Linux-5.15.107+-x86_64-with-glibc2.31
- Python version: 3.10.12
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1+cu118 (False)
- Tensorflow version (GPU?): 2.12.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.6.9 (cpu)
- Jax version: 0.4.10
- JaxLib version: 0.4.10
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The auto-converted fast tokenizer for the LLaMA model sometimes does not produce the same tokenization as the original SentencePiece tokenizer. This affects the OpenLLaMA models. Here's the code to reproduce it:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b', use_fast=False)
fast_tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b')

text = 'thermal'
print(tokenizer.encode(text))
print(fast_tokenizer.encode(text))
```
The code produces the following output:
```
[1, 14412]
[1, 31822, 496, 12719]
```
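To check how widespread the divergence is, a small helper can scan a word list and report every input the two tokenizers encode differently. This is a minimal sketch; `find_mismatches` is an illustrative name, not part of the `transformers` API, and the tokenizer objects are assumed to expose the usual `encode` method:

```python
def find_mismatches(slow_tok, fast_tok, texts):
    """Return (text, slow_ids, fast_ids) for every text the two tokenizers encode differently."""
    mismatches = []
    for t in texts:
        slow_ids = slow_tok.encode(t)
        fast_ids = fast_tok.encode(t)
        if slow_ids != fast_ids:
            # The two encodings disagree; record both for inspection.
            mismatches.append((t, slow_ids, fast_ids))
    return mismatches
```

With the tokenizers loaded as in the reproduction above, `find_mismatches(tokenizer, fast_tokenizer, ['thermal'])` would flag `'thermal'` with the two ID sequences shown.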
Expected behavior
The auto-converted fast tokenizer should produce the exact same tokens as the original sentencepiece tokenizer.
QuantumBoIt