Auto-Converted Fast Tokenizer Producing Incorrect Results #24233

### System Info

- `transformers` version: 4.30.1
- Platform: Linux-5.15.107+-x86_64-with-glibc2.31
- Python version: 3.10.12
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1+cu118 (False)
- Tensorflow version (GPU?): 2.12.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.6.9 (cpu)
- Jax version: 0.4.10
- JaxLib version: 0.4.10
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No

### Who can help?

@ArthurZucker 

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

The auto-converted fast tokenizer for the LLaMA model sometimes produces different tokens than the original SentencePiece tokenizer. This affects the OpenLLaMA models. The following code reproduces the problem:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b', use_fast=False)
fast_tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b')

text = 'thermal'
print(tokenizer.encode(text))
print(fast_tokenizer.encode(text))
```

The code produces the following output (slow tokenizer first, fast tokenizer second):

```
[1, 14412]
[1, 31822, 496, 12719]
```
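To see which pieces each id sequence corresponds to, the ids can be mapped back to token strings. This is a minimal sketch, assuming both tokenizer objects expose `encode` and `convert_ids_to_tokens` as `transformers` tokenizers do; `show_pieces` is a hypothetical helper name used here for illustration, not part of the library.

```python
def show_pieces(tok, text):
    """Return (id, piece) pairs for `text` so two tokenizations can be compared by eye.

    `tok` is assumed to be any object with `encode` and `convert_ids_to_tokens`
    methods (hypothetical helper, not a transformers API).
    """
    ids = list(tok.encode(text))
    pieces = tok.convert_ids_to_tokens(ids)
    return list(zip(ids, pieces))
```

Calling `show_pieces(tokenizer, text)` and `show_pieces(fast_tokenizer, text)` with the objects from the reproduction should make it easier to see where the fast tokenizer splits the word differently.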

### Expected behavior

The auto-converted fast tokenizer should produce exactly the same tokens as the original SentencePiece tokenizer.
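One way to check this expectation over more than a single word is to encode a list of inputs with both tokenizers and collect the disagreements. The sketch below assumes only that each tokenizer has an `encode` method returning a sequence of ids; `find_mismatches` is a hypothetical helper name, not something provided by `transformers`.

```python
def find_mismatches(slow_tok, fast_tok, texts):
    """Return (text, slow_ids, fast_ids) for every input where the tokenizers disagree.

    Both arguments are assumed to expose `encode(text)`; the helper itself is
    illustrative and not part of any library.
    """
    mismatches = []
    for text in texts:
        slow_ids = list(slow_tok.encode(text))
        fast_ids = list(fast_tok.encode(text))
        if slow_ids != fast_ids:
            mismatches.append((text, slow_ids, fast_ids))
    return mismatches
```

With the objects from the reproduction, `find_mismatches(tokenizer, fast_tokenizer, ['thermal'])` would be expected to return one entry on the versions listed above, and an empty list once the two tokenizers agree.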
