Description
System Info
transformers version: 4.28.1
Who can help?
@ArthurZucker
Hi, maybe the following issue should be asked here instead?
[Bug]? how does the tokenizer encode the special tokens? #1263
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Hi all, I used the tokenizer to process data for a LLaMA model (already converted to HF format) and set:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(llama_model_id, model_max_length=1024,
                                          padding_side='right', trust_remote_code=True)
tokenizer.add_special_tokens(
    {
        "eos_token": "</s>",
        "bos_token": "</s>",
        "unk_token": "</s>",
    })
tokenizer.pad_token = tokenizer.eos_token
When tokenizing a piece of text with an eos_token:
tokenizer(['ASSISTANT: Hello!</s>'])  # there is no space between ! and </s>
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 829, 29879, 29958]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
The eos_token </s> is encoded to 829, 29879, 29958, which means </s> is split into </, s and >.
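This can be confirmed by mapping the ids back to tokens (a quick check using the same tokenizer configured above):

```python
# Map the three ids back to their string pieces to confirm how '</s>' was split.
print(tokenizer.convert_ids_to_tokens([829, 29879, 29958]))  # expected: ['</', 's', '>']
print(tokenizer.eos_token_id)                                # 2, the id of the real '</s>' special token
```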
tokenizer(['ASSISTANT: Hello! </s>'])  # there is a space between ! and </s>
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 2]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1]]}
This time, </s> is encoded correctly (its token id is 2).
As described above, does this mean we should add a space between the text and the eos_token? However, I find that many popular projects like Alpaca concatenate text with the eos_token without a space.
I previously thought the tokenizer encoded text in a greedy style, so the eos_token would be encoded correctly with or without a space. However, the experiments above do not seem to support that.
Could anyone tell me whether I have misunderstood something? Thanks.
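For what it's worth, one way to sidestep the space sensitivity is to append the eos token at the id level instead of concatenating the string. A minimal sketch, assuming the same tokenizer configured above (just an illustration, not necessarily what Alpaca does):

```python
# Sketch: append the eos token as an id rather than as the string '</s>',
# so the result does not depend on how the string form gets split during encoding.
text = 'ASSISTANT: Hello!'
ids = tokenizer(text)['input_ids']    # e.g. [1, 319, 1799, 9047, 13566, 29901, 15043, 29991]
ids = ids + [tokenizer.eos_token_id]  # always ends with id 2
```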
After some other experiments, I found something weird:
tokenizer('我是谁')  # '我是谁' means 'who am I'
output:
'input_ids': [1, 29871, 30672, 30392, 235, 179, 132]
1 is the bos_token_id, and 29871 is the token id of the whitespace marker '▁'.
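This can be checked by converting the id back to its token (a quick check with the same tokenizer; the expected value below is an assumption based on the decoded outputs later in this issue):

```python
# Inspect which token sits behind id 29871 -- it should be the SentencePiece
# whitespace marker that gets prepended to the text.
print(tokenizer.convert_ids_to_tokens([29871]))  # expected: ['▁']
```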
tokenizer('我是谁</s>')
output:
'input_ids': [1, 29871, 30672, 30392, 235, 179, 132, 829, 29879, 29958]
tokenizer('who are you</s>')
output:
'input_ids': [1, 1058, 526, 366, 829, 29879, 29958] # there is no 29871.
When adding a space between 谁 and </s>:
tokenizer('我是谁 </s>')
output:
'input_ids': [1, 29871, 30672, 30392, 235, 179, 132, 2]  # the `</s>` is encoded correctly
When decoding [1, 29871, 30672, 30392, 235, 179, 132, 2]:
tokenizer.decode([1, 29871, 30672, 30392, 235, 179, 132, 2])
output:
'<s> 我是谁</s>'
The space is ignored!
When manually adding token id 29871:
tokenizer.decode([1, 29871, 30672, 30392, 235, 179, 132, 29871, 2])
output:
'<s> 我是谁 </s>'
This time, there is a space between 谁 and </s>.
Do these experiments mean that the encode and decode methods are not fully reversible, i.e. that decode(encode(x)) may not reproduce x?
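A minimal round-trip check that makes the asymmetry explicit (same tokenizer as above; the commented outputs are taken from the experiments in this issue):

```python
# Round-trip check: encode, decode, and compare against the original string.
text = '我是谁 </s>'
ids = tokenizer(text)['input_ids']  # [1, 29871, 30672, 30392, 235, 179, 132, 2]
back = tokenizer.decode(ids)        # '<s> 我是谁</s>' -- the space before </s> is lost
print(back == '<s> ' + text)        # False
```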
Expected behavior
Do the experiments above show a bug? If not, how should I understand this behavior? Thanks.