Description
System Info
transformers version: 4.28.1
Who can help?
@ArthurZucker
Hi, maybe the following issue should be asked here instead?
[Bug]? how does the tokenizer encode the special tokens? #1263
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Hi all, I used the tokenizer to process data for a LLaMA model (already converted to HF format) and set:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(llama_model_id, model_max_length=1024,
                                          padding_side='right', trust_remote_code=True)
tokenizer.add_special_tokens(
    {
        "eos_token": "</s>",
        "bos_token": "</s>",
        "unk_token": "</s>",
    })
tokenizer.pad_token = tokenizer.eos_token
When tokenizing a piece of text with an eos_token:
tokenizer(['ASSISTANT: Hello!</s>'])  # there is no space between ! and </s>
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 829, 29879, 29958]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
The eos_token </s> is encoded to 829, 29879, 29958, which means </s> is split into </, s and >.
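This can be confirmed by mapping the ids back to tokens (a quick check using the same tokenizer configured above):

```python
# Map the three ids back to their string pieces to confirm how '</s>' was split.
print(tokenizer.convert_ids_to_tokens([829, 29879, 29958]))  # expected: ['</', 's', '>']
print(tokenizer.eos_token_id)                                # 2, the id of the real '</s>' special token
```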
tokenizer(['ASSISTANT: Hello! </s>'])  # there is a space between ! and </s>
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 2]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1]]}
This time, </s> is encoded correctly (its token id is 2).
As described above, does this mean we should add a space between the text and the eos_token? However, I find that many popular projects like Alpaca concatenate text with the eos_token without a space.
I previously thought the tokenizer encoded text in a greedy style, so the eos_token would be encoded correctly with or without a space. However, the experiments above do not seem to support that.
Could anyone tell me whether I have misunderstood something? Thanks.
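For what it's worth, one way to sidestep the space sensitivity is to append the eos token at the id level instead of concatenating the string. A minimal sketch, assuming the same tokenizer configured above (just an illustration, not necessarily what Alpaca does):

```python
# Sketch: append the eos token as an id rather than as the string '</s>',
# so the result does not depend on how the string form gets split during encoding.
text = 'ASSISTANT: Hello!'
ids = tokenizer(text)['input_ids']    # e.g. [1, 319, 1799, 9047, 13566, 29901, 15043, 29991]
ids = ids + [tokenizer.eos_token_id]  # always ends with id 2
```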
After some other experiments, I found something weird:
tokenizer('我是谁')  # '我是谁' means 'who am I'
output:
'input_ids': [1, 29871, 30672, 30392, 235, 179, 132]
1 is the bos_token_id, and 29871 is the token id of the whitespace marker '▁'.
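This can be checked by converting the id back to its token (a quick check with the same tokenizer; the expected value below is an assumption based on the decoded outputs later in this issue):

```python
# Inspect which token sits behind id 29871 -- it should be the SentencePiece
# whitespace marker that gets prepended to the text.
print(tokenizer.convert_ids_to_tokens([29871]))  # expected: ['▁']
```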
tokenizer('我是谁</s>')
output:
'input_ids': [1, 29871, 30672, 30392, 235, 179, 132, 829, 29879, 29958]
tokenizer('who are you</s>')
output:
'input_ids': [1, 1058, 526, 366, 829, 29879, 29958] # there is no 29871.
When adding a space between 谁 and </s>:
tokenizer('我是谁 </s>')
output:
'input_ids': [1, 29871, 30672, 30392, 235, 179, 132, 2]  # the `</s>` is encoded correctly
When decoding [1, 29871, 30672, 30392, 235, 179, 132, 2]:
tokenizer.decode([1, 29871, 30672, 30392, 235, 179, 132, 2])
output:
'<s> 我是谁</s>'
The space is ignored!
When manually adding token id 29871:
tokenizer.decode([1, 29871, 30672, 30392, 235, 179, 132, 29871, 2])
output:
'<s> 我是谁 </s>'
This time, there is a space between 谁 and </s>.
Do these experiments mean that the encode and decode methods are not fully reversible, i.e. that decode(encode(x)) may not reproduce x?
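A minimal round-trip check that makes the asymmetry explicit (same tokenizer as above; the commented outputs are taken from the experiments in this issue):

```python
# Round-trip check: encode, decode, and compare against the original string.
text = '我是谁 </s>'
ids = tokenizer(text)['input_ids']  # [1, 29871, 30672, 30392, 235, 179, 132, 2]
back = tokenizer.decode(ids)        # '<s> 我是谁</s>' -- the space before </s> is lost
print(back == '<s> ' + text)        # False
```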
Expected behavior
Do the experiments above show a bug? If not, how should I understand this behavior? Thanks.