System Info
- `transformers` version: 4.34.0
- Platform: macOS-13.5-arm64-arm-64bit
- Python version: 3.10.12
- Huggingface_hub version: 0.17.3
- Safetensors version: 0.4.0
- Accelerate version: 0.20.3
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
In [1]: import transformers
In [2]: t0tt = transformers.AutoTokenizer.from_pretrained('bigscience/T0pp')
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
In [3]: t0tt.save_pretrained('saved-tokenizer')
Out[3]:
('saved-tokenizer/tokenizer_config.json',
'saved-tokenizer/special_tokens_map.json',
'saved-tokenizer/spiece.model',
'saved-tokenizer/added_tokens.json',
'saved-tokenizer/tokenizer.json')
In [4]: loaded_t0tt = transformers.AutoTokenizer.from_pretrained('saved-tokenizer')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
In [6]: t0tt._eos_token
Out[6]: AddedToken("</s>", rstrip=True, lstrip=True, single_word=False, normalized=True, special=True)
In [7]: loaded_t0tt._eos_token
Out[7]: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)
In [8]: t0tt.eos_token
Out[8]: '</s>'
In [9]: t0tt('hello </s> goodbye')
Out[9]: {'input_ids': [21820, 1, 23281, 1], 'attention_mask': [1, 1, 1, 1]}
In [10]: loaded_t0tt('hello </s> goodbye')
Out[10]: {'input_ids': [21820, 3, 1, 23281, 1], 'attention_mask': [1, 1, 1, 1, 1]}
Expected behavior
When a tokenizer is saved and then loaded back, it should
(1) behave the same on the same input, and
(2) preserve the configuration of each AddedToken (e.g. `rstrip`, `lstrip`, `normalized`).
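To make the discrepancy concrete: comparing the two `AddedToken` reprs above, the `rstrip`, `lstrip`, and `normalized` flags all flip from `True` to `False` after the save/load round trip, which presumably explains the extra token id in the reloaded tokenizer's output (whitespace around `</s>` is no longer stripped). Below is a minimal, self-contained sketch of that comparison using a plain dataclass as a stand-in for `AddedToken` (the dataclass and helper are illustrative, not the transformers API); the field values are taken directly from the session above.

```python
from dataclasses import dataclass, asdict


@dataclass
class AddedTokenState:
    """Illustrative stand-in mirroring the AddedToken fields shown above."""
    content: str
    rstrip: bool
    lstrip: bool
    single_word: bool
    normalized: bool
    special: bool


def roundtrip_diff(before: AddedTokenState, after: AddedTokenState) -> dict:
    """Return the fields whose values changed across a save/load cycle."""
    b, a = asdict(before), asdict(after)
    return {k: (b[k], a[k]) for k in b if b[k] != a[k]}


# Values copied from the IPython session: t0tt._eos_token vs loaded_t0tt._eos_token
original = AddedTokenState("</s>", rstrip=True, lstrip=True,
                           single_word=False, normalized=True, special=True)
reloaded = AddedTokenState("</s>", rstrip=False, lstrip=False,
                           single_word=False, normalized=False, special=True)

print(roundtrip_diff(original, reloaded))
# {'rstrip': (True, False), 'lstrip': (True, False), 'normalized': (True, False)}
```

A fix would presumably make `roundtrip_diff` come back empty, i.e. the serialized tokenizer files would carry these flags through unchanged.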