Files for SentenceTransformer support (using the 0.6B model as the example).
These will be pushed to the Hugging Face model repos.
Convert tokenizer:
from transformers import AutoTokenizer
import tokenizers

name_or_path = "TODO"
tok = AutoTokenizer.from_pretrained(name_or_path)

# Before: no EOS token is appended to the encoded sequence.
print(tok.tokenize('test 1, test 2'), tok('test 1, test 2'))

# Append <|endoftext|> (token id 151643) after every encoded sequence,
# both for single inputs ($A) and for pairs ($A $B).
template_processor = tokenizers.processors.TemplateProcessing(
    single="$A <|endoftext|>",
    pair="$A $B <|endoftext|>",
    special_tokens=[("<|endoftext|>", 151643)],
)

# Chain the new template after the tokenizer's existing post-processor
# so its original behavior is preserved.
tok.backend_tokenizer.post_processor = tokenizers.processors.Sequence([
    tok.backend_tokenizer.post_processor,
    template_processor,
])

# After: <|endoftext|> now appears at the end of the encoded sequence.
print(tok.tokenize('test 1, test 2'), tok('test 1, test 2'))

tok.save_pretrained(name_or_path + '-eos')
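As a self-contained illustration of what the TemplateProcessing step does, here is a minimal sketch using a toy word-level vocabulary with made-up token ids (the real conversion above operates on the model's own tokenizer, where <|endoftext|> has id 151643):

```python
from tokenizers import Tokenizer, processors
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary; ids are hypothetical and only for illustration.
vocab = {"test": 0, "1": 1, "2": 2, "<|endoftext|>": 3}
tok = Tokenizer(WordLevel(vocab, unk_token="<|endoftext|>"))
tok.pre_tokenizer = Whitespace()

# Same post-processor pattern as the conversion script: append the
# EOS token after every single sequence and after every pair.
tok.post_processor = processors.TemplateProcessing(
    single="$A <|endoftext|>",
    pair="$A $B <|endoftext|>",
    special_tokens=[("<|endoftext|>", 3)],
)

enc = tok.encode("test 1")
print(enc.ids)  # the EOS id (3) is appended at the end
```

Running this shows the EOS id trailing the encoded ids, which is exactly the property the saved `-eos` tokenizer gains.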