-
Notifications
You must be signed in to change notification settings - Fork 30.3k
fix: condition bos_token_id and space as token #36211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
cc @ArthurZucker @itazap for tokenizers! |
gentle ping @ArthurZucker @itazap |
Thanks for the catch and fix! 🤗 I took a look at the original motivation for 'token healing' (#30081) and found that there might be a different root cause to the error. For models like space_tok = tokenizer.convert_ids_to_tokens(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" ")))[0] vs. the current: space_tok = tokenizer.convert_ids_to_tokens(tokenizer.convert_tokens_to_ids(" "))[0] cc @ArthurZucker lmk what you think 😊 |
Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
cc @gante for generation |
+1 to Ita's comment -- tokenizing the space to get the space token seems more intelligent and robust The other change, regarding |
# tail tokens are used for a prefix search, thus, whitespaces are replaced with | ||
# their tokenization (e.g. 'Ġ') to enable search for tokens prefixed with a whitespace | ||
tail_toks = (tokenizer.decode(t).replace(" ", space_tok) for t in tail_ids) | ||
space_tok_id = tokenizer.convert_tokens_to_ids(" ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space_tok_id = tokenizer.convert_tokens_to_ids(" ") | |
space_tok_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" ")) |
Great! Last thing would be great to add a test for this, making sure that we get some output instead of an error, let me know if you'd like to add it or I can help 🤗 @desaxce |
Will take care of it before end of week. |
Appreciate it! Thank you! 😊 |
@itazap Sorry for the wrong timing, had a difficult end of week and I kept busy over the weekend. |
@desaxce no worries, thanks for the update! |
Fixes # (#36210).
Changes:
In the
heal_tokens
function, we now:bos_token_id
Just making the code not raise exceptions, there may be something more clever to do.
To test:
Use token healing on https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct: