Fix bug in slow tokenizer conversion, make it a lot faster #24266
Conversation
Speed info: the new implementation takes 70 ms; the old one took 123 seconds for the …
Really nice fix and improvement - thanks for working on this ❤️
Logic all looks good to me. There's a test that's failing, but it's decorated with `@is_flaky`, so it shouldn't prevent CI from being green here. @ydshieh any insights into what might be happening?
The documentation is not available anymore as the PR was closed or merged.
LGTM, but running `RUN_SLOW=1 RUN_TOKENIZER_INTEGRATION=1 pytest tests/models/llama/test_tokenization_llama.py` (the code was changed for Llama specifically) crashes. This probably happens when sorting with `local = sorted(local, key=lambda x: (vocab[x[0]], vocab[[x[1]]]))`; left a suggestion.
Also, it seems that the new serialization differs from the old one; that's probably just the special tokens, which are no longer normalized by default.
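The crash the quoted sort key would produce can be reproduced in isolation: `vocab[[x[1]]]` indexes the dict with a one-element list, which is unhashable, so sorting raises a `TypeError`. A hypothetical reproduction (the `vocab` and `local` values here are made up for illustration):

```python
vocab = {"a": 0, "b": 1, "c": 2}   # assumed token -> id mapping
local = [("b", "c"), ("a", "b")]   # assumed merge pairs to sort by vocab id

try:
    # Buggy key from the review comment: vocab[[x[1]]] wraps the token in a
    # list, and lists are unhashable, so the dict lookup raises TypeError.
    sorted(local, key=lambda x: (vocab[x[0]], vocab[[x[1]]]))
except TypeError as e:
    print("crash:", e)

# Corrected key: index the vocab with the token string itself.
ordered = sorted(local, key=lambda x: (vocab[x[0]], vocab[x[1]]))
print(ordered)  # [('a', 'b'), ('b', 'c')]
```

Removing the extra brackets turns the key into a plain `(id, id)` tuple, which sorts the merge pairs by vocabulary order as intended.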
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Sorry for the weird error. I forgot to re-run the tests after the second commit.
Thanks a lot 🤗
What does this PR do?
The slow tokenizer conversion currently has a bug where merges with a score of 0 do not get used, due to an erroneous check: the check simply tested truthiness, but was actually meant to test for `None`. While fixing this, I noticed that the code was also slow, so I made it a lot faster.
Fixes #24233
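The truthiness bug described above can be sketched in a few lines (the function and variable names here are hypothetical, not the actual conversion code): a merge score of 0 is falsy, so `if score:` wrongly discards it, while an explicit `is not None` check keeps it and only skips genuinely missing scores.

```python
def usable_merges_buggy(merge_scores):
    """Buggy check: drops merges whose score is 0, because 0 is falsy."""
    return [merge for merge, score in merge_scores.items() if score]

def usable_merges_fixed(merge_scores):
    """Fixed check: keeps score-0 merges; only a missing score is skipped."""
    return [merge for merge, score in merge_scores.items() if score is not None]

scores = {("a", "b"): 0, ("b", "c"): -1.5, ("c", "d"): None}
print(usable_merges_buggy(scores))  # [('b', 'c')]  -- the score-0 merge is lost
print(usable_merges_fixed(scores))  # [('a', 'b'), ('b', 'c')]
```

This is why the symptom in #24233 only shows up for merges that happen to score exactly 0: every other score passes both checks identically.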
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker