
Conversation

stephantul
Contributor

What does this PR do?

The slow tokenizer conversion currently has a bug where merges with a score of 0 do not get used because of an erroneous check. The check tested for truthiness when it was actually meant to test for None, so a score of 0 was treated as if it were missing. While fixing this, I noticed that the code was also slow, so I made it a lot faster.
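To illustrate the kind of check involved, here is a minimal sketch with toy data and hypothetical names (not the exact code in convert_slow_tokenizer.py):

```python
# Toy data: merge pairs with their scores; ("c", "d") has no score at all.
merges = {("a", "b"): 0.0, ("b", "c"): -1.5, ("c", "d"): None}

kept = []
for pair, score in merges.items():
    # Buggy pattern: `if not score: continue` would drop ("a", "b"), because 0.0 is falsy.
    # Intended pattern: only skip entries whose score is actually missing.
    if score is None:
        continue
    kept.append(pair)

print(kept)  # [('a', 'b'), ('b', 'c')]
```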

Fixes #24233

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker

@stephantul
Contributor Author

Speed info: for the openlm-research/open_llama_7b tokenizer mentioned in the issue, the new implementation takes 70 ms, while the old one took 123 seconds.
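For anyone who wants to reproduce the timing, a rough sketch (the model name is taken from the linked issue; exact numbers will of course vary by machine and transformers version):

```python
import time

from transformers import LlamaTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

# Load the slow (SentencePiece-based) tokenizer from the linked issue.
slow = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")

# Time the slow-to-fast conversion, which is roughly where the extractor runs.
start = time.perf_counter()
fast_backend = convert_slow_tokenizer(slow)
print(f"conversion took {time.perf_counter() - start:.3f} s")
```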

Collaborator

@amyeroberts left a comment


Really nice fix and improvement - thanks for working on this ❤️

Logic all looks good to me. There's a test that's failing, but it's decorated with @is_flaky so it shouldn't be preventing CI from being green here. @ydshieh any insights into what might be happening?

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jun 14, 2023

The documentation is not available anymore as the PR was closed or merged.

@ydshieh
Collaborator

ydshieh commented Jun 14, 2023

Really nice fix and improvement - thanks for working on this ❤️

Logic all looks good to me. There's a test that's failing, but it's decorated with @is_flaky so it shouldn't be preventing CI from being green here. @ydshieh any insights into what might be happening?

@amyeroberts

is_flaky() won't keep the test green 100% of the time. It just re-runs the test a few times (5 by default) 😿. The failure is still expected, just less frequent.
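For anyone not familiar with it, roughly how the decorator is used in a test file (illustrative only, not the exact failing test):

```python
import unittest

from transformers.testing_utils import is_flaky


class ExampleTest(unittest.TestCase):
    # Re-runs the test up to max_attempts times; it only fails if every attempt fails.
    @is_flaky(max_attempts=5)
    def test_sometimes_nondeterministic(self):
        ...
```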

Collaborator

@ArthurZucker left a comment


LGTM, but running RUN_SLOW=1 RUN_TOKENIZER_INTEGRATION=1 pytest tests/models/llama/test_tokenization_llama.py (the code was changed for Llama specifically) crashes.
This probably happens when sorting with local = sorted(local, key=lambda x: (vocab[x[0]], vocab[[x[1]]])); I left a suggestion.
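For context, vocab[[x[1]]] indexes the dict with a list, which is unhashable and raises a TypeError. Something along these lines is presumably what the sort should look like (just a sketch, not necessarily the exact suggestion that was left):

```python
# vocab maps piece string -> id; each element of `local` is a (piece_l, piece_r, score) tuple.
# Sort by the ids of both pieces; note the single brackets around x[1].
local = sorted(local, key=lambda x: (vocab[x[0]], vocab[x[1]]))
```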

Also, it seems that the new serialization differs from the old one; that's probably just the special tokens, which are no longer normalized by default.

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@stephantul
Contributor Author

stephantul commented Jun 14, 2023

Sorry for the weird error. I forgot to re-run the tests after the second commit.

Collaborator

@ArthurZucker left a comment


Thanks a lot 🤗

@amyeroberts merged commit 6793f0c into huggingface:main Jun 15, 2023
Successfully merging this pull request may close these issues.

Auto-Converted Fast Tokenizer Producing Incorrect Results