
Conversation

stephantul
Contributor

What does this PR do?

The slow tokenizer conversion currently has a bug where merges with a score of 0 do not get used because of an erroneous check. The check tested for truthiness when it was actually meant to test for None, so a score of 0 was treated as if it were missing. While fixing this, I noticed that the code was also slow, so I made it a lot faster.
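To illustrate the kind of check involved, here is a minimal sketch with toy data and hypothetical names (not the exact code in convert_slow_tokenizer.py):

```python
# Toy data: merge pairs with their scores; ("c", "d") has no score at all.
merges = {("a", "b"): 0.0, ("b", "c"): -1.5, ("c", "d"): None}

kept = []
for pair, score in merges.items():
    # Buggy pattern: `if not score: continue` would drop ("a", "b"), because 0.0 is falsy.
    # Intended pattern: only skip entries whose score is actually missing.
    if score is None:
        continue
    kept.append(pair)

print(kept)  # [('a', 'b'), ('b', 'c')]
```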

Fixes #24233

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker

@stephantul
Contributor Author

Speed info: for the openlm-research/open_llama_7b tokenizer mentioned in the issue, the new implementation takes 70 ms, while the old one took 123 seconds.
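For anyone who wants to reproduce the timing, a rough sketch (the model name is taken from the linked issue; exact numbers will of course vary by machine and transformers version):

```python
import time

from transformers import LlamaTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

# Load the slow (SentencePiece-based) tokenizer from the linked issue.
slow = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")

# Time the slow-to-fast conversion, which is roughly where the extractor runs.
start = time.perf_counter()
fast_backend = convert_slow_tokenizer(slow)
print(f"conversion took {time.perf_counter() - start:.3f} s")
```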

Collaborator

@amyeroberts left a comment


Really nice fix and improvement - thanks for working on this ❤️

Logic all looks good to me. There's a test that's failing, but it's decorated with @is_flaky so it shouldn't be preventing CI from being green here. @ydshieh any insights into what might be happening?

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jun 14, 2023

The documentation is not available anymore as the PR was closed or merged.

@ydshieh
Collaborator

ydshieh commented Jun 14, 2023

Really nice fix and improvement - thanks for working on this ❤️

Logic all looks good to me. There's a test that's failing, but it's decorated with @is_flaky so it shouldn't be preventing CI from being green here. @ydshieh any insights into what might be happening?

@amyeroberts

is_flaky() won't keep the test green 100% of the time. It just re-runs the test a few times (5 by default) 😿. The failure is still expected, just less frequent.
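For anyone not familiar with it, roughly how the decorator is used in a test file (illustrative only, not the exact failing test):

```python
import unittest

from transformers.testing_utils import is_flaky


class ExampleTest(unittest.TestCase):
    # Re-runs the test up to max_attempts times; it only fails if every attempt fails.
    @is_flaky(max_attempts=5)
    def test_sometimes_nondeterministic(self):
        ...
```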

Collaborator

@ArthurZucker left a comment


LGTM, but running RUN_SLOW=1 RUN_TOKENIZER_INTEGRATION=1 pytest tests/models/llama/test_tokenization_llama.py (the code was changed for Llama specifically) crashes.
This probably happens when sorting with local = sorted(local, key=lambda x: (vocab[x[0]], vocab[[x[1]]])); I left a suggestion.
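For context, vocab[[x[1]]] indexes the dict with a list, which is unhashable and raises a TypeError. Something along these lines is presumably what the sort should look like (just a sketch, not necessarily the exact suggestion that was left):

```python
# vocab maps piece string -> id; each element of `local` is a (piece_l, piece_r, score) tuple.
# Sort by the ids of both pieces; note the single brackets around x[1].
local = sorted(local, key=lambda x: (vocab[x[0]], vocab[x[1]]))
```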

Also, it seems that the new serialization differs from the old one; that's probably just the special tokens, which are no longer normalized by default.

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@stephantul
Contributor Author

stephantul commented Jun 14, 2023

Sorry for the weird error. I forgot to re-run the tests after the second commit.

Collaborator

@ArthurZucker left a comment


Thanks a lot 🤗

@amyeroberts merged commit 6793f0c into huggingface:main Jun 15, 2023
Successfully merging this pull request may close these issues.

Auto-Converted Fast Tokenizer Producing Incorrect Results