Conversation

@desaxce commented Feb 15, 2025

Fixes #36210.

Changes:

In the heal_tokens function, we now:

  • allow tokenizers that don't have a bos_token_id
  • skip the whitespace replacement when the tokenizer doesn't define the space character as a token

This just keeps the code from raising exceptions; there may be something more clever to do (see the sketch below).
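A rough sketch of the two guards in isolation; the helper names are invented for illustration, and this is not the actual diff applied to heal_tokens:

def safe_bos_token_id(tokenizer):
    # Sketch only: tokenizers without a bos_token_id (e.g. Qwen) simply yield None
    # here instead of letting downstream code fail on the missing id.
    return getattr(tokenizer, "bos_token_id", None)

def safe_space_token(tokenizer):
    # Sketch only: if the vocabulary has no token for a bare " ", return None so the
    # caller can skip the whitespace replacement instead of raising.
    space_id = tokenizer.convert_tokens_to_ids(" ")
    if space_id is None or space_id == getattr(tokenizer, "unk_token_id", None):
        return None
    return tokenizer.convert_ids_to_tokens(space_id)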

To test:

Use token healing on https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct:

from transformers import Qwen2ForCausalLM, Qwen2Tokenizer

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
model = Qwen2ForCausalLM.from_pretrained(model_id)
tokenizer = Qwen2Tokenizer.from_pretrained(model_id)

# The prompt ends with "import " so the last token is a candidate for healing
prompt = "Complete the following Lean 4 code:\n\n```lean4\nimport "
inputs = tokenizer(prompt, return_tensors="pt")

# Before the fix this raised, because the tokenizer defines no bos_token_id
# and has no bare " " token in its vocabulary
generate_ids = model.generate(inputs.input_ids, tokenizer=tokenizer, max_new_tokens=1, token_healing=True)
tokenizer.batch_decode(generate_ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)[0]

@Rocketknight1 (Member)

cc @ArthurZucker @itazap for tokenizers!

@Rocketknight1 (Member)

gentle ping @ArthurZucker @itazap

@itazap (Collaborator) commented Apr 3, 2025

Thanks for the catch and fix! 🤗 I took a look at the original motivation for 'token healing' (#30081) and found that there might be a different root cause of the error. For models like Qwen, the space token is actually "Ġ" instead of " ", which could be handled with:

space_tok = tokenizer.convert_ids_to_tokens(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" ")))[0]

vs. the current:

space_tok = tokenizer.convert_ids_to_tokens(tokenizer.convert_tokens_to_ids(" "))[0]

cc @ArthurZucker lmk what you think 😊
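For reference, a quick check along these lines shows the difference on the Qwen tokenizer (it is downloaded from the hub; the exact ids printed depend on the checkpoint):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# A bare " " is not in the byte-level BPE vocabulary, so the direct lookup
# does not yield a usable space token.
print(tok.convert_tokens_to_ids(" "))

# Tokenizing the space first produces the byte-level space token "Ġ",
# which does have an id.
print(tok.tokenize(" "))
print(tok.convert_tokens_to_ids(tok.tokenize(" ")))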

desaxce and others added 2 commits April 6, 2025 18:14
@itazap (Collaborator) commented Apr 15, 2025

cc @gante for generation

@gante (Member) commented Apr 18, 2025

+1 to Ita's comment -- tokenizing the space to get the space token seems more intelligent and robust

The other change, regarding bos_token_id, looks good to me :)

# tail tokens are used for a prefix search, thus, whitespaces are replaced with
# their tokenization (e.g. 'Ġ') to enable search for tokens prefixed with a whitespace
tail_toks = (tokenizer.decode(t).replace(" ", space_tok) for t in tail_ids)
space_tok_id = tokenizer.convert_tokens_to_ids(" ")

Suggested change:
- space_tok_id = tokenizer.convert_tokens_to_ids(" ")
+ space_tok_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" "))
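As a side note, the two calls also differ in return type (assuming the same Qwen tokenizer loaded as tokenizer), which the surrounding code would need to account for:

tokenizer.convert_tokens_to_ids(" ")                      # single id or None; a bare " " is not in Qwen's vocab
tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" "))  # list containing the id of the byte-level space token "Ġ"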

@itazap (Collaborator) commented Apr 21, 2025

Great! One last thing: it would be great to add a test for this, making sure that we get some output instead of an error. Let me know if you'd like to add it, or I can help 🤗 @desaxce
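For reference, a minimal sketch of what such a test could look like; the test name, the choice of the full-size Qwen checkpoint, and the placement in the suite are assumptions rather than part of this PR:

from transformers import AutoModelForCausalLM, AutoTokenizer

def test_token_healing_without_bos_or_space_token():
    # Qwen tokenizers have neither a bos_token_id nor a bare " " token; in CI a
    # smaller checkpoint would likely be preferable to this 7B model.
    model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer("import ", return_tensors="pt")
    # The point of the test: generation with token healing should return output
    # instead of raising.
    out = model.generate(inputs.input_ids, tokenizer=tokenizer, max_new_tokens=1, token_healing=True)
    assert out.shape[-1] >= inputs.input_ids.shape[-1]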

@desaxce (Author) commented Apr 21, 2025

> Great! One last thing: it would be great to add a test for this, making sure that we get some output instead of an error. Let me know if you'd like to add it, or I can help 🤗 @desaxce

Will take care of it before end of week.

@itazap (Collaborator) commented Apr 22, 2025

Appreciate it! Thank you! 😊

@desaxce (Author) commented Apr 27, 2025

@itazap Sorry for the delay, I had a difficult end of the week and was kept busy over the weekend.
Let's plan for a Wednesday finish line so we start May ready to merge :)

@itazap (Collaborator) commented Apr 28, 2025

@desaxce no worries, thanks for the update!
