@kylehowells kylehowells commented Jun 3, 2025

Some tokenizers, like Qwen2.5's, don't include a bos_token_id.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
print(tokenizer.bos_token_id)
# None

This means that using the model with the perplexity metric fails.

import evaluate
perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = ["Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence."]
results = perplexity.compute(model_id='Qwen/Qwen2.5-0.5B', predictions=input_texts)
print(results)

Running it results in this error, caused by the tokenizer not having a bos_token_id:

(venv) $ python bug.py
  0%|                                                                                                     | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/.../PerplexityScoring/bug.py", line 4, in <module>
    results = perplexity.compute(model_id='Qwen/Qwen2.5-0.5B', predictions=input_texts)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../venv/lib/python3.11/site-packages/evaluate/module.py", line 467, in compute
    output = self._compute(**inputs, **compute_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kylehowells/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--perplexity/8ab643ad86f568b7d1d5f7822373fa7401ff5ff0297ccf114b0ca6a33be96bc0/perplexity.py", line 170, in _compute
    bos_tokens_tensor = torch.tensor([[tokenizer.bos_token_id]] * encoded_batch.size(dim=0)).to(device)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Could not infer dtype of NoneType

A similar PR was merged into the transformers library itself back in February to handle this situation: fix: condition bos_token_id and space as token #36211

This PR adds a not-None check to the block that prepends the bos_token_id.
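A minimal sketch of the guarded prepend logic, assuming a simplified list-of-lists batch rather than the actual torch tensors in perplexity.py (the helper name and signature here are hypothetical, not the metric's real code):

```python
def prepend_bos(encoded_batch, bos_token_id, add_start_token=True):
    """Prepend a BOS token to each encoded sequence, but only when the
    tokenizer actually defines one (Qwen2.5's bos_token_id is None)."""
    if add_start_token and bos_token_id is not None:
        return [[bos_token_id] + seq for seq in encoded_batch]
    # No BOS token available: score the sequences as-is instead of
    # building a tensor from None (which raises the RuntimeError above).
    return encoded_batch
```

The key point is that the check happens before the tensor is constructed, so `torch.tensor([[None]] * batch_size)` is never reached.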

@kylehowells kylehowells changed the title Update Perplexity For Tokenizers without bos_token_id Fix Perplexity Score For Tokenizers without bos_token_id Jun 3, 2025
@chrisjbryant

Fwiw, I'm running into the same problem and haven't yet worked out how to get a perplexity score when the input is a single token.
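For context on why a single token is problematic: without a BOS token there is no preceding context for the model to condition on, so a one-token input yields no scored positions. Once at least one token is scored, perplexity is just the exponential of the mean negative log-probability. A hedged sketch of that final step (plain Python, not the metric's internals):

```python
import math

def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(mean negative log-probability) over the
    scored tokens. Raises if nothing was scored, which is the
    single-token, no-BOS situation described above."""
    if not token_logprobs:
        raise ValueError("need at least one scored token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```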

@lhoestq lhoestq merged commit f05792f into huggingface:main Jun 20, 2025
2 of 6 checks passed
@lhoestq (Member) commented Jun 20, 2025

good catch ! thanks for the fix :)

@kylehowells kylehowells deleted the perplexity-support-tokenisers-without-bos_token_id branch June 20, 2025 16:27