Description
System Info
- `transformers` version: 4.20.0.dev0
- Platform: Linux-5.15.0-27-generic-x86_64-with-glibc2.35
- Python version: 3.9.12
- Huggingface_hub version: 0.4.0
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Basically, the text-classification pipeline does not correctly handle input pairs that must be separated by a [SEP] token.
For example, for glue's mnli dataset, we have:
premise = 'The new rights are nice enough'
hypothesis = 'Everyone really likes the newest benefits '
Whether we pass
pipeline([[premise, hypothesis]], padding=True, truncation=True)
or
pipeline(" ".join([premise, hypothesis]), padding=True, truncation=True)
the pipeline output is wrong.
Detailed reproduction
If necessary, install transformers from source in the dev version (pip uninstall transformers && git clone https://github.com/huggingface/transformers.git && cd transformers && pip install -e .).
Replace
transformers/src/transformers/pipelines/text_classification.py
Lines 132 to 134 in 1f13ba8
def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
    return_tensors = self.framework
    return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)
by
def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
    return_tensors = self.framework
    tokenized_inps = self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)
    print("tokenized_inps", tokenized_inps)
    return tokenized_inps
so that we can see the tokenized inputs used inside the pipeline.
Then run
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
from datasets import load_dataset
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
pipe = pipeline(task="text-classification", tokenizer=tokenizer, model=model)
raw_datasets = load_dataset("glue", "mnli")
txt1 = raw_datasets["validation_matched"][0]["premise"]
txt2 = raw_datasets["validation_matched"][0]["hypothesis"]
inputs = [txt1, txt2]
txt = " ".join(inputs)
res = pipe(txt, padding=True, truncation=True)
print(res)
"""Output:
tokenized_inps {'input_ids': tensor([[ 0, 133, 92, 659, 32, 2579, 615, 7632, 269, 3829, 5, 8946, 1795, 1437, 2]]), 'attention_mask': ...}
[{'label': 'NEUTRAL', 'score': 0.7983464002609253}]
NOTE: these input_ids correspond to:
'<s>The new rights are nice enough Everyone really likes the newest benefits </s>'
"""
We can see that separating the premise and hypothesis with a space is a very bad idea, as there is no [SEP] token between the two.
Now run:
from transformers import BatchEncoding
data = raw_datasets["validation_matched"][0:1]
tokenized_inps = tokenizer(data["premise"], data["hypothesis"], padding=True, truncation=True)
tokenized_inps = BatchEncoding(tokenized_inps, tensor_type="pt")
print(tokenized_inps)
print(tokenizer.decode(tokenized_inps["input_ids"][0]))
"""Output:
{'input_ids': tensor([[ 0, 133, 92, 659, 32, 2579, 615, 2, 2, 11243, 269, 3829, 5, 8946, 1795, 1437, 2]]), 'attention_mask': ...}
<s>The new rights are nice enough</s></s>Everyone really likes the newest benefits </s>
"""
Here, the tokenizer takes a text=premise and a text_pair=hypothesis, and, as expected, we see SEP tokens between the two.
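For completeness, here is a minimal sketch of the rest of the manual path (model forward pass + softmax), reusing the model and tokenized_inps defined just above; this is the reference result the pipeline would be expected to reproduce:

import torch

# Run the pair-encoded inputs through the model and turn the logits into probabilities.
with torch.no_grad():
    logits = model(**tokenized_inps).logits
probs = torch.softmax(logits, dim=-1)
pred_id = probs.argmax(dim=-1).item()
print({"label": model.config.id2label[pred_id], "score": probs[0, pred_id].item()})

The score obtained this way should match the one reported in the hacky solution below.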
Another possibility with the pipeline:
txt1 = raw_datasets["validation_matched"][0]["premise"]
txt2 = raw_datasets["validation_matched"][0]["hypothesis"]
inputs = [txt1, txt2]
res = pipe([inputs], padding=True, truncation=True)
print(res)
"""Outputs:
tokenized_inps {'input_ids': tensor([[ 0, 133, 92, 659, 32, 2579, 615, 2, 1],
[ 0, 11243, 269, 3829, 5, 8946, 1795, 1437, 2]]), 'attention_mask': ...}
[{'label': 'NEUTRAL', 'score': 0.8978187441825867}]
Note that now input_ids is 2D! The decoding gives:
<s>The new rights are nice enough</s><pad>
<s>Everyone really likes the newest benefits </s>
"""
There is a [CLS] token inserted in the middle because the two texts are encoded as two separate sequences; most likely this is not desirable. In fact, when we run the pipeline on several examples from the dataset, all are classified as neutral, which is wrong.
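To illustrate where this 2D encoding comes from (a small sketch reusing the tokenizer and txt1/txt2 defined above): a list of two strings is interpreted as a batch of two independent texts, whereas passing them as text and text_pair produces a single pair encoding.

as_batch = tokenizer([txt1, txt2])  # treated as a batch of two independent texts -> two rows of input_ids
as_pair = tokenizer(txt1, txt2)     # treated as one premise/hypothesis pair -> one row with separator tokens
print(tokenizer.batch_decode(as_batch["input_ids"]))
print(tokenizer.decode(as_pair["input_ids"]))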
Hacky solution
Use
txt1 = raw_datasets["validation_matched"][0]["premise"]
txt2 = raw_datasets["validation_matched"][0]["hypothesis"]
inputs = [txt1, txt2]
tokenized_inps = pipe.preprocess([inputs])
res = pipe.forward(tokenized_inps)
res = pipe.postprocess(res)
print(res)
"""Output:
tokenized_inps {'input_ids': tensor([[ 0, 133, 92, 659, 32, 2579, 615, 2, 2, 11243, 269, 3829, 5, 8946, 1795, 1437, 2]]), 'attention_mask': ...}
{'label': 'NEUTRAL', 'score': 0.9636728167533875}
We get the right input_ids, and the score is the same as with manually using tokenizer + model, yay!
"""
which gives the same probability as using the tokenizer and model separately.
To me, the issue lies in two facts:
- It is very wrong to join two sentences with a space (as suggested in the doc https://huggingface.co/tasks/text-classification), since we lose the information that they are different sentences.
- In case we pass the data as pipeline([[premise, hypothesis]]), it could be that there is some funny stuff happening in item = next(self.iterator) (a sketch of one possible adjustment is shown below).
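For what it's worth, here is a purely hypothetical sketch (not existing transformers code) of how preprocess in text_classification.py could route a two-string input to the tokenizer's text/text_pair arguments; the isinstance check is my own assumption about how a pair could be detected, and Dict/GenericTensor are the types already imported in that module:

def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
    return_tensors = self.framework
    # Hypothetical: if a single input is a pair of strings, encode it as text/text_pair
    # so the tokenizer inserts the separator tokens between premise and hypothesis.
    if isinstance(inputs, (list, tuple)) and len(inputs) == 2 and all(isinstance(t, str) for t in inputs):
        return self.tokenizer(inputs[0], text_pair=inputs[1], return_tensors=return_tensors, **tokenizer_kwargs)
    return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)

With such a change, pipeline([[premise, hypothesis]]) would produce the same pair encoding as the hacky solution above, while plain-string inputs would keep the current behavior.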
Expected behavior
The text-classification pipeline with a text pair should output the same result as manually using tokenizer + model + softmax.