
Pipeline inference with text pair is broken #17305

@fxmarty

Description

System Info

- `transformers` version: 4.20.0.dev0
- Platform: Linux-5.15.0-27-generic-x86_64-with-glibc2.35
- Python version: 3.9.12
- Huggingface_hub version: 0.4.0
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No

Who can help?

@Narsil

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction

Basically, the text-classification pipeline does not properly handle input pairs that must be separated by a [SEP] token.

For example, for GLUE's MNLI dataset, we have:

premise = 'The new rights are nice enough'
hypothesis = 'Everyone really likes the newest benefits '

Whether we pass

  • pipeline([[premise, hypothesis]], padding=True, truncation=True)
  • or pipeline(" ".join([premise, hypothesis]), padding=True, truncation=True)

the pipeline output is wrong.

Detailed reproduction

If necessary, install the dev version of transformers from source (pip uninstall transformers && git clone https://github.com/huggingface/transformers.git && cd transformers && pip install -e .).

Replace, in src/transformers/pipelines/text_classification.py,

    def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
        return_tensors = self.framework
        return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)

by

    def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
        return_tensors = self.framework
        tokenized_inps = self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)
        print("tokenized_inps", tokenized_inps)
        return tokenized_inps

so that we can see what the tokenized inputs look like inside the pipeline.
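
Alternatively, a sketch that avoids editing the installed package: wrap preprocess at runtime (run this after pipe is created in the script below; debug_preprocess is just an illustrative name):

# Sketch: wrap the pipeline's preprocess at runtime instead of editing the
# source. Run after `pipe` is created below.
orig_preprocess = pipe.preprocess

def debug_preprocess(inputs, **tokenizer_kwargs):
    tokenized_inps = orig_preprocess(inputs, **tokenizer_kwargs)
    print("tokenized_inps", tokenized_inps)
    return tokenized_inps

pipe.preprocess = debug_preprocess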

Then run

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

pipe = pipeline(task="text-classification", tokenizer=tokenizer, model=model)

raw_datasets = load_dataset("glue", "mnli")

txt1 = raw_datasets["validation_matched"][0]["premise"]
txt2 = raw_datasets["validation_matched"][0]["hypothesis"]

inputs = [txt1, txt2]

txt = " ".join(inputs)
res = pipe(txt, padding=True, truncation=True)
print(res)
"""Output:
tokenized_inps {'input_ids': tensor([[ 0, 133, 92, 659, 32, 2579, 615, 7632, 269, 3829, 5, 8946, 1795, 1437, 2]]), 'attention_mask': ...}
[{'label': 'NEUTRAL', 'score': 0.7983464002609253}]

NOTE: these input_ids correspond to:
'<s>The new rights are nice enough Everyone really likes the newest benefits </s>'
"""

We can see that separating the premise and hypothesis by a space is a bad idea, as there is no [SEP] token between the two.

Now run:

from transformers import BatchEncoding
data = raw_datasets["validation_matched"][0:1]
tokenized_inps = tokenizer(data["premise"], data["hypothesis"], padding=True, truncation=True)
tokenized_inps = BatchEncoding(tokenized_inps, tensor_type="pt")
print(tokenized_inps)
print(tokenizer.decode(tokenized_inps["input_ids"][0]))
"""Output:
{'input_ids': tensor([[ 0, 133, 92, 659, 32, 2579, 615, 2, 2, 11243, 269,  3829, 5, 8946, 1795, 1437,  2]]), 'attention_mask': ...}
<s>The new rights are nice enough</s></s>Everyone really likes the newest benefits </s>
"""

Here, the tokenizer receives text=premise and text_pair=hypothesis, and, as expected, we see SEP tokens between the two.

Another possibility with the pipeline:

txt1 = raw_datasets["validation_matched"][0]["premise"]
txt2 = raw_datasets["validation_matched"][0]["hypothesis"]

inputs = [txt1, txt2]
res = pipe([inputs], padding=True, truncation=True)
print(res)
"""Outputs:
tokenized_inps {'input_ids': tensor([[ 0, 133, 92, 659, 32, 2579, 615, 2, 1],
        [ 0, 11243, 269, 3829, 5, 8946, 1795, 1437, 2]]), 'attention_mask': ...}
[{'label': 'NEUTRAL', 'score': 0.8978187441825867}]

Note that input_ids now has two rows, one per text! The decoding gives:
<s>The new rights are nice enough</s><pad>
<s>Everyone really likes the newest benefits </s>
"""

There is a [CLS] token inserted in the middle; most likely this is not desirable. In fact, when we run the pipeline on several examples from the dataset, they are all classified as neutral, which is wrong.
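
As a quick check (a sketch reusing the tokenizer above), tokenizing the two texts as a plain batch reproduces exactly this wrong encoding:

batch = tokenizer([txt1, txt2], padding=True, truncation=True, return_tensors="pt")
for row in batch["input_ids"]:
    # Each row decodes to an independently encoded sequence, not a pair.
    print(tokenizer.decode(row))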

Hacky solution

Use

txt1 = raw_datasets["validation_matched"][0]["premise"]
txt2 = raw_datasets["validation_matched"][0]["hypothesis"]

inputs = [txt1, txt2]
tokenized_inps = pipe.preprocess([inputs])
res = pipe.forward(tokenized_inps)
res = pipe.postprocess(res)
print(res)
"""Output:
tokenized_inps {'input_ids': tensor([[ 0, 133, 92, 659, 32, 2579, 615, 2, 2, 11243, 269, 3829, 5, 8946, 1795, 1437, 2]]), 'attention_mask': ...}
{'label': 'NEUTRAL', 'score': 0.9636728167533875}
We get the right input_ids, and the score is the same as when manually using tokenizer + model, yay!
"""

which gives the same probability as using the tokenizer and model separately.
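
(For what it's worth, later transformers releases added support for passing a text pair as a dict to this pipeline; a sketch, assuming such a version rather than the 4.20.0.dev0 above:)

# Sketch, assuming a transformers release where the text-classification
# pipeline supports dict inputs (not yet the case in 4.20.0.dev0).
res = pipe({"text": txt1, "text_pair": txt2}, padding=True, truncation=True)
print(res)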

To me, the issue lies in two facts: the pipeline never passes a text_pair argument to the tokenizer, and the tokenizer treats a list of strings as a batch of separate texts rather than as a (text, text_pair) pair.

Expected behavior

The text-classification pipeline with a text pair should output the same result as manually using tokenizer + model + softmax.
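
For reference, a minimal sketch of that manual path, reusing the tokenizer and model loaded above:

import torch

# Manual reference: tokenize as a pair, run the model, apply softmax.
enc = tokenizer(txt1, txt2, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
# Softmax over the three MNLI labels.
probs = torch.softmax(logits, dim=-1)[0]
print({model.config.id2label[i]: p.item() for i, p in enumerate(probs)})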
