Description
System Info
- `transformers` version: 4.20.0.dev0
- Platform: Linux-5.15.0-27-generic-x86_64-with-glibc2.35
- Python version: 3.9.12
- Huggingface_hub version: 0.4.0
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Basically, the text-classification pipeline does not correctly handle input pairs that must be separated by a [SEP] token.
For example, for glue's mnli dataset, we have:
premise = 'The new rights are nice enough'
hypothesis = 'Everyone really likes the newest benefits '
Whether we pass
pipeline([[premise, hypothesis]], padding=True, truncation=True)
or
pipeline(" ".join([premise, hypothesis]), padding=True, truncation=True)
the pipeline output is wrong.
Detailed reproduction
If necessary, install transformers from source in the dev version (pip uninstall transformers && git clone https://github.com/huggingface/transformers.git && cd transformers && pip install -e .).
Replace
transformers/src/transformers/pipelines/text_classification.py
Lines 132 to 134 in 1f13ba8
def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
    return_tensors = self.framework
    return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)
by
def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
    return_tensors = self.framework
    tokenized_inps = self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)
    print("tokenized_inps", tokenized_inps)
    return tokenized_inps
so that we can see the tokenized inputs used inside the pipeline.
Then run
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
from datasets import load_dataset
tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
pipe = pipeline(task="text-classification", tokenizer=tokenizer, model=model)
raw_datasets = load_dataset("glue", "mnli")
txt1 = raw_datasets["validation_matched"][0]["premise"]
txt2 = raw_datasets["validation_matched"][0]["hypothesis"]
inputs = [txt1, txt2]
txt = " ".join(inputs)
res = pipe(txt, padding=True, truncation=True)
print(res)
"""Output:
tokenized_inps {'input_ids': tensor([[ 0, 133, 92, 659, 32, 2579, 615, 7632, 269, 3829, 5, 8946, 1795, 1437, 2]]), 'attention_mask': ...}
[{'label': 'NEUTRAL', 'score': 0.7983464002609253}]
NOTE: these input_ids correspond to:
'<s>The new rights are nice enough Everyone really likes the newest benefits </s>'
"""
We can see that separating the premise and hypothesis with a space is a very bad idea, as there is no [SEP] token between the two.
Now run:
from transformers import BatchEncoding
data = raw_datasets["validation_matched"][0:1]
tokenized_inps = tokenizer(data["premise"], data["hypothesis"], padding=True, truncation=True)
tokenized_inps = BatchEncoding(tokenized_inps, tensor_type="pt")
print(tokenized_inps)
print(tokenizer.decode(tokenized_inps["input_ids"][0]))
"""Output:
{'input_ids': tensor([[ 0, 133, 92, 659, 32, 2579, 615, 2, 2, 11243, 269, 3829, 5, 8946, 1795, 1437, 2]]), 'attention_mask': ...}
<s>The new rights are nice enough</s></s>Everyone really likes the newest benefits </s>
"""
Here, the tokenizer takes a text=premise and a text_pair=hypothesis, and, as expected, we see SEP tokens between the two.
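For completeness, here is a minimal sketch of the rest of the manual path (model forward pass + softmax), reusing the model and tokenized_inps defined just above; this is the reference result the pipeline would be expected to reproduce:

import torch

# Run the pair-encoded inputs through the model and turn the logits into probabilities.
with torch.no_grad():
    logits = model(**tokenized_inps).logits
probs = torch.softmax(logits, dim=-1)
pred_id = probs.argmax(dim=-1).item()
print({"label": model.config.id2label[pred_id], "score": probs[0, pred_id].item()})

The score obtained this way should match the one reported in the hacky solution below.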
Another possibility with the pipeline:
txt1 = raw_datasets["validation_matched"][0]["premise"]
txt2 = raw_datasets["validation_matched"][0]["hypothesis"]
inputs = [txt1, txt2]
res = pipe([inputs], padding=True, truncation=True)
print(res)
"""Outputs:
tokenized_inps {'input_ids': tensor([[ 0, 133, 92, 659, 32, 2579, 615, 2, 1],
[ 0, 11243, 269, 3829, 5, 8946, 1795, 1437, 2]]), 'attention_mask': ...}
[{'label': 'NEUTRAL', 'score': 0.8978187441825867}]
Note that now input_ids is 2D! The decoding gives:
<s>The new rights are nice enough</s><pad>
<s>Everyone really likes the newest benefits </s>
"""
There is a [CLS] token inserted in the middle because the two texts are encoded as two separate sequences; most likely this is not desirable. In fact, when we run the pipeline on several examples from the dataset, all are classified as neutral, which is wrong.
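To illustrate where this 2D encoding comes from (a small sketch reusing the tokenizer and txt1/txt2 defined above): a list of two strings is interpreted as a batch of two independent texts, whereas passing them as text and text_pair produces a single pair encoding.

as_batch = tokenizer([txt1, txt2])  # treated as a batch of two independent texts -> two rows of input_ids
as_pair = tokenizer(txt1, txt2)     # treated as one premise/hypothesis pair -> one row with separator tokens
print(tokenizer.batch_decode(as_batch["input_ids"]))
print(tokenizer.decode(as_pair["input_ids"]))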
Hacky solution
Use
txt1 = raw_datasets["validation_matched"][0]["premise"]
txt2 = raw_datasets["validation_matched"][0]["hypothesis"]
inputs = [txt1, txt2]
tokenized_inps = pipe.preprocess([inputs])
res = pipe.forward(tokenized_inps)
res = pipe.postprocess(res)
print(res)
"""Output:
tokenized_inps {'input_ids': tensor([[ 0, 133, 92, 659, 32, 2579, 615, 2, 2, 11243, 269, 3829, 5, 8946, 1795, 1437, 2]]), 'attention_mask': ...}
{'label': 'NEUTRAL', 'score': 0.9636728167533875}
We get the right input_ids, and the score is the same as with manually using tokenizer + model, yay!
"""
which gives the same probability as using the tokenizer and model separately.
To me, the issue lies in two facts:
- It is very wrong to join two sentences with a space (as suggested in the doc https://huggingface.co/tasks/text-classification), since we lose the information that they are different sentences.
- In case we pass the data as pipeline([[premise, hypothesis]]), it could be that there is some funny stuff happening in item = next(self.iterator) (a sketch of one possible adjustment is shown below).
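For what it's worth, here is a purely hypothetical sketch (not existing transformers code) of how preprocess in text_classification.py could route a two-string input to the tokenizer's text/text_pair arguments; the isinstance check is my own assumption about how a pair could be detected, and Dict/GenericTensor are the types already imported in that module:

def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
    return_tensors = self.framework
    # Hypothetical: if a single input is a pair of strings, encode it as text/text_pair
    # so the tokenizer inserts the separator tokens between premise and hypothesis.
    if isinstance(inputs, (list, tuple)) and len(inputs) == 2 and all(isinstance(t, str) for t in inputs):
        return self.tokenizer(inputs[0], text_pair=inputs[1], return_tensors=return_tensors, **tokenizer_kwargs)
    return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)

With such a change, pipeline([[premise, hypothesis]]) would produce the same pair encoding as the hacky solution above, while plain-string inputs would keep the current behavior.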
Expected behavior
The text-classification pipeline with a text pair should output the same result as manually using tokenizer + model + softmax.