Batch decoding of LMs produces different outputs with different batch sizes #25921

@wenhuchen

Description

System Info

Transformers=4.31
Torch=2.0.1
Cuda=11.8
Python=3.10

A100 GPU 80GB

Who can help?

@ArthurZucker, @younesbelkada, @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Running the following example produces different outputs for the same prompts when the batch size changes.

from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers import GenerationConfig
import torch

if __name__ == '__main__':
    name = 'yahma/llama-7b-hf'
    # Left padding so that, for a decoder-only model, generation continues
    # directly from the last prompt token of every sequence in the batch.
    tokenizer = LlamaTokenizer.from_pretrained(
        name,
        padding_side="left",
        trust_remote_code=True)
    # LLaMA ships without a pad token; fall back to token id 0 so batching works.
    tokenizer.pad_token_id = 0 if tokenizer.pad_token_id is None else tokenizer.pad_token_id

    model = LlamaForCausalLM.from_pretrained(
        name, 
        device_map="auto", 
        torch_dtype=torch.bfloat16,
        trust_remote_code=True)

    question = [
        'Can you explain to me what is the concept of deep learning and how it can be applied to NLP?',
        'Where am I supposed to eat dinner',
        'How hard is it to find a doctor in Canada',
        'What is the best price of vegetables',
        'How can somehow be so mean',
        'How can we get the severance pay',
        'What type of president is this?',
        'How is the weather today?'
    ]

    batch = tokenizer(
        question,
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        output_ids = model.generate(
            batch.input_ids.to(model.device),
            attention_mask=batch.attention_mask.to(model.device),
            pad_token_id=tokenizer.pad_token_id,
            generation_config=GenerationConfig(do_sample=False, max_new_tokens=50, trust_remote_code=True)
        )

    output_strs = []
    # Only the first four outputs are printed; those prompts also appear in the smaller batch below.
    for output_id in output_ids.tolist()[:4]:
        # Strip the (left-padded) prompt tokens and decode only the generated continuation.
        tmp = tokenizer.decode(output_id[batch.input_ids.shape[-1]:], skip_special_tokens=True)
        output_strs.append(tmp)
        print(tmp)
        print('----------------------------------------------------')


    print('############### Now we decrease the batch size #############################')

    question = [
        'Can you explain to me what is the concept of deep learning and how it can be applied to NLP?',
        'Where am I supposed to eat dinner',
        'How hard is it to find a doctor in Canada',
        'What is the best price of vegetables',
    ]

    batch = tokenizer(
        question,
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        output_ids = model.generate(
            batch.input_ids.to(model.device),
            attention_mask=batch.attention_mask.to(model.device),
            pad_token_id=tokenizer.pad_token_id,
            generation_config=GenerationConfig(do_sample=False, max_new_tokens=50, trust_remote_code=True)
        )

    output_strs = []
    for output_id in output_ids.tolist():
        tmp = tokenizer.decode(output_id[batch.input_ids.shape[-1]:], skip_special_tokens=True)
        output_strs.append(tmp)
        print(tmp)
        print('----------------------------------------------------')

Expected behavior

With greedy decoding (do_sample=False), the outputs for the same prompts should be identical and should not be affected by the batch size.
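
For reference, the expectation can be checked programmatically rather than by eye. Below is a minimal sketch, assuming `model`, `tokenizer`, `GenerationConfig`, and the prompts from the reproduction above are still in scope; the helper `generate_greedy` and the `prompts_8` placeholder are only for illustration and are not part of any library.

def generate_greedy(prompts):
    # Same settings as the reproduction: left-padded batch, greedy decoding.
    batch = tokenizer(prompts, padding=True, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            batch.input_ids.to(model.device),
            attention_mask=batch.attention_mask.to(model.device),
            pad_token_id=tokenizer.pad_token_id,
            generation_config=GenerationConfig(do_sample=False, max_new_tokens=50),
        )
    # Drop the (left-padded) prompt tokens and keep only the generated continuation.
    return [
        tokenizer.decode(ids[batch.input_ids.shape[-1]:], skip_special_tokens=True)
        for ids in output_ids.tolist()
    ]

# `prompts_8` stands for the first (eight-question) list from the reproduction above.
full_batch = generate_greedy(prompts_8)
small_batch = generate_greedy(prompts_8[:4])
for i, (a, b) in enumerate(zip(full_batch[:4], small_batch)):
    print(f"prompt {i}: {'MATCH' if a == b else 'MISMATCH'}")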
