all_gather with gloo backend does not work in inference mode #126032

@youkaichao

🐛 Describe the bug

A minimal reproducible example:

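import torch
import torch.distributed as dist

# With no init_method given, rendezvous uses environment variables set by the launcher (e.g. torchrun).
dist.init_process_group(backend='gloo')
# The nccl backend does not hit the error:
# dist.init_process_group(backend='nccl')
# torch.cuda.set_device(dist.get_rank())
with torch.inference_mode():
    # Every list entry aliases the same tensor, which is created inside inference mode.
    data = [torch.ones((3, 3))] * dist.get_world_size()
    obj = data[dist.get_rank()]
    dist.all_gather(data, obj)  # raises the RuntimeError below with gloo
    # dist.broadcast(obj, src=0)  # works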

The error is:

E RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See pytorch/rfcs#17 for more details.

Strangely, the nccl backend works in this case, and broadcast works too. Only all_gather fails.

Versions

PyTorch 2.3.0

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

Labels: module: c10d, oncall: distributed, triaged