Using torch expandable segments with Megatron breaks vLLM refit on Ampere. #522

@SahilJain314

Describe the bug
Running refit_policy_generation with a Megatron policy whose torch allocator is configured with 'expandable_segments:true' causes the colocated vLLM refit to fail with:

(VllmGenerationWorker pid=36530) Error in VllmInternalWorkerExtension.update_weights_from_ipc_handles: pidfd_getfd: Operation not permitted

This happens on Ampere. It seems to work fine on Hopper; Blackwell is still to be determined.
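
To help triage, here is a minimal pre-flight check that flags the combination reported above (expandable segments enabled on an Ampere GPU). It is only a sketch, not part of the codebase: the helper names are hypothetical, it assumes the setting comes from the PYTORCH_CUDA_ALLOC_CONF environment variable, and it treats compute capabilities 8.0, 8.6 and 8.7 as Ampere. It detects the configuration only; it does not change or fix the refit path.

```python
import os


def expandable_segments_enabled(alloc_conf: str) -> bool:
    """Return True if the allocator config string enables expandable segments.

    The string is a comma-separated list of key:value options, for example
    "max_split_size_mb:128,expandable_segments:True".
    """
    enabled = False
    for entry in alloc_conf.split(","):
        key, sep, value = entry.partition(":")
        if sep and key.strip() == "expandable_segments":
            # If the key appears more than once, keep the last value seen.
            enabled = value.strip().lower() == "true"
    return enabled


def is_ampere(capability: tuple) -> bool:
    """Assumption: Ampere corresponds to compute capability 8.0, 8.6 or 8.7."""
    major, minor = capability
    return major == 8 and minor in (0, 6, 7)


def refit_at_risk(alloc_conf: str, capability: tuple) -> bool:
    """Hypothetical helper: True for the combination described in this issue."""
    return expandable_segments_enabled(alloc_conf) and is_ampere(capability)


if __name__ == "__main__":
    conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    # With torch installed, the capability tuple could come from
    # torch.cuda.get_device_capability(); (8, 0) is a placeholder here.
    capability = (8, 0)
    if refit_at_risk(conf, capability):
        print("expandable_segments is enabled on Ampere; colocated refit may fail")
    else:
        print("reported combination not detected")
```

The capability is passed in as a plain tuple so the check can run without a GPU or a torch install.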

Labels: bug (Something isn't working)
