@FightingZhen FightingZhen commented Jun 11, 2025

Checklist Before Starting

  • Searched for similar PR(s).
  • Checked PR Title format
    • In format of: [modules] type: Title
    • modules are in fsdp, megatron, sglang, vllm, rollout, trainer, tests, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc
    • type is in feat, fix, refactor, chore
    • can involve multiple modules, separated by , or space, like [megatron, fsdp, doc] feat: xxx
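The title rules above can be sketched as a small validator. This is only an illustration of the stated format, not the project's actual CI check; the names `ALLOWED_MODULES`, `ALLOWED_TYPES`, `TITLE_RE`, and `is_valid_pr_title` are hypothetical.

```python
import re

# Module and type names copied from the checklist above.
ALLOWED_MODULES = {
    "fsdp", "megatron", "sglang", "vllm", "rollout", "trainer", "tests",
    "training_utils", "recipe", "hardware", "deployment", "ray", "worker",
    "single_controller", "misc", "perf", "model", "algo", "env", "tool",
    "ckpt", "doc",
}
ALLOWED_TYPES = {"feat", "fix", "refactor", "chore"}

# "[modules] type: Title" with a non-empty title.
TITLE_RE = re.compile(r"^\[(?P<modules>[^\]]+)\]\s+(?P<type>\w+):\s+(?P<title>\S.*)$")


def is_valid_pr_title(title: str) -> bool:
    """Return True if the title matches "[modules] type: Title"."""
    match = TITLE_RE.match(title)
    if match is None:
        return False
    # Modules may be separated by commas, spaces, or both.
    modules = [m for m in re.split(r"[,\s]+", match.group("modules")) if m]
    return (
        bool(modules)
        and all(m in ALLOWED_MODULES for m in modules)
        and match.group("type") in ALLOWED_TYPES
    )
```

For example, `is_valid_pr_title("[megatron, fsdp, doc] feat: xxx")` is accepted, while a title without a `type:` part after the bracket is rejected.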

What does this PR do?

Refactor device management, such as the use of torch.cuda and nccl, in most of the code under verl/recipe and verl/verl, which makes it more convenient to support other devices or platforms.

Test

Not related.

High-Level Design

Not related.

Specific Changes

  1. Use get_torch_device() to get the corresponding torch.device() object for the specific device.
  2. Use get_device_id() to get the corresponding device rank index for the specific device.
  3. Use get_nccl_backend() to get the corresponding nccl backend for the specific device.
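The idea behind the three helpers can be sketched as follows. This is a minimal illustration, not the code from this PR: the real signatures and return types are not shown in the description, so the functions below use hypothetical names (`resolve_device_name`, `resolve_device_namespace`, `resolve_device_id`, `resolve_collective_backend`). The sketch assumes that the helpers dispatch on whether an NPU backend is available, that "hccl" is the collective backend on NPU, and that the torch module is passed in as an argument so the example runs without any accelerator. `fake_torch` is a stand-in used only for demonstration.

```python
from types import SimpleNamespace


def resolve_device_name(torch_mod) -> str:
    """Return "npu" if torch_mod exposes an available NPU backend, else "cuda"."""
    npu = getattr(torch_mod, "npu", None)
    if npu is not None and npu.is_available():
        return "npu"
    return "cuda"


def resolve_device_namespace(torch_mod):
    """Return the backend namespace (torch_mod.npu or torch_mod.cuda)."""
    return getattr(torch_mod, resolve_device_name(torch_mod))


def resolve_device_id(torch_mod) -> int:
    """Return the index of the current device on the active backend."""
    return resolve_device_namespace(torch_mod).current_device()


def resolve_collective_backend(torch_mod) -> str:
    """Return the collective communication backend name for the active device."""
    # Assumption: "hccl" on NPU, "nccl" on CUDA.
    return "hccl" if resolve_device_name(torch_mod) == "npu" else "nccl"


def fake_torch(npu_available: bool, device_index: int = 0):
    """Build a stand-in for the torch module (hypothetical, for demonstration only)."""
    def backend(available: bool):
        return SimpleNamespace(
            is_available=lambda: available,
            current_device=lambda: device_index,
        )

    mod = SimpleNamespace(cuda=backend(not npu_available))
    if npu_available:
        mod.npu = backend(True)
    return mod
```

With a real installation, a call site would pass the imported `torch` module instead of `fake_torch(...)`, so code such as `torch.cuda.current_device()` becomes a call that no longer names the backend directly.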

API

Not related.

Usage Example

Modifications in this PR should not be noticeable.

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title description if it breaks any API.
  • Update the documentation about your changes in the docs.
  • New CI unit test(s) are added to cover the code path.
  • Rely on existing unit tests on CI that cover the code path.

@CLAassistant

CLAassistant commented Jun 11, 2025

CLA assistant check
All committers have signed the CLA.

@FightingZhen FightingZhen changed the title Feat: refactor part of device management [Refactor] refactor part of device management Jun 11, 2025
@FightingZhen FightingZhen changed the title [Refactor] refactor part of device management refactor part of device management Jun 11, 2025
@FightingZhen FightingZhen changed the title refactor part of device management [hardware] refactor part of device management Jun 11, 2025
@FightingZhen FightingZhen changed the title [hardware] refactor part of device management [hardware] refactor: refactor part of device management Jun 11, 2025

@eric-haibin-lin eric-haibin-lin left a comment

please fill in the PR template. thank you

@FightingZhen FightingZhen force-pushed the feat_device_refactor branch from 6657b25 to d191000 Compare June 12, 2025 02:39
@FightingZhen FightingZhen changed the title [hardware] refactor: refactor part of device management [WIP][hardware] refactor: refactor part of device management Jun 12, 2025
@FightingZhen FightingZhen changed the title [WIP][hardware] refactor: refactor part of device management [hardware] refactor: refactor part of device management Jun 12, 2025
@FightingZhen

> please fill in the PR template. thank you

I have filled in the PR template. Thank you for the reminder :)

@FightingZhen FightingZhen reopened this Jun 12, 2025
@FightingZhen FightingZhen force-pushed the feat_device_refactor branch 7 times, most recently from 787e566 to 574faad Compare June 14, 2025 01:58
vermouth1992 previously approved these changes Jun 14, 2025
@FightingZhen FightingZhen force-pushed the feat_device_refactor branch 2 times, most recently from b0427aa to 1b51a6d Compare June 14, 2025 06:45
@FightingZhen FightingZhen force-pushed the feat_device_refactor branch from 1b51a6d to 8d0d8fc Compare June 14, 2025 07:30
fix pre-commit error
@FightingZhen FightingZhen force-pushed the feat_device_refactor branch from 8d0d8fc to 7b68b95 Compare June 14, 2025 07:36
@vermouth1992 vermouth1992 merged commit ca65c36 into volcengine:main Jun 14, 2025
40 of 44 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 18, 2025
Tyizhanshen pushed a commit to HyperdriveHustle/verl that referenced this pull request Jul 1, 2025