Prototype multi-gpu support with PPO #162
Here is a script for understanding how gradient accumulation and data parallelism work:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def init_process(rank, size, fn, backend="gloo"):
    """Initialize the distributed environment."""
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


def train(rank: int, size: int):
    prediction = torch.tensor(
        [[1., 2., 3., 4.], [4., 7., 5., 8.]],
        requires_grad=True)
    label = torch.tensor([[0., 0., 0., 0.], [0., 0., 0., 0.]])
    loss = (prediction[rank] - label[rank]) ** 2
    loss.mean().backward()
    dist.all_reduce(prediction.grad.data, op=dist.ReduceOp.SUM)
    prediction.grad.data /= size
    print("gradient with data parallelism (multi-gpu) \n", prediction.grad)


if __name__ == "__main__":
    prediction = torch.tensor(
        [[1., 2., 3., 4.], [4., 7., 5., 8.]],
        requires_grad=True)
    label = torch.tensor([[0., 0., 0., 0.], [0., 0., 0., 0.]])
    loss = (prediction - label) ** 2
    loss.mean().backward()
    print("gradient with the whole batch\n", prediction.grad)

    prediction = torch.tensor(
        [[1., 2., 3., 4.], [4., 7., 5., 8.]],
        requires_grad=True)
    label = torch.tensor([[0., 0., 0., 0.], [0., 0., 0., 0.]])
    # do the backward pass in two minibatches
    loss = (prediction[0] - label[0]) ** 2
    loss.mean().backward()
    loss = (prediction[1] - label[1]) ** 2
    loss.mean().backward()
    # divide the accumulated gradient by the number of minibatches
    print("gradient accumulation \n", prediction.grad / 2)

    size = 2
    processes = []
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, train))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

The script yields the following output:
Attempt 1

The sample efficiency seems to suffer, as shown above; however, the wall-time performance is pretty good. My suspicion for the sample efficiency regression is that policy gradient averaging is trickier than value gradient averaging: see #162 (comment)

Options 3 and 4 are pretty impressive: they reduce the wall-time by half while using only a single GPU. Maybe with multiple GPUs the speedup can be even greater?

Some notes
The following script (see here for the full script) demonstrates that such a practice results in the same gradient for the value function, but not for the policy function:

```python
optimizer.zero_grad()
start = 0
end = start + args.minibatch_size
mb_inds = b_inds[start:end]
fit_vloss(mb_inds)
print()
print(f"CASE 1: value function: forward and backward pass of the minibatch (size 256: i.e., data[0:256])")
print("agent.critic.weight.grad.sum() =", agent.critic.weight.grad.sum())

optimizer.zero_grad()
args.minibatch_size = 128
for start in [0, 128]:
    end = start + args.minibatch_size
    mb_inds = b_inds[start:end]
    fit_vloss(mb_inds)
print()
print(f"CASE 2: value function: forward and backward pass of 2 minibatches (size 128: i.e., data[0:128] and data[128:256])")
print("agent.critic.weight.grad.sum() / 2 =", agent.critic.weight.grad.sum() / 2)

optimizer.zero_grad()
start = 0
end = start + args.minibatch_size
mb_inds = b_inds[start:end]
fit_pgloss(mb_inds)
print()
print(f"CASE 3: policy function: forward and backward pass of the minibatch (size 256: i.e., data[0:256])")
print("agent.actor.weight.grad.sum() =", agent.actor.weight.grad.sum())

optimizer.zero_grad()
args.minibatch_size = 128
for start in [0, 128]:
    end = start + args.minibatch_size
    mb_inds = b_inds[start:end]
    fit_pgloss(mb_inds)
print()
print(f"CASE 4: policy function: forward and backward pass of 2 minibatches (size 128: i.e., data[0:128] and data[128:256])")
print("agent.actor.weight.grad.sum() / 2 =", agent.actor.weight.grad.sum() / 2)
```
Here is an even simpler demo of the issue:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions.categorical import Categorical


class Agent(nn.Module):
    def __init__(self, action_n):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
        )
        self.actor = nn.Linear(512, action_n)
        self.critic = nn.Linear(512, 1)

    def get_value(self, x):
        return self.critic(self.network(x / 255.0))

    def get_action_and_value(self, x, action=None):
        hidden = self.network(x / 255.0)
        logits = self.actor(hidden)
        probs = Categorical(logits=logits)
        if action is None:
            action = probs.sample()
        return action, probs.log_prob(action), probs.entropy(), self.critic(hidden)


# setup
agent = Agent(4)
optimizer = optim.Adam(agent.parameters())
next_obs = torch.rand(8, 4, 84, 84)
action, newlogprob, entropy, newvalue = agent.get_action_and_value(next_obs)

optimizer.zero_grad()
_, newlogprob, _, newvalue = agent.get_action_and_value(next_obs, action)
newvalue.mean().backward()
print(f"`agent.critic.weight.grad.sum() = {agent.critic.weight.grad.sum()}` after fitting value loss using data[0:8]")

optimizer.zero_grad()
_, newlogprob, _, newvalue = agent.get_action_and_value(next_obs[0:4], action[0:4])
newvalue.mean().backward()
_, newlogprob, _, newvalue = agent.get_action_and_value(next_obs[4:8], action[4:8])
newvalue.mean().backward()
print(f"`agent.critic.weight.grad.sum() / 2 = {agent.critic.weight.grad.sum() / 2}` after fitting value loss using data[0:4] and data[4:8] respectively")

optimizer.zero_grad()
_, newlogprob, _, newvalue = agent.get_action_and_value(next_obs, action)
newlogprob.mean().backward()
print(f"`agent.actor.weight.grad.sum() = {agent.actor.weight.grad.sum()}` after fitting policy loss using data[0:8]")

optimizer.zero_grad()
_, newlogprob, _, newvalue = agent.get_action_and_value(next_obs[0:4], action[0:4])
newlogprob.mean().backward()
_, newlogprob, _, newvalue = agent.get_action_and_value(next_obs[4:8], action[4:8])
newlogprob.mean().backward()
print(f"`agent.actor.weight.grad.sum() / 2 = {agent.actor.weight.grad.sum() / 2}` after fitting policy loss using data[0:4] and data[4:8] respectively")
```
…otherwise `agent.module.get_action_and_value` won't trigger proper gradient sync.
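For context, one common pattern (a hypothetical sketch, not the code in this PR; `AgentWrapper` is a made-up name) is to route custom agent methods through the DDP wrapper's `forward`, since DDP only installs its gradient-synchronization hooks around `forward()`:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class AgentWrapper(nn.Module):
    """Dispatch custom agent methods through forward() so DDP's hooks run."""

    def __init__(self, agent: nn.Module):
        super().__init__()
        self.agent = agent

    def forward(self, method: str, *args, **kwargs):
        # e.g. method = "get_action_and_value" or "get_value"
        return getattr(self.agent, method)(*args, **kwargs)


# Usage (assuming a process group is already initialized):
#   ddp_agent = DDP(AgentWrapper(agent))
#   action, logprob, entropy, value = ddp_agent("get_action_and_value", next_obs)
```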
Attempt 2: much more successful

Fixed a couple of issues:
```python
args.num_envs = int(args.num_envs / size)
args.batch_size = int(args.num_envs * args.num_steps)
args.minibatch_size = int(args.batch_size // args.num_minibatches)
dist.init_process_group("gloo", rank=rank, world_size=size)
```
According to the PyTorch docs, you should use NCCL for multi-GPU training; Gloo is recommended for CPU training: https://pytorch.org/docs/stable/distributed.html
I don’t actually have a multi-GPU setup to try it out, and NCCL would break :) but this is something I should try to find a solution for.
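One possible approach (a minimal sketch, not tested in this PR; the helper name is made up) is to pick the backend based on whether CUDA is available:

```python
import os

import torch
import torch.distributed as dist


def init_distributed(rank: int, world_size: int):
    """Hypothetical helper: use NCCL when GPUs are available, otherwise fall back to Gloo."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    if backend == "nccl":
        torch.cuda.set_device(rank)  # pin each process to its own GPU
```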
https://wandb.ai/costa-huang/cleanRL/reports/Data-Parallelism-Experiment--VmlldzoxODI1OTY0
Ok, did more testing and it looks like
Worth testing it out with
Closing in favor of #178
Description
This PR contains some prototypes that bring multi-GPU support to PPO. There are many ways to do it, so this PR tries to compare different approaches.
I don't really have multiple GPUs to test this out, so I launch two processes accessing the same GPU, mainly to check whether they result in the same performance. In theory, multi-GPU support should not harm performance. However, I plan to test real multi-GPU performance to confirm that the sample efficiency is not affected.
Option 1: ppo_atari_multigpu.py

My first try is ppo_atari_multigpu.py, which uses PyTorch's low-level distributed API as shown in this link or this example.
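Roughly, the low-level approach manually all-reduces and averages gradients across processes after each backward pass; a minimal sketch (illustrative only, not the exact code in ppo_atari_multigpu.py):

```python
import torch.distributed as dist
import torch.nn as nn


def average_gradients(model: nn.Module, world_size: int):
    """Illustrative helper: average each parameter's gradient across all processes."""
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size


# After loss.backward() on each process's own minibatch:
#   average_gradients(agent, world_size)
#   optimizer.step()
```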
Option 2: ppo_atari_multigpu_batch_reduce.py

In this file, I adopted entity-neural-network/incubator#220 by batch-reducing the gradients.
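The idea is to flatten all gradients into one buffer so that only a single all-reduce call is issued per update; a rough sketch (assumed helper name, not the exact code from the incubator PR):

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def batch_average_gradients(model: nn.Module, world_size: int):
    """Assumed helper: one all_reduce over a single flattened gradient buffer."""
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    flat = torch.cat([g.view(-1) for g in grads])
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)
    flat /= world_size
    offset = 0
    for g in grads:
        g.copy_(flat[offset:offset + g.numel()].view_as(g))
        offset += g.numel()
```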
Option 3: ppo_atari_ddp.py

In this file, I adopted the high-level `DistributedDataParallel` API described here: https://pytorch.org/docs/stable/notes/ddp.html
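For reference, a minimal self-contained DDP example (CPU/Gloo so it runs anywhere; not this PR's actual training code) showing that gradients are averaged automatically during `backward()`:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(nn.Linear(4, 2))        # DDP syncs gradients inside backward()
    torch.manual_seed(rank)             # each rank sees different data
    out = model(torch.randn(8, 4))      # call the wrapper, not model.module
    out.mean().backward()
    print(rank, model.module.weight.grad.sum())  # identical on every rank
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```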
Option 4: ppo_atari_elastic.py

This file adopts https://pytorch.org/docs/stable/elastic/run.html. We can run the training script via

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=2 ppo_atari_elastic.py
```
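Under `torchrun`, the script typically reads the rank-related environment variables set by the launcher; a minimal sketch (assumes NCCL only when CUDA is available, and is not necessarily how ppo_atari_elastic.py does it):

```python
import os

import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for each worker process.
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend)  # rank/world size come from torchrun's env vars
device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
```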
The sample efficiency seems to suffer, as shown above; however, the wall-time performance is pretty good.
My suspicion for the sample efficiency regression is that policy gradient averaging is trickier than value gradient averaging: see #162 (comment)
Options 3 and 4 are pretty impressive: they reduce the wall-time by half while using only a single GPU. Maybe with multiple GPUs the speedup can be even greater?
Types of changes

Checklist:
- `pre-commit run --all-files` passes (required).
- Documentation changes previewed via `mkdocs serve`.

If you are adding new algorithms or your change could result in a performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.
- Tracked experiments with the `--capture-video` flag toggled on (required).
- Documentation previewed via `mkdocs serve`.
- Learning curves added (with `width=500` and `height=300`).