
Various minor PPO refactors #167

@vwxyzjn

Description


Problem Description

Many of these formatting changes were suggested by @Howuhh.

1. next_done refactor

The current code for handling done looks like this:

            next_obs, reward, done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(done).to(device)

which is fine, but it became an issue when I tried to adapt the code for IsaacGym. Since IsaacGym already returns tensors on the device, I assumed the to(device) conversion was no longer needed and just did

            next_obs, reward, done, info = envs.step(action)

but this is wrong: dropping the conversion lines also drops the assignment next_done = done, so next_done is never updated. Naming the step output done and only later converting it via next_done = torch.Tensor(done).to(device) obscures that the value is really next_done all along.

We should refactor it to

            next_obs, reward, next_done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(next_done).to(device)
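To see why the rename matters, here is a minimal runnable sketch (DummyVecEnv is a hypothetical stand-in for an IsaacGym-style vectorized env whose step outputs are already usable as-is): once the third return value is bound directly to next_done, the two conversion lines can simply be deleted without leaving next_done stale.

```python
class DummyVecEnv:
    """Hypothetical stand-in for a vectorized env that, like IsaacGym,
    returns step outputs that need no further conversion."""

    def __init__(self, num_envs):
        self.num_envs = num_envs

    def step(self, action):
        next_obs = [0.0] * self.num_envs
        reward = [1.0] * self.num_envs
        next_done = [False] * self.num_envs
        return next_obs, reward, next_done, {}


envs = DummyVecEnv(num_envs=4)
action = [0] * 4

# Because the step output is already named next_done, dropping the
# tensor-conversion lines cannot silently leave next_done unassigned:
next_obs, reward, next_done, info = envs.step(action)
```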

2. make_env refactor

if capture_video:
    if idx == 0:
        env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")

to

if capture_video and idx == 0:
    env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")

3. flatten batch

        b_obs = obs.reshape((-1,) + envs.single_observation_space.shape)
        b_logprobs = logprobs.reshape(-1)
        b_actions = actions.reshape((-1,) + envs.single_action_space.shape)
        b_advantages = advantages.reshape(-1)
        b_returns = returns.reshape(-1)
        b_values = values.reshape(-1)

to

        b_obs = obs.flatten(0, 1)
        b_actions = actions.flatten(0, 1)
        b_logprobs = logprobs.reshape(-1)
        b_returns = returns.reshape(-1)
        b_advantages = advantages.reshape(-1)
        b_values = values.reshape(-1)
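The two spellings are equivalent; flatten(0, 1) just merges the (num_steps, num_envs) leading dims without restating the observation/action shape. A small numpy analogue (shapes are illustrative; numpy has no flatten(0, 1), so the merged-dims form is written out explicitly):

```python
import numpy as np

# Illustrative rollout buffer shape: (num_steps, num_envs, *obs_shape)
num_steps, num_envs, obs_shape = 128, 4, (8,)
obs = np.arange(num_steps * num_envs * 8, dtype=np.float32).reshape(
    (num_steps, num_envs) + obs_shape
)

# Old style: reshape with the trailing shape restated explicitly.
b_obs_reshape = obs.reshape((-1,) + obs_shape)

# New style: merge only the first two dims, which is what torch's
# obs.flatten(0, 1) does without mentioning obs_shape at all.
b_obs_flatten = obs.reshape((num_steps * num_envs,) + obs.shape[2:])
```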

4. target_kl check


            if args.target_kl is not None:
                if approx_kl > args.target_kl:
                    break

to

            if args.target_kl is not None and approx_kl > args.target_kl:
                break

5. global_step increment

global_step += 1 * args.num_envs

to

global_step += args.num_envs

6. num_updates placement

move

num_updates = args.total_timesteps // args.batch_size

to the argparse section, so it lives next to the arguments it is derived from.
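A minimal sketch of what that could look like (flag names and defaults are illustrative, not the actual cleanrl values): the derived quantities are computed once, right after parse_args, instead of being recomputed later in the training script.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # illustrative defaults, not the actual cleanrl ones
    parser.add_argument("--total-timesteps", type=int, default=25000)
    parser.add_argument("--num-envs", type=int, default=4)
    parser.add_argument("--num-steps", type=int, default=128)
    args = parser.parse_args(argv)
    # derived quantities live next to the arguments they depend on
    args.batch_size = args.num_envs * args.num_steps
    args.num_updates = args.total_timesteps // args.batch_size
    return args
```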
