Environments #3367

August-murr · 2025-04-26T11:54:59Z

Environments for Customized Rollouts with VLLM and GRPO

Environments allows for customized rollouts and usage of the model during training with VLLM and GRPO.

Example: Creating a Custom Environment

from trl import Environment

class MyCustomEnv(Environment):
    def __init__(self, ...):
        # Initialize anything your agent needs, like code execution,
        # tool calls, or external APIs.
        pass

    def tool_call(self, ...):
        # Define any environment-specific methods you want to use.
        pass

    def generate(self, vllm_client, generation_config: VLLMClientGenerationConfig, prompts: list[str]) -> list:
        # This is the method GRPO training uses to generate responses.
        # It must return a list of completion IDs (tokenized responses).
        return completion_ids

`CodeAgentEnvironment`

A built-in CodeAgentEnvironment is included, which can be used with either:

E2B code interpreter
A local code interpreter

Example setup:

from trl import CodeAgentEnvironment, E2BExecutor

code_executor = E2BExecutor(api_key="YOUR_E2B_TOKEN")
my_env = CodeAgentEnvironment(
    code_executor=code_executor,
    tokenizer=tokenizer,
    parsing_string="<code>",
    stop_string="</code>",
)

Running the Environment

You can use the environment manually with a VLLM server to test and observe:

Start a VLLM server:

trl vllm-serve --model "your_model"

Then run the agent:

from trl import VLLMClientGenerationConfig, VLLMClient

client = VLLMClient()
gen_config = VLLMClientGenerationConfig(
    n=8,
    repetition_penalty=1.0,
    temperature=0.8,
    top_p=0.9,
    top_k=10,
    min_p=0.0,
    max_tokens=256,
)

responses = my_env.run_agent(  # Main method in CodeAgentEnvironment to generate responses
    vllm_client=client,
    generation_config=gen_config,
    prompts=prompts
)

Or, use the environment for training:

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="code_agent",
    use_vllm=True,
    # ... other config options ...
)

trainer = GRPOTrainer(
    model=...,                # Your model or model name
    reward_funcs=...,         # Your reward function(s)
    args=training_args,
    environment=my_env,       # Your custom or built-in environment
    # ... other trainer args ...
)

Related Work

While finishing this PR, ByteDance published ReTool: Reinforcement Learning for Strategic Tool Use in LLMs.
This PR with CodeAgentEnvironment is a similar idea, with two key differences:

ReTool uses PPO, while we use GRPO.
ReTool masks interpreter feedback — an idea I'll be investigating and possibly adding later.

Other Notes

I added a stop_string parameter to the VLLM CLI — it’s critical for agentic workflows where you want to cleanly delimit model outputs.

Demo Run

To test, I set up a simple math task with a one-shot prompt for Qwen-0.5B-instruct.
The reward function checked if the agent used the code interpreter during its response.

Here's the wandb run.
You can specifically check:
- train/rewards/count_function_calls_per_response/mean
- Code Interpreter Call Frequency (a custom evaluation trace with a larger n for lower noise)

Results:
The base model had a 14% code interpreter call frequency.
After 24 training steps, the frequency increased to almost 100%.

August-murr · 2025-05-01T07:03:51Z

@qgallouedec @shirinyamani Can I get a review?

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

August-murr · 2025-05-09T12:30:14Z

To validate the feature, I was thinking we could try teaching a model to solve GSM8k problems with Python.

I got it! I’ll need test out some of the new changes and ideas, and maybe train a bit on using tools like web search to make sure everything is running properly.

Merge branch 'main' into environments

August-murr · 2025-06-03T12:03:13Z

@qgallouedec @lewtun @shirinyamani

Here's the report and the script for the GSM8K code agent trained with the branch. It is very similar to Microsoft's paper titled "Artist".
I recommend reviewing the table from the W&B run to see the results for yourself.

A few notes:

I implemented tool output masking, which may or may not have been necessary. However, it was included in nearly every research paper addressing similar issues, such as ReTool. Keep in mind that this will complicate customized environments since each one needs to output a completion mask that masks the tool outputs.
To ensure proper functionality of tool output masking, I had to add tokens to the tokenizer and resize the model embedding before training. If we weren't using VLLM, this would have been as simple as adding a few lines to the GRPOTrainer script. However, since VLLM pulls the model directly from a local source or the Hugging Face hub, adding the tokens must be done separately. I will write the documentation for this as we begin the review process.

Whats next?

August-murr · 2025-06-21T09:53:45Z

@lewtun @qgallouedec @shirinyamani

Whatever

August-murr added 29 commits April 11, 2025 11:30

initial commit

baeb4f2

profiling_context

fdba1e2

adding default environment

d7e1ff6

()

6045677

importing from trl

11a55a4

import forwarding for VLLMClient

7b98461

adding stop to vllm_client

49b69ec

CodeAgentEnvironment

000bb3d

E2BExecuter initializer validation

8d05a2c

import forwarding CodeAgentEnvironment

ff1320a

**

8426313

adding stop as a parameter for vllm serve

be54fd2

re export and code agent fix

61197cb

removde duplicated tag

14a31b2

refactoring codeagent

8a345e4

fixing multi step

a8d1910

cleaning some unnecessary comments

e2a4100

()

ad8e45f

more ()

379d9a6

optional dependncies

aecffec

precommit fixes

d79357a

Merge branch 'main' into environments

f1dbe82

adding test

3c7bdbb

we should stop killing all the code

0776d4c

docs and test fix

11c59db

adding training agents guide

696935d

raise error when vllm is false

e688cfa

Merge branch 'main' into environments

6178f35

E

bdb9411

August-murr and others added 5 commits May 9, 2025 10:23

Apply suggestions from code review

eba7af3

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

Update trl/trainer/grpo_trainer.py

a12702e

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

removing langchain_experimental dependancy

e787b82

moving VLLMClientGenerationConfig

405ff32

stylwe

3b33a7a

August-murr added 2 commits May 9, 2025 13:04

Merge 'main' into environments

234f179

added dependancies

1ab6815

lewtun mentioned this pull request May 9, 2025

Deprecate TextEnvironment and tools #3389

Merged

5 tasks

August-murr added 2 commits May 10, 2025 10:49

duplicated import

e51d31b

adding completion mask for default environment

70495e1

August-murr marked this pull request as draft May 10, 2025 20:29

August-murr added 15 commits May 11, 2025 08:04

completion mask for codeagent environment

cf7ae17

different completion mask whn using environments

80c1836

if statement fix

097b792

changing some defaults

a9bad64

debugging print statements for completion mask

3361688

ModalCodeExecuter

41c55db

colocate isnt supported for environments

44b1d9d

prompt to prompts in prapre function

9d25371

env_completion_mask for multi processing

4220d37

fix enc completion ids for accelerate

9b7aaa3

more rubust prompt parsing

232b3ff

Merge branch 'main' into environments

2ffd4ee

Merge branch 'main' into environments

completions for multiple prompts

a70b7df

max len for multi step generation

280471a

Merge branch 'main' into environments

3ae9131

August-murr closed this Jun 21, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Environments #3367

Environments #3367

Uh oh!

August-murr commented Apr 26, 2025

Uh oh!

August-murr commented May 1, 2025

Uh oh!

August-murr commented May 9, 2025 •

edited

Loading

Uh oh!

August-murr commented Jun 3, 2025 •

edited

Loading

Uh oh!

August-murr commented Jun 21, 2025

Uh oh!

Uh oh!

Environments #3367

Environments #3367

Uh oh!

Conversation

August-murr commented Apr 26, 2025

Environments for Customized Rollouts with VLLM and GRPO

Example: Creating a Custom Environment

CodeAgentEnvironment

Running the Environment

Related Work

Other Notes

Demo Run

Uh oh!

August-murr commented May 1, 2025

Uh oh!

August-murr commented May 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

August-murr commented Jun 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

August-murr commented Jun 21, 2025

Uh oh!

Uh oh!

`CodeAgentEnvironment`

August-murr commented May 9, 2025 •

edited

Loading

August-murr commented Jun 3, 2025 •

edited

Loading