
Environments #3367


Closed
wants to merge 53 commits into from

Conversation

August-murr
Contributor

Environments for Customized Rollouts with vLLM and GRPO

Environments allow customized rollouts and direct use of the model during training with vLLM and GRPO.

Example: Creating a Custom Environment

from trl import Environment, VLLMClientGenerationConfig

class MyCustomEnv(Environment):
    def __init__(self, *args, **kwargs):
        # Initialize anything your agent needs, like code execution,
        # tool calls, or external APIs.
        pass

    def tool_call(self, *args, **kwargs):
        # Define any environment-specific methods you want to use.
        pass

    def generate(self, vllm_client, generation_config: VLLMClientGenerationConfig, prompts: list[str]) -> list:
        # This is the method GRPO training uses to generate responses.
        # It must return a list of completion IDs (tokenized responses).
        completion_ids = []  # collected from vllm_client inside your rollout loop
        return completion_ids

CodeAgentEnvironment

A built-in CodeAgentEnvironment is included, which wraps a code executor such as the E2BExecutor shown below.

Example setup:

from transformers import AutoTokenizer
from trl import CodeAgentEnvironment, E2BExecutor

tokenizer = AutoTokenizer.from_pretrained("your_model")  # same tokenizer as the model you train
code_executor = E2BExecutor(api_key="YOUR_E2B_TOKEN")
my_env = CodeAgentEnvironment(
    code_executor=code_executor,
    tokenizer=tokenizer,
    parsing_string="<code>",
    stop_string="</code>",
)
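
Conceptually, parsing_string and stop_string delimit the snippet the executor should run: the environment looks for text between <code> and </code>, executes it, and feeds the output back to the model. As a rough illustration of that convention (not the actual CodeAgentEnvironment internals), the extraction step could look like this:

# Illustration only: pull out the code block delimited by parsing_string/stop_string.
def extract_code(completion, parsing_string="<code>", stop_string="</code>"):
    start = completion.find(parsing_string)
    if start == -1:
        return None  # the model never opened a code block
    start += len(parsing_string)
    end = completion.find(stop_string, start)
    if end == -1:
        end = len(completion)
    return completion[start:end].strip()

print(extract_code("Let me compute it. <code>print(2 + 2)</code>"))  # -> print(2 + 2)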

Running the Environment

You can use the environment manually with a vLLM server to test it and observe its behavior:

Start a vLLM server:

trl vllm-serve --model "your_model"

Then run the agent:

from trl import VLLMClientGenerationConfig, VLLMClient

client = VLLMClient()
gen_config = VLLMClientGenerationConfig(
    n=8,
    repetition_penalty=1.0,
    temperature=0.8,
    top_p=0.9,
    top_k=10,
    min_p=0.0,
    max_tokens=256,
)

prompts = ["Write Python code to compute the sum of the first 100 integers."]  # any list of prompt strings

responses = my_env.run_agent(  # Main method in CodeAgentEnvironment to generate responses
    vllm_client=client,
    generation_config=gen_config,
    prompts=prompts,
)
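
run_agent follows the same contract as generate above, so to inspect what the agent actually produced you can decode the results with the tokenizer. A minimal sketch, assuming responses is a list of completion token-ID lists:

# Decode the completion IDs back to text for inspection.
texts = tokenizer.batch_decode(responses, skip_special_tokens=True)
for text in texts:  # with n=8, expect 8 completions per prompt
    print(text)
    print("-" * 80)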

Or, use the environment for training:

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="code_agent",
    use_vllm=True,
    # ... other config options ...
)

trainer = GRPOTrainer(
    model=...,                # Your model or model name
    reward_funcs=...,         # Your reward function(s)
    args=training_args,
    environment=my_env,       # Your custom or built-in environment
    # ... other trainer args ...
)
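
For the reward_funcs placeholder above, here is a minimal sketch of a reward that checks whether a completion actually used the code interpreter, assuming the usual GRPOTrainer convention of a callable that receives the completions and returns one float per completion:

# Hedged sketch: reward 1.0 if the completion contains a delimited code block, else 0.0.
# Assumes completions arrive as plain strings; adapt if your dataset is conversational.
def used_code_interpreter(completions, **kwargs):
    return [1.0 if "<code>" in completion else 0.0 for completion in completions]

This could then be passed as reward_funcs=used_code_interpreter in the trainer above; it is similar in spirit to the criterion used in the demo run below.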

Related Work

While finishing this PR, ByteDance published ReTool: Reinforcement Learning for Strategic Tool Use in LLMs.
CodeAgentEnvironment in this PR implements a similar idea, with two key differences:

  1. ReTool uses PPO, while we use GRPO.
  2. ReTool masks interpreter feedback — an idea I'll be investigating and possibly adding later.

Other Notes

  • I added a stop_string parameter to the vLLM CLI; it's critical for agentic workflows where you want to cleanly delimit model outputs.

Demo Run

To test, I set up a simple math task with a one-shot prompt for Qwen-0.5B-instruct.
The reward function checked if the agent used the code interpreter during its response.

[W&B chart: Code Interpreter Call Frequency]

Results:
The base model had a 14% code interpreter call frequency.
After 24 training steps, the frequency increased to almost 100%.

@August-murr
Contributor Author

@qgallouedec @shirinyamani Can I get a review?

@August-murr
Contributor Author

August-murr commented May 9, 2025

To validate the feature, I was thinking we could try teaching a model to solve GSM8K problems with Python.

I got it! I'll need to test out some of the new changes and ideas, and maybe train a bit on using tools like web search to make sure everything is running properly.

@lewtun mentioned this pull request May 9, 2025
@August-murr marked this pull request as draft May 10, 2025 20:29
@August-murr
Contributor Author

August-murr commented Jun 3, 2025

@qgallouedec @lewtun @shirinyamani

Here's the report and the script for the GSM8K code agent trained with this branch. The approach is very similar to Microsoft's paper titled "Artist".
I recommend reviewing the table from the W&B run to see the results for yourself.

A few notes:

  • I implemented tool output masking, which may or may not have been necessary. However, it was included in nearly every research paper addressing similar issues, such as ReTool. Keep in mind that this will complicate customized environments since each one needs to output a completion mask that masks the tool outputs.
  • To make tool output masking work properly, I had to add tokens to the tokenizer and resize the model embeddings before training. If we weren't using vLLM, this would have been as simple as adding a few lines to the GRPOTrainer script. However, since vLLM pulls the model directly from a local source or the Hugging Face Hub, the tokens must be added separately (a minimal example of this preparation step is sketched below). I will write the documentation for this as we begin the review process.
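
For reference, the separate preparation step can be done with standard transformers APIs along these lines; the specific special tokens and the output path shown here are placeholders, not necessarily the ones used in the run:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical example: add the environment's delimiter tokens, resize the embeddings,
# and save a prepared checkpoint that vLLM can then load directly.
model_name = "your_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer.add_special_tokens({"additional_special_tokens": ["<code>", "</code>"]})
model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained("your_model_with_tool_tokens")
model.save_pretrained("your_model_with_tool_tokens")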

What's next?

@August-murr
Contributor Author

@lewtun @qgallouedec @shirinyamani

Whatever
