# Environments #3367
## Conversation
@qgallouedec @shirinyamani Can I get a review?
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
I got it! I'll need to test out some of the new changes and ideas, and maybe train a bit using tools like web search to make sure everything is running properly.
Merge branch 'main' into environments
@qgallouedec @lewtun @shirinyamani Here's the report and the script for the GSM8K code agent trained with this branch. It is very similar to Microsoft's paper titled "Artist". A few notes:
What's next?
@lewtun @qgallouedec @shirinyamani Whatever
## Environments for Customized Rollouts with vLLM and GRPO
Environments allow for customized rollouts and direct use of the model during training with vLLM and GRPO.
## Example: Creating a Custom Environment
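The original example code is not shown here, so below is a minimal sketch of the idea: an environment is an object that owns the rollout loop, so it can run multi-turn or tool-calling schemes before returning completions to the trainer. The class name, the `generate` method, and the use of `vllm.LLM` are all illustrative assumptions, not this PR's finalized API.

```python
# Minimal sketch of a custom environment. All names here (the class,
# the `generate` method, the use of vllm.LLM) are assumptions for
# illustration, not the finalized API of this PR.
from vllm import LLM, SamplingParams


class DoubleCheckEnvironment:
    """Toy two-turn rollout: generate, then ask the model to double-check."""

    def generate(
        self,
        llm: LLM,
        prompts: list[str],
        sampling_params: SamplingParams,
    ) -> list[str]:
        # First turn: ordinary generation.
        first = llm.generate(prompts, sampling_params)
        drafts = [out.outputs[0].text for out in first]

        # Second turn: feed each draft back with a follow-up question.
        follow_up = "\nAre you sure? Give your final answer.\n"
        second = llm.generate(
            [p + d + follow_up for p, d in zip(prompts, drafts)],
            sampling_params,
        )

        # The full rollout (draft + follow-up + revision) is what the
        # trainer scores with the reward functions.
        return [
            d + follow_up + out.outputs[0].text
            for d, out in zip(drafts, second)
        ]
```

Because the environment owns the loop, arbitrary multi-turn or tool-use schemes fit the same shape.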
## CodeAgentEnvironment

A built-in `CodeAgentEnvironment` is included, which can be used with either:

Example setup:
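The original setup snippet is not shown, so here is a sketch. The import path and constructor arguments are assumed from this PR rather than a released TRL version, and `LocalExecutor` is a hypothetical stand-in for whichever code executor you use:

```python
import subprocess
import sys

from transformers import AutoTokenizer

# Import path and constructor arguments are assumptions based on this
# PR, not a released TRL API.
from trl.environment import CodeAgentEnvironment


class LocalExecutor:
    """Hypothetical executor: runs a code string, returns its output."""

    def execute(self, code: str) -> str:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=10,
        )
        return result.stdout + result.stderr


tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

env = CodeAgentEnvironment(
    code_executor=LocalExecutor(),  # runs code blocks the model emits
    tokenizer=tokenizer,
    parsing_string="<code>",        # tag that opens a code block
    stop_string="</code>",          # generation stops here so code can run
)
```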
## Running the Environment

You can use the environment manually with a vLLM server to test and observe its behavior.

Start a vLLM server:

```bash
trl vllm-serve --model "your_model"
```
Then run the agent:
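For example, via TRL's `VLLMClient`; the `env.generate` call signature below is an assumption from this PR, not a stable API:

```python
# Manual test loop. Assumes the `env` built in the setup above; the
# `env.generate` signature is an assumption from this PR.
from trl.extras.vllm_client import VLLMClient

client = VLLMClient()  # connects to the local vLLM server by default

prompts = ["What is 13 * 7? Use the code interpreter.\n"]
completions = env.generate(vllm_client=client, prompts=prompts)

# Inspect the raw rollout, tool calls and tool outputs included.
print(completions[0])
```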
Or, use the environment for training:
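A training sketch with `GRPOTrainer` follows; passing the environment through an `env` argument is my reading of this PR and may not match the merged interface, and the reward function is a rough GSM8K answer check written for illustration:

```python
# Training sketch: the `env` argument to GRPOTrainer is assumed from
# this PR and is not part of a released TRL version.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")  # GRPO expects "prompt"


def correct_answer(completions, answer, **kwargs):
    # Illustrative reward: 1.0 if the gold number appears in the
    # completion (gsm8k answers end with "#### <number>").
    rewards = []
    for completion, gold in zip(completions, answer):
        target = gold.split("####")[-1].strip()
        rewards.append(1.0 if target in completion else 0.0)
    return rewards


trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=correct_answer,
    args=GRPOConfig(output_dir="grpo-code-agent", use_vllm=True),
    train_dataset=dataset,
    env=env,  # assumed argument name from this PR
)
trainer.train()
```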
## Related Work
While I was finishing this PR, ByteDance published "ReTool: Reinforcement Learning for Strategic Tool Use in LLMs".
This PR with `CodeAgentEnvironment` is a similar idea, with two key differences:

## Other Notes
This PR also adds a `stop_string` parameter to the vLLM CLI; it's critical for agentic workflows where you want to cleanly delimit model outputs.
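For example (the flag spelling is an assumption derived from the parameter name, not a documented option):

```bash
trl vllm-serve --model "your_model" --stop_string "</code>"
```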
## Demo Run

To test, I set up a simple math task with a one-shot prompt for `Qwen-0.5B-instruct`. The reward function checked whether the agent used the code interpreter during its response.
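As a sketch, such a reward could look like the following; the function name mirrors the logged metric below, and treating `"<code>"` as the call marker is an assumption about the demo's prompt format:

```python
def count_function_calls_per_response(completions, **kwargs):
    # One point per code-interpreter call. Using "<code>" as the call
    # marker is an assumption, not confirmed by the PR.
    return [float(c.count("<code>")) for c in completions]
```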
The metric to watch is `train/rewards/count_function_calls_per_response/mean` (a larger `n` gives lower noise).

Results:
The base model had a 14% code interpreter call frequency.
After 24 training steps, the frequency increased to almost 100%.