Feature request
Proposed change: add an option to `PPOConfig` called `save_value_model: bool = False`. If true, `PPOTrainer`'s `save_model` will set `self.model` to the value model and save it after saving `self.model.policy` as normal.
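
To make the proposal concrete, here is a minimal sketch of what the change could look like. It assumes `self.model` wraps the policy (`self.model.policy`, as above) and the value model (`self.model.value_model`, an assumed attribute name), and that `super().save_model` is the plain `transformers.Trainer.save_model`; the `value_model` subfolder is also just an illustration, not a settled layout.

```python
import os

from transformers import Trainer


class PPOTrainer(Trainer):  # excerpt; all other methods omitted
    def save_model(self, output_dir=None, _internal_call=False):
        backup_model = self.model  # wrapper holding policy + value model

        # Save the policy, as save_model does today.
        self.model = backup_model.policy
        super().save_model(output_dir, _internal_call)

        # Proposed addition: also save the value model in a subfolder.
        if self.args.save_value_model:
            self.model = backup_model.value_model  # assumed attribute
            value_dir = os.path.join(output_dir or self.args.output_dir, "value_model")
            super().save_model(value_dir, _internal_call)

        self.model = backup_model  # restore the original wrapper
```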
Motivation
Right now, the PPOTrainer trains a classification model which takes a state and outputs an estimate of how much reward it expects the policy to earn from that state.
After training, the value model is discarded.
That seems really weird to me! It seems extremely useful to have a classifier model which can predict how well your text generation model will do on a given prompt. If the value model predicts a poor outcome before the policy even generates a response, you can take precautions.
For instance:
- If the value model predicts that `model1` would do poorly, you can switch to `model2`, which may do better (a pre-generation check along these lines is sketched after this list).
- You can show a content warning for responses expected to bother the user. For instance, if the user asks "What's the best religion and why?", you could have a pop-up window saying "You have asked a sensitive question. Please note that the views of the model do not reflect the views of `our_corp`, and that we do our best to satisfy all our customers and stakeholders." (The question has no good answer: if the AI refuses to answer, that bothers people, and if it gives an answer, many people will be upset, so the value model would give that prompt a low value.)
- The value model is an interesting object to research, and allowing users to save it would facilitate that research.
Your contribution
I'm eager to make contributions to trl, so if these changes would be helpful, I'm happy to implement them!