Hi, I am trying to understand the code. I would like to try RL training on tool calling in an interactive environment.
As I understand it, the reward is calculated by some custom reward function for a particular dataset. In other words, the flow of data during PPO is like this:
```mermaid
graph TD
    DatasetExample --> InferenceRollout --> RewardFunction --> UpdateGradients
```
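For context, the kind of per-dataset reward function I have in mind looks roughly like this; the signature and the answer extraction are my own approximation, not necessarily verl's exact interface:

```python
# Rough sketch of a per-dataset reward function as I understand the flow;
# the signature and the answer extraction here are my own assumptions.
import re

def compute_score(solution_str: str, ground_truth: str) -> float:
    """Score a completed rollout against the dataset's ground truth."""
    # Placeholder extraction: take the last number in the model's output.
    matches = re.findall(r"-?\d+\.?\d*", solution_str)
    if not matches:
        return 0.0
    return 1.0 if matches[-1] == ground_truth else 0.0
```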
But the rollout inference step here is a one-shot input/output function. If online tool calling were desired, we'd have to hook the llm.generate function here, right?
https://github.com/volcengine/verl/blob/main/verl/workers/rollout/vllm_rollout/vllm_rollout.py#L181
Then we could inject function calling. But I'm confused because the inference engine is not an ordinary vLLM LLM class, but a subclass that monkey-patches the output to return tensors instead of the normal vLLM output format.
So what would be the best way to add dynamic function calling? Hook the generate method of vLLM's LLM class, then call LLM._post_process_output to convert the token_ids and logprobs from vLLM into torch tensors at the very end?
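Concretely, I'm imagining something along these lines, just as a rough sketch (parse_tool_call and run_tool are placeholders I made up, and I'm calling vLLM's plain LLM.generate here rather than verl's patched engine):

```python
# Rough sketch of the multi-turn loop I'm imagining. `parse_tool_call` and
# `run_tool` are placeholders, not existing verl/vLLM APIs.
import re
from typing import Optional

from vllm import LLM, SamplingParams

def parse_tool_call(text: str) -> Optional[str]:
    """Placeholder parser: look for a <tool_call>...</tool_call> span."""
    m = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    return m.group(1) if m else None

def run_tool(call: str) -> str:
    """Placeholder tool executor for the interactive environment."""
    return f"(result of {call})"

def multi_turn_generate(llm: LLM, prompt: str, params: SamplingParams,
                        max_turns: int = 4) -> str:
    """Generate, execute any requested tool, append its result, and continue
    until the model stops asking for tools or max_turns is reached."""
    conversation = prompt
    for _ in range(max_turns):
        output = llm.generate([conversation], params)[0]
        text = output.outputs[0].text
        conversation += text
        call = parse_tool_call(text)
        if call is None:  # no tool requested -> rollout is finished
            break
        conversation += f"\n<tool_result>{run_tool(call)}</tool_result>\n"
    # Only after the full trajectory is built would the token_ids / logprobs
    # be converted to torch tensors (e.g. via something like _post_process_output).
    return conversation
```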
Or is there a more obvious place to add this feature?