Conversation

Contributor

@AlecHenx AlecHenx commented Jun 10, 2025

Checklist Before Starting

  • Searched for similar PR(s).
  • Checked PR Title format
    • In format of: [modules] type: Title
    • modules are in fsdp, megatron, sglang, vllm, rollout, trainer, tests, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt
    • type is in feat, fix, doc, refactor, chore
    • can involve multiple modules, separated by , or space, like [megatron, fsdp] feat: xxx

What does this PR do?

  1. We implemented an MCP client manager that manages the connection with the MCP server, covering session multiplexing and rate limiting.
  2. We implemented a Search Tool with the MCP client and the Tavily (https://app.tavily.com/home) MCP server, which delivers the same capability as the Search R1 Tool.
  3. We offered a general MCP tool base for handling the execution logic.
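
As a rough illustration of the client-manager idea (one multiplexed session per MCP server; the class and method names here are assumptions for illustration, not the merged API):

class MCPClientManager:
    """Keeps one live session per MCP server and reuses it across tool calls."""

    def __init__(self):
        self._sessions = {}

    async def get_session(self, server_name: str, connect):
        # `connect` is a caller-supplied coroutine that opens a session to the server
        if server_name not in self._sessions:
            self._sessions[server_name] = await connect(server_name)
        return self._sessions[server_name]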

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

  1. Register a Tavily account
  2. Edit the mcp_server.json file by replacing url and auth_token. Alternatively, you can use your own MCP server following the instructions provided by FastMCP (https://gofastmcp.com/clients/transports#configuration-based-transports), which supports SSEServer, stdioServer, and streamHTTP
  3. Configure the mcp_tool_config.yaml file (see the sketch after this list):
    • mcp_server_config_path should point to the JSON file from step 2
    • tool_selected_list specifies the tools you need to register from the MCP server
  4. (Optional) Implement a concrete instance based on MCPBaseTool to parse the results returned by the server
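
Putting steps 2 and 3 together, a minimal mcp_tool_config.yaml could look roughly like this (a sketch assembled from the config snippets quoted in the review threads below; the class path, field nesting, and tool name are assumptions):

tools:
  - class_name: verl.tools.mcp_search_tool.MCPSearchTool  # hypothetical path
    config:
      rate_limit: 120
      timeout: 120
    mcp:
      mcp_servers_config_path: ./mcp_server.json
      # optional: register only a subset of the server's tools
      tool_selected_list:
        - tavily-search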

Details are listed in the tutorial: https://github.com/AlecHenx/ml-recipe/blob/main/Tutorial%20for%20MCP%20Tool%20in%20veRL.md

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Additional Info.

  • Issue Number: Fixes part of issue #1837 (Support MCP tool using for multi turn)
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide (https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
  • Apply pre-commit checks (https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs (https://github.com/volcengine/verl/tree/main/docs).
  • New CI unit test(s) are added to cover the code path.
  • Rely on existing unit tests on CI that covers the code path.

@CLAassistant

CLAassistant commented Jun 10, 2025

CLA assistant check
All committers have signed the CLA.

@AlecHenx AlecHenx changed the title tool feat: Add Search Tool implemented with MCP [tool feat]: Add Search Tool implemented with MCP Jun 11, 2025
config:
  rate_limit: 120
  timeout: 120
  use_mcp: true
Collaborator

I wonder if it would be better to introduce a tool type attribute in the tool configuration. This attribute could specify categories like "native" for tools implemented in Verl, "mcp" for MCP tools, "A2A" for A2A tools, or "langchain/langgraph" for related tools. Each type could then have its own distinct creation and registration logic.

Contributor Author

@AlecHenx AlecHenx Jun 13, 2025

Fixed. Native and MCP tools are supported for now.
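
For illustration, the resulting per-tool config could distinguish types roughly like this (a sketch only; the exact schema is whatever was merged):

config:
  type: mcp  # or "native" for tools implemented in verl
  rate_limit: 120
  timeout: 120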

mcp:
  mcp_servers_config_path: ./mcp_server.json
  # optional
  tool_selected_list:
Collaborator

The tool selection list appears somewhat unclear here. Why would we need an MCP tool selection list for just a single MCP tool? Perhaps it would make more sense to have it as a parallel configuration alongside tools? Alternatively, we could use the two mentioned above as a shared mcp_client_manager configuration within tools_config.

Contributor Author

@AlecHenx AlecHenx Jun 12, 2025

In fact, we register all tools from the MCP server when tool_selected_list is not set. There are two reasons for tool_selected_list:

  1. An MCP server may expose multiple tools, but the user may not want to use all of them, so we offer a tool_selected_list attribute for selecting a subset of tools to register.

  2. We implement general MCP tool usage, but also implement the search tool, which is a single tool.

I'll refactor the config setting!
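
A minimal sketch of the selection logic described above (list_tools() is a hypothetical client-manager method; all names are illustrative):

async def select_tools(client, tool_selected_list=None):
    """Register all server tools unless a subset is requested."""
    all_tools = await client.list_tools()  # hypothetical API
    if not tool_selected_list:
        return all_tools  # default: register every tool the server exposes
    wanted = set(tool_selected_list)
    return [t for t in all_tools if t.name in wanted]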

timeout: 120
use_mcp: true
mcp:
  mcp_servers_config_path: ./mcp_server.json
Collaborator

Would it be better to include the config_path directly within the YAML file?

Contributor Author

For now, most MCP server configurations are provided as a JSON file, including official MCP, FastMCP, Cursor, etc. This convention may facilitate users' migration across different platforms.

Collaborator

For now, most MCP server configurations are provided as a JSON file, including official MCP, FastMCP, Cursor, etc. This convention may facilitate users' migration across different platforms.

Agreed. If the JSON configuration already exists, it will be intuitive.
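
For reference, the JSON shape in question is the widely shared mcpServers layout (the same shape appears in a later comment in this thread; the server name and values below are placeholders):

{
  "mcpServers": {
    "tavily": {
      "url": "https://<your-mcp-endpoint>/sse",
      "transport": "sse",
      "auth_token": "<YOUR_TAVILY_TOKEN>"
    }
  }
}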

logger = logging.getLogger(__name__)


class TokenBucket:
Collaborator

Can we extract the token bucket as a common utility in the current PR?

Contributor Author

fixed
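
For context, a token bucket of the kind discussed here is typically a few lines; a minimal thread-safe sketch (generic, not necessarily the extracted verl utility):

import threading
import time

class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens added per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self, tokens: float = 1.0) -> bool:
        with self.lock:
            now = time.monotonic()
            # refill proportionally to elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False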

}
return instance_id

async def _async_execute(self, instance_id, parameters):
Collaborator

The naming could be improved, and a return type hint should be included.

Contributor Author

fixed
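
For illustration, the renamed method with a return type hint could look like this (the method appears later in this thread as _call_tool; the exact signature and return type here are assumptions):

class MCPBaseTool:
    async def _call_tool(self, instance_id: str, parameters: dict) -> str:
        """Execute the MCP call for this instance and return the response text."""
        ...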

if instance_id in self._instance_dict:
del self._instance_dict[instance_id]

def post_process(self, content: list):
Collaborator

As part of the execute semantics, it integrates the tool's return list into the tool's response, which undoubtedly impacts metrics and rewards.

Contributor Author

fixed
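
For context, the post-processing in question folds the MCP content list into one response string; a minimal sketch (illustrative, not the merged code):

def post_process(self, content: list) -> str:
    """Join the text parts of an MCP tool result into a single response string."""
    texts = [part.text for part in content if getattr(part, "type", None) == "text"]
    return "\n".join(texts)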

],
}

expect_turn_1_msg = {
Contributor

Would it be possible to include a test case for two-turn calls? Thank you!

Contributor Author

Of course, I've added that!

@AlecHenx AlecHenx requested a review from zhaochenyang20 as a code owner June 16, 2025 01:41
Collaborator

@SwordFaith SwordFaith left a comment

This is a good example of the "one tool per server" scenario. For more complex MCP workloads, let's track via issue and discuss further. A training script with wandb logging would also help more users try it.

Comment on lines +91 to +93
metrics = {"query_count": metadata.get("query_count", 0), "status": metadata.get("status", "unknown"), "total_results": metadata.get("total_results", 0), "api_request_error": metadata.get("api_request_error")}

return result_text, 0.0, metrics

As this is the base class for MCP Tools, shouldn't we return the raw metadata directly instead of extracting specific fields to construct metrics? The current approach not only causes metadata loss across different tools, but also arbitrarily restricts metadata keys, which seems unreasonable, right?

Contributor Author

Metrics are not used for training for now; they contain only some request information.

Collaborator

Metrics are not used for training for now; they contain only some request information.

Additional telemetry features and a new tool logger will be introduced in the upcoming retool PR. These enhancements aim to improve the tracking of MCP tool behavior.

Comment on lines 39 to 40
logger.info(f"Initialized MCPSearchTool with config: {config}")


References to 'search' should be removed from the base class, particularly in lines 39, 73, 96, and 97.

Contributor Author

nice catch!

@AlecHenx AlecHenx changed the title [tool feat]: Add Search Tool implemented with MCP [tool] feat: Add Search Tool implemented with MCP Jun 18, 2025
@AlecHenx AlecHenx requested a review from chenhaiq as a code owner June 19, 2025 08:25
@chenhaiq chenhaiq enabled auto-merge (squash) June 19, 2025 14:40
@chenhaiq chenhaiq merged commit b401382 into volcengine:main Jun 19, 2025
36 of 37 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 23, 2025
@eric-haibin-lin
Collaborator

Hi @AlecHenx would you mind adding the tutorial https://github.com/AlecHenx/ml-recipe/blob/main/Tutorial%20for%20MCP%20Tool%20in%20verl.md as a subsection in https://verl.readthedocs.io/en/latest/sglang_multiturn/multiturn.html? (docs/sglang_multiturn/multiturn.rst)

Tyizhanshen pushed a commit to HyperdriveHustle/verl that referenced this pull request Jul 1, 2025
@X1angyuLu

Hi @AlecHenx, thank you very much for this amazing PR! 🙌

However, I encountered an issue when trying to use the MCP search tool during PPO training.

Initially, everything worked smoothly — the model was able to call the tool correctly, and the tool results were returned properly after the user message. But after some time, the MCP search tool started failing with the following error, although the training continued normally (just without receiving the search results):

Tool execution failed: local variable 'call_tool_result' referenced before assignment

Eventually, the entire pipeline gets stuck and times out — although I don't think this is directly caused by the error above.

You can find the full logs and environment details at the links below:

We ran the training script as follows:

set -x

ulimit -n 65535

PROJECT_DIR="$(pwd)"
CONFIG_PATH="$PROJECT_DIR/examples/sglang_multiturn/config"
echo $CONFIG_PATH

TRAIN_DATA=~/data/searchR1_processed_nq/test.parquet
VAL_DATA=~/data/searchR1_processed_nq/test.parquet

TOOL_CONFIG="$CONFIG_PATH/tool_config/mcp_tool_config.yaml"

model_path=Qwen/Qwen2.5-1.5B-Instruct

python3 -m verl.trainer.main_ppo \
 --config-path="$CONFIG_PATH" \
 --config-name='search_multiturn_grpo' \
 algorithm.adv_estimator=grpo \
 data.train_batch_size=64 \
 data.val_batch_size=256 \
 data.max_prompt_length=4096 \
 data.max_response_length=3000 \
 data.filter_overlong_prompts=True \
 data.truncation='error' \
 data.return_raw_chat=True \
 actor_rollout_ref.model.path=$model_path \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.285 \
 actor_rollout_ref.model.use_remove_padding=True \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.actor.use_kl_loss=True \
 actor_rollout_ref.actor.kl_loss_coef=0.001 \
 actor_rollout_ref.actor.kl_loss_type=low_var_kl \
 actor_rollout_ref.actor.entropy_coeff=0 \
 actor_rollout_ref.model.enable_gradient_checkpointing=True \
 actor_rollout_ref.actor.fsdp_config.param_offload=False \
 actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
 actor_rollout_ref.rollout.max_model_len=15000 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.name=sglang \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
 actor_rollout_ref.rollout.n=5 \
 actor_rollout_ref.rollout.multi_turn.enable=True \
 actor_rollout_ref.rollout.multi_turn.max_assistant_turns=2 \
 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.ref.fsdp_config.param_offload=True \
 actor_rollout_ref.rollout.multi_turn.tool_config_path="$TOOL_CONFIG" \
 algorithm.use_kl_in_reward=False \
 trainer.critic_warmup=0 \
 trainer.val_before_train=False \
 trainer.logger=['console','wandb'] \
 trainer.project_name='search_r1_like_async_rl' \
 trainer.experiment_name='qwen2.5-1.5b-instruct_function_rm-search-async-sgl-multi-w-searchtool' \
 trainer.n_gpus_per_node=2 \
 trainer.nnodes=1 \
 trainer.save_freq=1500 \
 trainer.test_freq=1500 \
 data.train_files="$TRAIN_DATA" \
 data.val_files="$VAL_DATA" \
 trainer.total_epochs=1 $@

@AlecHenx
Contributor Author

AlecHenx commented Jul 10, 2025

This is because a large number of tool calls exceeded Tavily's rate limit, which caused an exception to be thrown before call_tool_result received the search results. I will make the following two optimizations:

  1. Improve the exception handling in mcp_base_tool.py to allow viewing the details of exceptions when calling the Tavily tool.
  2. Optimize the current rate limit module by changing it from a quantity limit to a time limit.

For now, you can try decreasing the rate_limit value in mcp_tool_config.yaml.
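
A minimal sketch of the first optimization, so the unbound call_tool_result can no longer surface (an MCPBaseTool method sketch; client_manager.call_tool is a hypothetical API and the real change in mcp_base_tool.py may differ):

async def _call_tool(self, instance_id: str, parameters: dict) -> str:
    call_tool_result = None
    try:
        # hypothetical client-manager call; the real invocation may differ
        call_tool_result = await self.client_manager.call_tool(self.name, parameters, timeout=self.timeout)
    except Exception as e:
        # surface the underlying error instead of failing on an unbound variable
        logger.warning("MCP tool call failed (instance %s): %r", instance_id, e)
        return f"Tool execution failed: {e}"
    return self.post_process(call_tool_result.content)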

@X1angyuLu

Hi @AlecHenx, thanks for your response!

Yes, I believe Tavily's rate limit is indeed the direct cause of the Tool execution failed: local variable 'call_tool_result' referenced before assignment error. To verify this, I implemented a simple dummy MCP search tool and ran it with the same PPO training script — the error no longer occurred.

However, I still encountered the hang issue during the later stage of training: the entire pipeline keeps waiting for the tool result until it times out. During this period, the GPU remains fully allocated but with low power usage, indicating that it's not actively generating or updating. I suspect this is an issue between the MCP tool and the rollout logic, since the hang occurs with both the Tavily search tool and my dummy implementation.


Here is the traceback from the hanging case.

The failure appears to occur during a dist.barrier() call inside sglang_rollout.py.

Full traceback:
Traceback (most recent call last):
  File "/home/user/verl/verl/trainer/main_ppo.py", line 58, in main
    run_ppo(config)
  File "/home/user/verl/verl/trainer/main_ppo.py", line 81, in run_ppo
    ray.get(runner.run.remote(config))
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=117929, ip=172.20.114.81, actor_id=0538147587cda890d70d48fb01000000, repr=<main_ppo.TaskRunner object at 0x7f79c2e71ea0>)
  File "/home/user/verl/verl/trainer/main_ppo.py", line 231, in run
    trainer.fit()
  File "/home/user/verl/verl/trainer/ppo/ray_trainer.py", line 1144, in fit
    gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
  File "/home/user/verl/verl/single_controller/ray/base.py", line 51, in __call__
    output = ray.get(output)
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=135815, ip=172.20.114.81, actor_id=a40c8581e44e0cb416f6a75801000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f2d33f7aa40>)
  File "/home/user/verl/verl/single_controller/ray/base.py", line 708, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/home/user/verl/verl/single_controller/base/decorator.py", line 549, in inner
    return func(*args, **kwargs)
  File "/home/user/verl/verl/workers/fsdp_workers.py", line 738, in generate_sequences
    output = self.rollout.generate_sequences(prompts=prompts)
  File "/home/user/verl/verl/utils/profiler/performance.py", line 89, in f
    return self.log(decorated_function, *args, **kwargs)
  File "/home/user/verl/verl/utils/profiler/performance.py", line 102, in log
    output = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/verl/verl/workers/rollout/sglang_rollout/sglang_rollout.py", line 542, in generate_sequences
    return self._req_level_generate_sequences(prompts, **kwargs)
  File "/home/user/verl/verl/utils/profiler/performance.py", line 89, in f
    return self.log(decorated_function, *args, **kwargs)
  File "/home/user/verl/verl/utils/profiler/performance.py", line 102, in log
    output = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/verl/verl/workers/rollout/sglang_rollout/sglang_rollout.py", line 1049, in _req_level_generate_sequences
    dist.barrier()
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4556, in barrier
    work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete

From my debugging, I’ve confirmed that the hang happens specifically at this line:

await self._call_tool(instance_id, parameters)

Temporarily replacing it with:

await asyncio.wait_for(self._call_tool(instance_id, parameters), self.timeout)

does prevent the hang, but it harms performance — many valid tool calls are skipped due to timeout. Notably, this issue does not occur when using non-MCP functional tools like gsm8k_tool.


  1. Could it be a server-side freeze?
    Unlikely — I can still access the MCP server normally during the hang.

  2. Could it be related to the transport protocol?
    I tried stdio, http, and sse, and the hang still happens with all of them.


I found some possibly related issues:


Here are my full logs and dummy tool setup:

MCP Server

from fastmcp import FastMCP

mcp = FastMCP("tavily_search_dummy_tool")

@mcp.tool()
async def tavily_search(query: str):
    metadata = {
        "query": query,
        "results": "No results found"
    }
    return metadata

if __name__ == "__main__":
    # FastMCP's run() is blocking and manages its own event loop,
    # so no asyncio.run() wrapper is needed
    mcp.run(transport='sse', port=50052)

MCP Client

import json
import logging
import os
from typing import Tuple
from verl.tools.mcp_base_tool import MCPBaseTool
from .schemas import OpenAIFunctionToolSchema

logger = logging.getLogger(__name__)
logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))

class MCPDummyTool(MCPBaseTool):
    def __init__(self, config: dict, tool_schema: OpenAIFunctionToolSchema):
        super().__init__(config, tool_schema)

    def _parse_tool_result(self, content: list) -> Tuple[str, dict]:
        data = json.loads(content[0].text)
        result = data["results"]
        metadata = data
        metadata["api_request_error"] = ""
        return result, metadata

MCP Server Config

{
  "mcpServers": {
    "dummy_server": {
      "url": "http://127.0.0.1:50052/sse/",
      "transport": "sse",
      "auth_token": "dummy"
    }
  }
}

# Terminal 1: start the dummy server
python verl/tools/mcp_dummy_search_tool_server.py

# Terminal 2: launch PPO training

Thank you again for your great support!

@AlecHenx
Contributor Author

> Hi @AlecHenx, thanks for your response! [...]

Thank you for providing the details of the hang problem; I'll have a look at it.
