Conversation

Contributor

@AlecHenx AlecHenx commented Jun 10, 2025

Checklist Before Starting

  • Searched for similar PR(s).
  • Checked PR Title format
    • In format of: [modules] type: Title
    • modules are in fsdp, megatron, sglang, vllm, rollout, trainer, tests, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt
    • type is in feat, fix, doc, refactor, chore
    • can involve multiple modules, separated by , or space, like [megatron, fsdp] feat: xxx

What does this PR do?

  1. We implemented an MCP client manager that manages the connection with the MCP server, covering session multiplexing and rate limiting.
  2. We implemented a Search Tool with the MCP client and the Tavily (https://app.tavily.com/home) MCP server, which delivers the same capability as the Search R1 Tool.
  3. We offered a general MCP tool base for handling the execution logic.
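
As a rough illustration of the client-manager idea (one multiplexed session per MCP server; the class and method names here are assumptions for illustration, not the merged API):

class MCPClientManager:
    """Keeps one live session per MCP server and reuses it across tool calls."""

    def __init__(self):
        self._sessions = {}

    async def get_session(self, server_name: str, connect):
        # `connect` is a caller-supplied coroutine that opens a session to the server
        if server_name not in self._sessions:
            self._sessions[server_name] = await connect(server_name)
        return self._sessions[server_name]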

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

  1. Register a Tavily account
  2. Edit the mcp_server.json file by replacing url and auth_token. Alternatively, you can use your own MCP server following the instructions provided by FastMCP (https://gofastmcp.com/clients/transports#configuration-based-transports), which supports SSEServer, stdioServer, and streamHTTP
  3. Configure the mcp_tool_config.yaml file (see the sketch after this list):
    • mcp_server_config_path should point to the JSON file from step 2
    • tool_selected_list specifies the tools you need to register from the MCP server
  4. (Optional) Implement a concrete instance based on MCPBaseTool to parse the results returned by the server
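
Putting steps 2 and 3 together, a minimal mcp_tool_config.yaml could look roughly like this (a sketch assembled from the config snippets quoted in the review threads below; the class path, field nesting, and tool name are assumptions):

tools:
  - class_name: verl.tools.mcp_search_tool.MCPSearchTool  # hypothetical path
    config:
      rate_limit: 120
      timeout: 120
    mcp:
      mcp_servers_config_path: ./mcp_server.json
      # optional: register only a subset of the server's tools
      tool_selected_list:
        - tavily-search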

Details are listed in the tutorial: https://github.com/AlecHenx/ml-recipe/blob/main/Tutorial%20for%20MCP%20Tool%20in%20veRL.md

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Additional Info.

  • Issue Number: Fixes part of issue #1837 (Support MCP tool using for multi turn)
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide (https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
  • Apply pre-commit checks (https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs (https://github.com/volcengine/verl/tree/main/docs).
  • New CI unit test(s) are added to cover the code path.
  • Rely on existing unit tests on CI that covers the code path.

@CLAassistant

CLAassistant commented Jun 10, 2025

CLA assistant check
All committers have signed the CLA.

@AlecHenx AlecHenx changed the title tool feat: Add Search Tool implemented with MCP [tool feat]: Add Search Tool implemented with MCP Jun 11, 2025
config:
  rate_limit: 120
  timeout: 120
  use_mcp: true
Collaborator

I wonder if it would be better to introduce a tool type attribute in the tool configuration. This attribute could specify categories like "native" for tools implemented in Verl, "mcp" for MCP tools, "A2A" for A2A tools, or "langchain/langgraph" for related tools. Each type could then have its own distinct creation and registration logic.

Contributor Author

@AlecHenx AlecHenx Jun 13, 2025

Fixed. Native and MCP tools are supported for now.
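
For illustration, the resulting per-tool config could distinguish types roughly like this (a sketch only; the exact schema is whatever was merged):

config:
  type: mcp  # or "native" for tools implemented in verl
  rate_limit: 120
  timeout: 120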

mcp:
  mcp_servers_config_path: ./mcp_server.json
  # optional
  tool_selected_list:
Collaborator

The tool selection list appears somewhat unclear here. Why would we need an MCP tool selection list for just a single MCP tool? Perhaps it would make more sense to have it as a parallel configuration alongside tools? Alternatively, we could use the two mentioned above as a shared mcp_client_manager configuration within tools_config.

Contributor Author

@AlecHenx AlecHenx Jun 12, 2025

In fact, we register all tools from the MCP server when tool_selected_list is not set. There are two reasons for tool_selected_list:

  1. An MCP server may expose multiple tools, but the user may not want to use all of them, so we offer a tool_selected_list attribute for selecting a subset of tools to register.

  2. We implement general MCP tool usage, but also implement the search tool, which is a single tool.

I'll refactor the config setting!
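
A minimal sketch of the selection logic described above (list_tools() is a hypothetical client-manager method; all names are illustrative):

async def select_tools(client, tool_selected_list=None):
    """Register all server tools unless a subset is requested."""
    all_tools = await client.list_tools()  # hypothetical API
    if not tool_selected_list:
        return all_tools  # default: register every tool the server exposes
    wanted = set(tool_selected_list)
    return [t for t in all_tools if t.name in wanted]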

timeout: 120
use_mcp: true
mcp:
  mcp_servers_config_path: ./mcp_server.json
Collaborator

Would it be better to include the config_path directly within the YAML file?

Contributor Author

For now, most MCP server configurations are provided as a JSON file, including official MCP, FastMCP, Cursor, etc. This convention may facilitate users' migration across different platforms.

Collaborator

For now, most MCP server configurations are provided as a JSON file, including official MCP, FastMCP, Cursor, etc. This convention may facilitate users' migration across different platforms.

Agreed. If the JSON configuration already exists, it will be intuitive.
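
For reference, the JSON shape in question is the widely shared mcpServers layout (the same shape appears in a later comment in this thread; the server name and values below are placeholders):

{
  "mcpServers": {
    "tavily": {
      "url": "https://<your-mcp-endpoint>/sse",
      "transport": "sse",
      "auth_token": "<YOUR_TAVILY_TOKEN>"
    }
  }
}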

logger = logging.getLogger(__name__)


class TokenBucket:
Collaborator

Can we extract the token bucket as a common utility in the current PR?

Contributor Author

fixed
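
For context, a token bucket of the kind discussed here is typically a few lines; a minimal thread-safe sketch (generic, not necessarily the extracted verl utility):

import threading
import time

class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens added per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self, tokens: float = 1.0) -> bool:
        with self.lock:
            now = time.monotonic()
            # refill proportionally to elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False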

}
return instance_id

async def _async_execute(self, instance_id, parameters):
Collaborator

The naming could be improved, and a return type hint should be included.

Contributor Author

fixed
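
For illustration, the renamed method with a return type hint could look like this (the method appears later in this thread as _call_tool; the exact signature and return type here are assumptions):

class MCPBaseTool:
    async def _call_tool(self, instance_id: str, parameters: dict) -> str:
        """Execute the MCP call for this instance and return the response text."""
        ...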

if instance_id in self._instance_dict:
del self._instance_dict[instance_id]

def post_process(self, content: list):
Collaborator

As part of the execute semantics, it integrates the tool's return list into the tool's response, which undoubtedly impacts metrics and rewards.

Contributor Author

fixed
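
For context, the post-processing in question folds the MCP content list into one response string; a minimal sketch (illustrative, not the merged code):

def post_process(self, content: list) -> str:
    """Join the text parts of an MCP tool result into a single response string."""
    texts = [part.text for part in content if getattr(part, "type", None) == "text"]
    return "\n".join(texts)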

],
}

expect_turn_1_msg = {
Contributor

Would it be possible to include a test case for two-turn calls? Thank you!

Contributor Author

Of course, I've added that!

@AlecHenx AlecHenx requested a review from zhaochenyang20 as a code owner June 16, 2025 01:41
Collaborator

@SwordFaith SwordFaith left a comment

This is a good example of the "one tool per server" scenario. For more complex MCP workloads, let's track via issue and discuss further. A training script with wandb logging would also help more users try it.

Comment on lines +91 to +93
metrics = {"query_count": metadata.get("query_count", 0), "status": metadata.get("status", "unknown"), "total_results": metadata.get("total_results", 0), "api_request_error": metadata.get("api_request_error")}

return result_text, 0.0, metrics

As this is the base class for MCP Tools, shouldn't we return the raw metadata directly instead of extracting specific fields to construct metrics? The current approach not only causes metadata loss across different tools, but also arbitrarily restricts metadata keys, which seems unreasonable, right?

Contributor Author

Metrics are not used for training for now; they contain only some request information.

Collaborator

Metrics are not used for training for now; they contain only some request information.

Additional telemetry features and a new tool logger will be introduced in the upcoming retool PR. These enhancements aim to improve the tracking of MCP tool behavior.

Comment on lines 39 to 40
logger.info(f"Initialized MCPSearchTool with config: {config}")


References to 'search' should be removed from the base class, particularly in lines 39, 73, 96, and 97.

Contributor Author

nice catch!

@AlecHenx AlecHenx changed the title [tool feat]: Add Search Tool implemented with MCP [tool] feat: Add Search Tool implemented with MCP Jun 18, 2025
@AlecHenx AlecHenx requested a review from chenhaiq as a code owner June 19, 2025 08:25
@chenhaiq chenhaiq enabled auto-merge (squash) June 19, 2025 14:40
@chenhaiq chenhaiq merged commit b401382 into volcengine:main Jun 19, 2025
36 of 37 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 23, 2025
@eric-haibin-lin
Collaborator

Hi @AlecHenx would you mind adding the tutorial https://github.com/AlecHenx/ml-recipe/blob/main/Tutorial%20for%20MCP%20Tool%20in%20verl.md as a subsection in https://verl.readthedocs.io/en/latest/sglang_multiturn/multiturn.html? (docs/sglang_multiturn/multiturn.rst)

Tyizhanshen pushed a commit to HyperdriveHustle/verl that referenced this pull request Jul 1, 2025
@X1angyuLu

Hi @AlecHenx, thank you very much for this amazing PR! 🙌

However, I encountered an issue when trying to use the MCP search tool during PPO training.

Initially, everything worked smoothly — the model was able to call the tool correctly, and the tool results were returned properly after the user message. But after some time, the MCP search tool started failing with the following error, although the training continued normally (just without receiving the search results):

Tool execution failed: local variable 'call_tool_result' referenced before assignment

Eventually, the entire pipeline gets stuck and times out — although I don't think this is directly caused by the error above.

You can find the full logs and environment details at the links below:

We ran the training script as follows:

set -x

ulimit -n 65535

PROJECT_DIR="$(pwd)"
CONFIG_PATH="$PROJECT_DIR/examples/sglang_multiturn/config"
echo $CONFIG_PATH

TRAIN_DATA=~/data/searchR1_processed_nq/test.parquet
VAL_DATA=~/data/searchR1_processed_nq/test.parquet

TOOL_CONFIG="$CONFIG_PATH/tool_config/mcp_tool_config.yaml"

model_path=Qwen/Qwen2.5-1.5B-Instruct

python3 -m verl.trainer.main_ppo \
 --config-path="$CONFIG_PATH" \
 --config-name='search_multiturn_grpo' \
 algorithm.adv_estimator=grpo \
 data.train_batch_size=64 \
 data.val_batch_size=256 \
 data.max_prompt_length=4096 \
 data.max_response_length=3000 \
 data.filter_overlong_prompts=True \
 data.truncation='error' \
 data.return_raw_chat=True \
 actor_rollout_ref.model.path=$model_path \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.285 \
 actor_rollout_ref.model.use_remove_padding=True \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.actor.use_kl_loss=True \
 actor_rollout_ref.actor.kl_loss_coef=0.001 \
 actor_rollout_ref.actor.kl_loss_type=low_var_kl \
 actor_rollout_ref.actor.entropy_coeff=0 \
 actor_rollout_ref.model.enable_gradient_checkpointing=True \
 actor_rollout_ref.actor.fsdp_config.param_offload=False \
 actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
 actor_rollout_ref.rollout.max_model_len=15000 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.name=sglang \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
 actor_rollout_ref.rollout.n=5 \
 actor_rollout_ref.rollout.multi_turn.enable=True \
 actor_rollout_ref.rollout.multi_turn.max_assistant_turns=2 \
 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.ref.fsdp_config.param_offload=True \
 actor_rollout_ref.rollout.multi_turn.tool_config_path="$TOOL_CONFIG" \
 algorithm.use_kl_in_reward=False \
 trainer.critic_warmup=0 \
 trainer.val_before_train=False \
 trainer.logger=['console','wandb'] \
 trainer.project_name='search_r1_like_async_rl' \
 trainer.experiment_name='qwen2.5-1.5b-instruct_function_rm-search-async-sgl-multi-w-searchtool' \
 trainer.n_gpus_per_node=2 \
 trainer.nnodes=1 \
 trainer.save_freq=1500 \
 trainer.test_freq=1500 \
 data.train_files="$TRAIN_DATA" \
 data.val_files="$VAL_DATA" \
 trainer.total_epochs=1 $@

@AlecHenx
Contributor Author

AlecHenx commented Jul 10, 2025

This is because a large number of tool calls exceeded Tavily's rate limit, which caused an exception to be thrown before call_tool_result received the search results. I will make the following two optimizations:

  1. Improve the exception handling in mcp_base_tool.py to allow viewing the details of exceptions when calling the Tavily tool.
  2. Optimize the current rate limit module by changing it from a quantity limit to a time limit.

For now, you can try decreasing the rate_limit value in mcp_tool_config.yaml.
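
A minimal sketch of the first optimization, so the unbound call_tool_result can no longer surface (an MCPBaseTool method sketch; client_manager.call_tool is a hypothetical API and the real change in mcp_base_tool.py may differ):

async def _call_tool(self, instance_id: str, parameters: dict) -> str:
    call_tool_result = None
    try:
        # hypothetical client-manager call; the real invocation may differ
        call_tool_result = await self.client_manager.call_tool(self.name, parameters, timeout=self.timeout)
    except Exception as e:
        # surface the underlying error instead of failing on an unbound variable
        logger.warning("MCP tool call failed (instance %s): %r", instance_id, e)
        return f"Tool execution failed: {e}"
    return self.post_process(call_tool_result.content)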

@X1angyuLu

Hi @AlecHenx, thanks for your response!

Yes, I believe Tavily's rate limit is indeed the direct cause of the Tool execution failed: local variable 'call_tool_result' referenced before assignment error. To verify this, I implemented a simple dummy MCP search tool and ran it with the same PPO training script — the error no longer occurred.

However, I still encountered the hang issue during the later stage of training: the entire pipeline keeps waiting for the tool result until it times out. During this period, the GPU remains fully allocated but with low power usage, indicating that it's not actively generating or updating. I suspect this is an issue between the MCP tool and the rollout logic, since the hang occurs with both the Tavily search tool and my dummy implementation.


Here is the traceback from the hanging case.

The failure appears to occur during a dist.barrier() call inside sglang_rollout.py.

Full traceback:
Traceback (most recent call last):
  File "/home/user/verl/verl/trainer/main_ppo.py", line 58, in main
    run_ppo(config)
  File "/home/user/verl/verl/trainer/main_ppo.py", line 81, in run_ppo
    ray.get(runner.run.remote(config))
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=117929, ip=172.20.114.81, actor_id=0538147587cda890d70d48fb01000000, repr=<main_ppo.TaskRunner object at 0x7f79c2e71ea0>)
  File "/home/user/verl/verl/trainer/main_ppo.py", line 231, in run
    trainer.fit()
  File "/home/user/verl/verl/trainer/ppo/ray_trainer.py", line 1144, in fit
    gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
  File "/home/user/verl/verl/single_controller/ray/base.py", line 51, in __call__
    output = ray.get(output)
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=135815, ip=172.20.114.81, actor_id=a40c8581e44e0cb416f6a75801000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f2d33f7aa40>)
  File "/home/user/verl/verl/single_controller/ray/base.py", line 708, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/home/user/verl/verl/single_controller/base/decorator.py", line 549, in inner
    return func(*args, **kwargs)
  File "/home/user/verl/verl/workers/fsdp_workers.py", line 738, in generate_sequences
    output = self.rollout.generate_sequences(prompts=prompts)
  File "/home/user/verl/verl/utils/profiler/performance.py", line 89, in f
    return self.log(decorated_function, *args, **kwargs)
  File "/home/user/verl/verl/utils/profiler/performance.py", line 102, in log
    output = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/verl/verl/workers/rollout/sglang_rollout/sglang_rollout.py", line 542, in generate_sequences
    return self._req_level_generate_sequences(prompts, **kwargs)
  File "/home/user/verl/verl/utils/profiler/performance.py", line 89, in f
    return self.log(decorated_function, *args, **kwargs)
  File "/home/user/verl/verl/utils/profiler/performance.py", line 102, in log
    output = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/verl/verl/workers/rollout/sglang_rollout/sglang_rollout.py", line 1049, in _req_level_generate_sequences
    dist.barrier()
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/home/user/anaconda3/envs/verl/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4556, in barrier
    work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete

From my debugging, I’ve confirmed that the hang happens specifically at this line:

await self._call_tool(instance_id, parameters)

Temporarily replacing it with:

await asyncio.wait_for(self._call_tool(instance_id, parameters), self.timeout)

does prevent the hang, but it harms performance — many valid tool calls are skipped due to timeout. Notably, this issue does not occur when using non-MCP functional tools like gsm8k_tool.


  1. Could it be a server-side freeze?
    Unlikely — I can still access the MCP server normally during the hang.

  2. Could it be related to the transport protocol?
    I tried stdio, http, and sse, and the hang still happens with all of them.


I found some possibly related issues:


Here are my full logs and dummy tool setup:

MCP Server

from fastmcp import FastMCP

mcp = FastMCP("tavily_search_dummy_tool")

@mcp.tool()
async def tavily_search(query: str):
    metadata = {
        "query": query,
        "results": "No results found"
    }
    return metadata

if __name__ == "__main__":
    # FastMCP's run() is blocking and manages its own event loop,
    # so no asyncio.run() wrapper is needed
    mcp.run(transport='sse', port=50052)

MCP Client

import json
import logging
import os
from typing import Tuple
from verl.tools.mcp_base_tool import MCPBaseTool
from .schemas import OpenAIFunctionToolSchema

logger = logging.getLogger(__name__)
logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))

class MCPDummyTool(MCPBaseTool):
    def __init__(self, config: dict, tool_schema: OpenAIFunctionToolSchema):
        super().__init__(config, tool_schema)

    def _parse_tool_result(self, content: list) -> Tuple[str, dict]:
        data = json.loads(content[0].text)
        result = data["results"]
        metadata = data
        metadata["api_request_error"] = ""
        return result, metadata

MCP Server Config

{
  "mcpServers": {
    "dummy_server": {
      "url": "http://127.0.0.1:50052/sse/",
      "transport": "sse",
      "auth_token": "dummy"
    }
  }
}

# Terminal 1: start the dummy server
python verl/tools/mcp_dummy_search_tool_server.py

# Terminal 2: launch PPO training

Thank you again for your great support!

@AlecHenx
Contributor Author

> Hi @AlecHenx, thanks for your response! [...]

Thank you for providing the details of the hang problem; I'll have a look at it.
