Streaming model outputs #1236


Merged: 37 commits merged into main on Apr 24, 2025
Conversation

@aymeric-roucher (Collaborator) commented Apr 22, 2025:

Implement streaming model outputs, to let users see their model's thoughts displayed live.

Tested for:

  • OpenAI
  • InferenceProviders
  • LiteLLM
  • TransformersModel

Streaming was not implemented for these models and is left for future PRs:

  • VLLMModel
  • MLXModel
  • AzureOpenAIServerModel
  • AmazonBedrockServerModel
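
For illustration, a minimal usage sketch of the new flag (hypothetical: the model id and task are placeholders; stream_outputs is the constructor argument added in this PR):

# Hypothetical usage sketch of the stream_outputs flag added in this PR.
# The model id and task are placeholders; adjust to your own setup.
from smolagents import CodeAgent, InferenceClientModel

model = InferenceClientModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")  # example model id
agent = CodeAgent(tools=[], model=model, stream_outputs=True)

# With stream_outputs=True, the model's thoughts are displayed live as they are generated.
agent.run("What is the 10th Fibonacci number?")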

@aymeric-roucher force-pushed the streaming-model-outputs branch from aaba9d4 to fc4546a on April 22, 2025 at 14:13
@@ -184,7 +185,7 @@ class MultiStepAgent(ABC):
def __init__(
self,
tools: List[Tool],
model: Callable[[List[Dict[str, str]]], ChatMessage],
model: Model,
@aymeric-roucher (Collaborator, author) commented Apr 22, 2025:

If we add the option to stream model outputs, model cannot just be a Callable returning a ChatMessage.
This means we'll have to edit the parts of the doc that show how to create a Model, to explain how to inherit from the base Model class instead of directly creating a Callable.
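
For instance, the documented pattern could look roughly like this (a sketch only: the import path, the ChatMessage constructor, and the generate signature are assumed from this PR's diff, and my_backend_completion is a hypothetical stand-in):

# Sketch of a custom model inheriting from the base Model class instead of
# being a bare Callable; names and signatures are assumptions, not the final docs.
from smolagents.models import ChatMessage, MessageRole, Model


def my_backend_completion(messages) -> str:
    # Hypothetical stand-in for a call to your own inference backend.
    return "final_answer(42)"


class MyCustomModel(Model):
    def generate(self, messages, stop_sequences=None, **kwargs) -> ChatMessage:
        # Wrap the raw text from the backend into the ChatMessage the agent expects.
        text = my_backend_completion(messages)
        return ChatMessage(role=MessageRole.ASSISTANT, content=text)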

@@ -340,7 +342,7 @@ def run(

def _run(
self, task: str, max_steps: int, images: List["PIL.Image.Image"] | None = None
) -> Generator[ActionStep | AgentType, None, None]:
) -> Generator[ActionStep | FinalAnswerStep, None, None]:
@aymeric-roucher (Collaborator, author):

Fixing this type hint.

@@ -350,12 +352,14 @@ def _run(
if self.planning_interval is not None and (
self.step_number == 1 or (self.step_number - 1) % self.planning_interval == 0
):
planning_step = self._create_planning_step(
planning_step = self._generate_planning_step(
@aymeric-roucher (Collaborator, author):

"Generate" is better IMO since an LLM output is generated in this function: it's not simply about creating an empty object.

@@ -375,9 +379,6 @@ def _run(
yield action_step
yield FinalAnswerStep(handle_agent_output_types(final_answer))

def _create_action_step(self, step_start_time: float, images: List["PIL.Image.Image"] | None) -> ActionStep:
@aymeric-roucher (Collaborator, author):

This method is not useful anymore and obscures the workflow

except Exception as e:
raise AgentParsingError(f"Error while generating or parsing output:\n{e}", self.logger) from e
if self.stream_outputs:
raise NotImplementedError("Stream outputs are not yet implemented for ToolCallingAgent")
@aymeric-roucher (Collaborator, author):

Streaming output with ToolCallingAgent implies streaming ChoiceDeltaToolCallFunction objects from various APIs, which is worth another PR.

@@ -44,7 +44,7 @@ class MemoryStep:
def dict(self):
return asdict(self)

def to_messages(self, **kwargs) -> List[Dict[str, Any]]:
def to_messages(self, summary_mode: bool = False) -> List[Message]:
@aymeric-roucher (Collaborator, author):

Harmonize the API for all to_messages methods

response.choices[0].message.model_dump(include={"role", "content", "tool_calls"}),
raw=response,
)
return self.postprocess_message(first_message, tools_to_call_from)

def generate_stream(
@aymeric-roucher (Collaborator, author):

New generate_stream methods. Once we've set up streaming for ToolCallingAgent, the generate method will simply be able to call generate_stream and return the final completion.
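
For illustration, that future simplification could look roughly like this (a sketch assuming a subclass implements generate_stream and that CompletionDelta exposes a content field, as in this PR's diff):

# Hypothetical sketch: once streaming also works for ToolCallingAgent, generate()
# could simply consume generate_stream() and assemble the final completion.
from smolagents.models import ChatMessage, MessageRole, Model


class StreamingBackedModel(Model):
    def generate(self, messages, **kwargs) -> ChatMessage:
        # Accumulate the streamed text deltas into one final ChatMessage.
        content = ""
        for delta in self.generate_stream(messages, **kwargs):
            if delta.content:
                content += delta.content
        return ChatMessage(role=MessageRole.ASSISTANT, content=content)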

@aymeric-roucher marked this pull request as ready for review on April 22, 2025 at 18:23
@albertvillanova (Member) left a comment:

Thanks for the contributions! There's a lot of great work here. Having so many changes bundled into a single PR does make it a bit challenging to review thoroughly, but I appreciate the effort.

These are just some initial comments; I'll continue reviewing the rest of the PR shortly, once you tell me no more changes are coming in...

@@ -377,7 +391,26 @@ def __call__(
Returns:
`ChatMessage`: A chat message object containing the model's response.
"""
pass # To be implemented in child classes!
raise NotImplementedError("This method must be implemented in child classes")
@albertvillanova (Member):

What about defining Model as an abstract class and decorating this method as abstractmethod?

@aymeric-roucher (Collaborator, author):

For generate we could! For generate_stream, however, it will sometimes be implemented by child classes, sometimes not, so making it an abstract method would prevent proper initialization. Do we prefer to make only generate an abstract method, or keep the common implementation that only raises NotImplementedError in both methods?
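
For reference, a minimal sketch of the variant under discussion, assuming only generate is made abstract (the return types are quoted forward references so the stub stays self-contained):

# Sketch of the design being discussed: Model as an ABC where generate is
# abstract, while generate_stream keeps a default NotImplementedError so that
# child classes without streaming support can still be instantiated.
from abc import ABC, abstractmethod
from typing import Generator


class Model(ABC):
    @abstractmethod
    def generate(self, messages, **kwargs) -> "ChatMessage":
        ...

    def generate_stream(self, messages, **kwargs) -> Generator["CompletionDelta", None, None]:
        # Optional capability: only some child classes implement streaming.
        raise NotImplementedError("Streaming is not implemented for this model class")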

@albertvillanova (Member):

I think we can delete generate_stream here (see comment above).

aymeric-roucher and others added 2 commits April 23, 2025 11:41
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@aymeric-roucher (Collaborator, author):

@albertvillanova it's only minor changes now, you can review

@albertvillanova (Member) left a comment:

Some comments to maintain backward compatibility: users may pass a Callable as model.

aymeric-roucher and others added 5 commits April 23, 2025 15:11
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
@albertvillanova (Member) left a comment:

Another batch of comments...
Sorry, it's difficult to go through more than 1,000 modified lines...

Comment on lines +457 to +462


def has_implemented_method(instance, parent_class, method_name: str) -> bool:
instance_method = getattr(instance.__class__, method_name, None)
parent_method = getattr(parent_class, method_name, None)
return instance_method is not parent_method
@albertvillanova (Member):

No longer used:

Suggested change
def has_implemented_method(instance, parent_class, method_name: str) -> bool:
instance_method = getattr(instance.__class__, method_name, None)
parent_method = getattr(parent_class, method_name, None)
return instance_method is not parent_method

@aymeric-roucher (Collaborator, author):

It was a mistake not to be using it: we do need it as a check in the init. Just reintroduced it!

@albertvillanova (Member):

I think this is a very hacky way to check if the model has the generate_stream method.

My suggestion:

  • as this method is optional, the parent Model should not have it (see discussion about setting generate as abstractmethod, but not generate_stream: Streaming model outputs #1236 (comment)).
  • we can remove this hacky method
  • we can just check whether the model has a generate_stream method: hasattr(self.model, "generate_stream")

**completion_kwargs, stream=True, stream_options={"include_usage": True}
):
if event.choices:
if event.choices[0].delta is None:
@albertvillanova (Member):

Have you tested this? I'm wondering whether event.choices[0].delta can be None or is always a class instance.

Anyway, maybe we could add some tests for generate_stream.

@aymeric-roucher (Collaborator, author):

Just added tests for generate_stream in LiteLLMModel, InferenceClientModel, and TransformersModel.
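
For illustration, a test in that spirit might look roughly like this (hypothetical: the model id, message format, and assertions are placeholders and may differ from the tests actually added in the PR):

# Hypothetical sketch of a generate_stream test; not the exact test added in the PR.
from smolagents import InferenceClientModel


def test_generate_stream_yields_text_deltas():
    model = InferenceClientModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")  # example model id
    messages = [{"role": "user", "content": [{"type": "text", "text": "Say hello."}]}]

    chunks = [delta.content for delta in model.generate_stream(messages) if delta.content]

    # Streaming should produce at least one non-empty text delta that joins into a string.
    assert len(chunks) > 0
    assert isinstance("".join(chunks), str)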

@albertvillanova (Member):

I have manually checked your tests for Transformers and InferenceClient and the condition event.choices[0].delta is None is never fulfilled.

@aymeric-roucher (Collaborator, author):

Have you checked it with LiteLLMModel, gpt-4?

Comment on lines +402 to +413
def parse_tool_calls(self, message: ChatMessage) -> ChatMessage:
"""Sometimes APIs do not return the tool call as a specific object, so we need to parse it."""
message.role = MessageRole.ASSISTANT # Overwrite role if needed
if not message.tool_calls:
assert message.content is not None, "Message contains no content and no tool calls"
message.tool_calls = [
get_tool_call_from_text(message.content, self.tool_name_key, self.tool_arguments_key)
]
assert len(message.tool_calls) > 0, "No tool call was found in the model output"
for tool_call in message.tool_calls:
tool_call.function.arguments = parse_json_if_needed(tool_call.function.arguments)
return message
@albertvillanova (Member):

Why do we need this for streaming when we didn't need it before? Maybe I'm missing something...

@aymeric-roucher (Collaborator, author):

It's more of a simplification: cf. this comment.

Comment on lines +483 to +490
from vllm import LLM # type: ignore
from vllm.transformers_utils.tokenizer import get_tokenizer # type: ignore

self.model_kwargs = {
**(model_kwargs or {}),
"model": model_id,
}
self.model_kwargs = model_kwargs or {}
super().__init__(**kwargs)
self.model_id = model_id
self.model = LLM(**self.model_kwargs)
self.model = LLM(model=model_id, **self.model_kwargs)
assert self.model is not None
@albertvillanova (Member):

I think these changes are not related to streaming. But is the assert necessary here? I mean, any model is prone to receiving None as model_id...

self.tokenizer = get_tokenizer(model_id)
self._is_vlm = False # VLLMModel does not support vision models yet.

def cleanup(self):
import gc

import torch
from vllm.distributed.parallel_state import destroy_distributed_environment, destroy_model_parallel
from vllm.distributed.parallel_state import ( # type: ignore
@albertvillanova (Member):

Why this # type: ignore? We are not enforcing static type checking...

@aymeric-roucher (Collaborator, author) commented Apr 23, 2025:

It will improve readability for everyone using static type checking. If you're against that, we can also remove it!

Comment on lines 530 to 532
for message in messages:
if not isinstance(message["content"], str):
message["content"] = message["content"][0]["text"]
@albertvillanova (Member):

Why do we need this now and not before?

@aymeric-roucher (Collaborator, author) commented Apr 23, 2025:

It was a dirty fix for an error that I missed: just fixed it.

@albertvillanova (Member) left a comment:

Another batch of reviews.

Thanks for your contribution.

for event in self.client.completion(**completion_kwargs, stream=True, stream_options={"include_usage": True}):
if event.choices:
if event.choices[0].delta is None:
if not event.choices[0].finish_reason == "stop":
@albertvillanova (Member):

Simplify the logic:

Suggested change
if not event.choices[0].finish_reason == "stop":
if event.choices[0].finish_reason != "stop":

yield CompletionDelta(
content=event.choices[0].delta.content,
)
if getattr(event, "usage", None):
@albertvillanova (Member):

This condition can only happen if the condition above is False; is this assumption right?

Suggested change
if getattr(event, "usage", None):
elif getattr(event, "usage", None):

@aymeric-roucher (Collaborator, author) commented Apr 23, 2025:

Maybe some messages contain both some content and usage, so we would need to catch both using the double if instead of if/elif.
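
As a toy illustration of why two separate if branches matter here (SimpleNamespace objects stand in for real stream events; field names are illustrative):

# Toy example: a single streamed event carrying both a content delta and usage
# info would lose one of the two under an if/elif chain, hence the double `if`.
from types import SimpleNamespace

event = SimpleNamespace(
    choices=[SimpleNamespace(delta=SimpleNamespace(content="Hello"), finish_reason=None)],
    usage=SimpleNamespace(prompt_tokens=3, completion_tokens=1),
)

collected, usage_seen = [], False
if event.choices and event.choices[0].delta.content:
    collected.append(event.choices[0].delta.content)
if getattr(event, "usage", None):  # second `if`, not `elif`, so both branches can fire
    usage_seen = True

assert collected == ["Hello"] and usage_seen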

Comment on lines -533 to -537
if tools_to_call_from:
chat_message.tool_calls = [
get_tool_call_from_text(output_text, self.tool_name_key, self.tool_arguments_key)
]
return chat_message
@albertvillanova (Member):

Why do we no longer need to set .tool_calls attribute?

@aymeric-roucher (Collaborator, author):

Because now this will be handled directly in the ToolCallingAgent.step method by parse_tool_calls!


default_max_tokens = 5000
default_max_tokens = 4096
@albertvillanova (Member):

Any reason for this change?

@aymeric-roucher (Collaborator, author):

Powers of 2 are always better!

@@ -787,44 +825,51 @@ def __call__(
or kwargs.get("max_tokens")
or self.kwargs.get("max_new_tokens")
or self.kwargs.get("max_tokens")
or 1024
@albertvillanova (Member):

Do we want to hardcode this value?

@aymeric-roucher (Collaborator, author):

I'm actually not sure: in case it's not filled, should we leave this to the underlying model/API?

"""Sometimes APIs do not return the tool call as a specific object, so we need to parse it."""
message.role = MessageRole.ASSISTANT # Overwrite role if needed
if not message.tool_calls:
assert message.content is not None, "Message contains no content and no tool calls"
@albertvillanova (Member):

Unlike before, we can now raise an error here. Is this intended?

@aymeric-roucher (Collaborator, author):

Yes: either the model returns a tool call or it returns some text, but it should return at least one.

message.tool_calls = [
get_tool_call_from_text(message.content, self.tool_name_key, self.tool_arguments_key)
]
assert len(message.tool_calls) > 0, "No tool call was found in the model output"
@albertvillanova (Member):

Unlike before, we can now raise an error here. Is this intended?

@aymeric-roucher (Collaborator, author):

Yes: it will help the model correct its output!

def __call__(self, *args, **kwargs):
return self.generate(*args, **kwargs)

def parse_tool_calls(self, message: ChatMessage) -> ChatMessage:
@albertvillanova (Member):

This function seems to replace the previous postprocess_message. However, this new function is only called by ToolCallingAgent.step, whereas the previous postprocess_message was called by all API models (__call__ method). Is this intended?

@aymeric-roucher (Collaborator, author):

Yes: the idea is that we now more clearly separate:

  1. Generation: the Model just generates text. Sometimes, depending on the API/Model, it can contain pre-defined tool_calls in the dedicated attribute.
  2. Parsing: parse_tool_calls (replacing postprocess_message), which, if there is no tool call so far, fills the tool_calls attribute using tool calls parsed from the text (a toy sketch of this two-phase flow follows below).
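
A toy sketch of that two-phase flow (simplified stand-ins, not the actual smolagents classes or call sites):

# Toy illustration of the generation/parsing split described above.
import json
from dataclasses import dataclass, field


@dataclass
class ToyMessage:
    content: str
    tool_calls: list = field(default_factory=list)


def generate(prompt: str) -> ToyMessage:
    # 1. Generation: the model only produces text (or API-provided tool calls).
    return ToyMessage(content='{"name": "final_answer", "arguments": {"answer": "42"}}')


def parse_tool_calls(message: ToyMessage) -> ToyMessage:
    # 2. Parsing: if no tool call object was returned, recover one from the text.
    if not message.tool_calls:
        message.tool_calls = [json.loads(message.content)]
    return message


message = parse_tool_calls(generate("What is 6 * 7?"))
assert message.tool_calls[0]["name"] == "final_answer"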

@aymeric-roucher merged commit 113d8c9 into main on Apr 24, 2025
5 checks passed
@albertvillanova (Member) left a comment:

Another batch of reviews, done before today's modifications.

prompt_templates: Optional[PromptTemplates] = None,
grammar: Optional[Dict[str, str]] = None,
additional_authorized_imports: Optional[List[str]] = None,
planning_interval: Optional[int] = None,
executor_type: str | None = "local",
executor_kwargs: Optional[Dict[str, Any]] = None,
max_print_outputs_length: Optional[int] = None,
stream_outputs: bool = False,
@albertvillanova (Member):

What about calling the param just stream, as in the OpenAI spec for Chat completion create?

@aymeric-roucher (Collaborator, author):

This is a difficult question: for a chat completion, stream is obviously about streaming model outputs.
For an agent, what do you stream: agent steps? (as in agent.run() with stream=True)
Since here it's about streaming outputs, I put that in the name stream_outputs, but maybe there's an even more intuitive API.

prompt_templates ([`~agents.PromptTemplates`], *optional*): Prompt templates.
grammar (`dict[str, str]`, *optional*): Grammar used to parse the LLM output.
additional_authorized_imports (`list[str]`, *optional*): Additional authorized imports for the agent.
planning_interval (`int`, *optional*): Interval at which the agent will run a planning step.
executor_type (`str`, default `"local"`): Which executor type to use between `"local"`, `"e2b"`, or `"docker"`.
executor_kwargs (`dict`, *optional*): Additional arguments to pass to initialize the executor.
max_print_outputs_length (`int`, *optional*): Maximum length of the print outputs.
stream_outputs (`bool`, *optional*, default `False`): Whether to stream outputs during execution.
@albertvillanova (Member):

In docstrings, optional means default None.

Suggested change
stream_outputs (`bool`, *optional*, default `False`): Whether to stream outputs during execution.
stream_outputs (`bool`, default `False`): Whether to stream outputs during execution.


Comment on lines +396 to +398
def generate_stream(self, *args, **kwargs) -> Generator[CompletionDelta, None, None]:
raise NotImplementedError("This method must be implemented in child classes")

@albertvillanova (Member):

Suggested change
def generate_stream(self, *args, **kwargs) -> Generator[CompletionDelta, None, None]:
raise NotImplementedError("This method must be implemented in child classes")

Comment on lines +1229 to +1234
self.stream_outputs = stream_outputs
can_stream = has_implemented_method(self.model, Model, "generate_stream")
if self.stream_outputs and not can_stream:
raise ValueError(
"`stream_outputs` is set to True, but the model class implements no `generate_stream` method."
)
@albertvillanova (Member):

Suggested change
self.stream_outputs = stream_outputs
can_stream = has_implemented_method(self.model, Model, "generate_stream")
if self.stream_outputs and not can_stream:
raise ValueError(
"`stream_outputs` is set to True, but the model class implements no `generate_stream` method."
)
if stream_outputs and not hasattr(self.model, "generate_stream"):
raise ValueError(
"`stream_outputs` is set to True, but the model class implements no `generate_stream` method."
)
self.stream_outputs = stream_outputs
