-
Notifications
You must be signed in to change notification settings - Fork 2k
Fix and refactor final answer checks #1448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
final_answer = self._handle_max_steps_reached(task, images) | ||
yield action_step | ||
yield FinalAnswerStep(handle_agent_output_types(final_answer)) | ||
|
||
def _execute_step(self, memory_step: ActionStep) -> Generator[ChatMessageStreamDelta | FinalOutput]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think porting this logic into a separate function obfuscates the logic more than it clarifies.
src/smolagents/agents.py
Outdated
@@ -236,6 +238,7 @@ def __init__( | |||
tools: list[Tool], | |||
model: Model, | |||
prompt_templates: PromptTemplates | None = None, | |||
instructions: str | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is from #1442, will not appear after merging pr 1442.
a2e464e
to
fc7cc89
Compare
b5d3d18
to
54dbde6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the final answer handling in agents and updates related tests to use PIL.Image
instances and the renamed step_number
field.
- Refactored
ActionStep.dict()
to serialize images, addis_final_answer
, and replace"step"
with"step_number"
. - Replaced the old
FinalOutput
type withActionOutput
and updated streaming logic inagents.py
. - Updated tests to use
Image.new(...)
, renamed keys in dict assertions, and added scenarios for final answer checks.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
tests/test_memory.py | Switched from string paths to PIL.Image objects, renamed "step" to "step_number" , added checks for observations_images . |
tests/test_agents.py | Added a test case for final_answer_checks , updated max_steps usage, and renamed planning step variables. |
src/smolagents/memory.py | Updated ActionStep.dict() to output step_number , serialize observations_images , and include is_final_answer . |
src/smolagents/agents.py | Removed FinalOutput , introduced ActionOutput , and refactored the streaming loop to validate and flag final answers. |
Comments suppressed due to low confidence (2)
tests/test_memory.py:101
- Consider adding an assertion to verify that
action_step_dict["observations_images"]
contains the expected byte data (e.g., matchingImage.new(...).tobytes()
).
assert "observations_images" in action_step_dict
tests/test_memory.py:203
- Add assertions in
test_task_step_to_messages
to confirm that the generated messages include an image entry (e.g., check for a content dict with"type": "image"
).
task_step = TaskStep(task="This is a task.", task_images=[Image.new("RGB", (100, 100))])
@@ -110,8 +110,9 @@ def populate_template(template: str, variables: dict[str, Any]) -> str: | |||
|
|||
|
|||
@dataclass | |||
class FinalOutput: | |||
output: Any | None | |||
class ActionOutput: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ActionOutput
class is missing the @dataclass
decorator, so instances won’t accept constructor arguments. Add @dataclass
above this class definition.
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it isn't, go home copilot you're drunk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing these issues.
Not sure about the need to replace FinalOutput
with ActionOutput
and the need to add the is_final_answer
...
tests/test_agents.py
Outdated
agent = CodeAgent( | ||
model=FakeCodeModel(), | ||
tools=[], | ||
final_answer_checks=[lambda x, y: x == 7.2904], | ||
verbosity_level=1000, | ||
) | ||
output = agent.run("Dummy task.") | ||
assert output == 7.2904 # Check that output is correct | ||
assert len([step for step in agent.memory.steps if isinstance(step, ActionStep)]) == 2 | ||
assert "Error raised in check" not in str(agent.write_memory_to_messages()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test passes before the fixes introduced in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, just fixed it to not be passing before!
e0b2e39
to
ca5a539
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, the ActionOutput
class will be:
- Either
ActionOutput(output: Any, is_final_answer=True)
for final answer - Or
ActionOutput(output=None, is_final_answer=False)
for non final answer
IMHO it would be simpler to return:
- Either
FinalOutput(output: Any)
for final answer - Or
None
for non final answer
I find the ActionOutput
is non-optimal because it adds complexity without adding relevant information, but feel free to ignore it!
@Albert pasting my slack message here for visibility:
|
@aymeric-roucher I think when streaming, users are already consuming The alternative approach just adds a None: |
This refactor fixes #1440 and refactors the logic to address underlying issues such as "agent cannot return None in final_answer"