[Feature Request] Explore native image handling method #2928

@Wendong-Fan

Description

Required prerequisites

Motivation

Our primary method for handling image-related tasks is to have an agent execute a tool call. This call delegates the image analysis to a secondary agent or a vision-capable Large Language Model (LLM), which then returns the relevant information.

Optional Solutions for Future Exploration:

  • Embedded Image Processing: An alternative approach would be to send the user's message directly to ChatAgent.step. If an image is detected, a built-in function would be triggered to handle the image input, streamlining the process.

  • Sequence Validation: To ensure the conversational message history remains valid, we could programmatically add a placeholder (or "dummy") message to the tool's output.

  • Agent Cloning: Inside ChatAgent.step, clone the registered agent together with the original agent's memory, let the cloned agent step on the image, and return its response.
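
To make the three options above easier to discuss, here is a minimal sketch of how they could fit together. It does not use CAMEL's actual API: `Message`, `SketchAgent`, `clone_with_memory`, and `_describe_images` are hypothetical names invented for illustration, the vision call is a stub, and the placeholder tool message reflects only one possible reading of the sequence validation idea.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical message type; not CAMEL's own message class."""
    role: str  # "user", "assistant", or "tool"
    content: str
    image_paths: list[str] = field(default_factory=list)


@dataclass
class SketchAgent:
    """Toy stand-in for a chat agent; every name here is illustrative."""
    memory: list[Message] = field(default_factory=list)

    def _describe_images(self, msg: Message) -> str:
        # Stub standing in for a call to a vision-capable model.
        return f"[description of {len(msg.image_paths)} image(s)]"

    def clone_with_memory(self) -> SketchAgent:
        # Deep copy so the clone's work cannot mutate the original history.
        return SketchAgent(memory=copy.deepcopy(self.memory))

    def step(self, msg: Message) -> Message:
        self.memory.append(msg)
        if msg.image_paths:
            # Embedded image processing: image detected inside step itself.
            # Agent cloning: a clone with the same memory handles the image.
            clone = self.clone_with_memory()
            description = clone._describe_images(msg)
            # Sequence validation (assumed reading): record a placeholder
            # tool message so the history stays a well-formed sequence.
            self.memory.append(Message("tool", description))
            reply = Message("assistant", f"Image analysis: {description}")
        else:
            reply = Message("assistant", f"Echo: {msg.content}")
        self.memory.append(reply)
        return reply
```

Calling `SketchAgent().step(Message("user", "What is this?", ["cat.png"]))` would route through the image branch and leave a user, tool, and assistant message in memory, while a text-only message skips the clone entirely. Whether the clone's own history should be merged back into the original agent is left open here, as it is in the proposal.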

Solution

No response

Alternatives

No response

Additional context

No response

Metadata

Assignees

Labels

CAMEL 2.0, P1 (Task with middle level priority), enhancement (New feature or request)

Projects

Status

No status

Milestone

Relationships

None yet

Development

No branches or pull requests