About the order of instruction and image

Thank you for your excellent work! I have a question that I hope you can answer.
The [README](https://github.com/PKU-YuanGroup/UniWorld-V1?tab=readme-ov-file#:~:text=We%20find%20that%20multimodal%20features%20encoded%20by%20VLMs%20can%20interpret%20instructions%20while%20retaining%20image%20priors.%20Due%20to%20causal%20attention%2C%20the%20format%20%3Cinstruction%3E%3Cimage%3E%20is%20particularly%20important.) says:
> We find that multimodal features encoded by VLMs can interpret instructions while retaining image priors. Due to causal attention, the format \<instruction\>\<image\> is particularly important.

Therefore, when editing images, the input should **place the image after the instruction**. However, the editing samples in the dataset are formatted as follows:

> head -n 30 results_replace_laion_part4_edit.json 
> [
>     {
>         "image": [
>             "00024_00036_000367969/original.png",
>             "00024_00036_000367969/result.png"
>         ],
>         "conversations": [
>             {
>                 "from": "human",
>                 "value": "<image>\nreplace car located towards the upper-right corner of the image with a red motorcycle"
>             },
>             {
>                 "from": "gpt",
>                 "value": "<gen_image>"
>             }
>         ]
>     },

Here the input has the **image placed before the instruction**, which seems to contradict the statement quoted above. Am I misunderstanding something? I would sincerely appreciate your guidance.
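
To make the mismatch concrete, here is a minimal sketch of how I am reading the sample. It only inspects the `value` string of the human turn and reports whether the `<image>` placeholder comes before or after the instruction text. The helper names `placeholder_position` and `move_image_after_instruction` are my own and hypothetical; I do not know whether the repository's data loader reorders the prompt in a similar way, which is exactly what I am asking.

```python
IMAGE_TOKEN = "<image>"  # placeholder string as it appears in the dataset sample


def placeholder_position(value: str) -> str:
    """Classify where the first <image> placeholder sits relative to the text."""
    idx = value.find(IMAGE_TOKEN)
    if idx == -1:
        return "no_image"
    before = value[:idx].strip()
    after = value[idx + len(IMAGE_TOKEN):].strip()
    if before and after:
        return "interleaved"
    if before:
        return "instruction_first"  # <instruction><image>, the order the README describes
    if after:
        return "image_first"        # <image><instruction>, the order in the dataset sample
    return "image_only"


def move_image_after_instruction(value: str) -> str:
    """Hypothetical reordering: '<image>\\ninstruction' -> 'instruction\\n<image>'.

    Assumes a single leading placeholder; other layouts are returned unchanged.
    """
    if placeholder_position(value) != "image_first":
        return value
    instruction = value.replace(IMAGE_TOKEN, "", 1).strip()
    return f"{instruction}\n{IMAGE_TOKEN}"


sample = (
    "<image>\nreplace car located towards the upper-right corner "
    "of the image with a red motorcycle"
)
print(placeholder_position(sample))
print(move_image_after_instruction(sample))
```

If the training code already performs a reordering like this before tokenization, the dataset format and the README would be consistent; if not, I would like to understand which order the model actually sees.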


About the order of instruction and image #14