About the order of instruction and image #14

@hutaiHang

Thank you for your excellent work! I have a question that I hope you can answer.
It says here:

We find that multimodal features encoded by VLMs can interpret instructions while retaining image priors. Due to causal attention, the format <instruction><image> is particularly important.

So for image editing, the input should place the image after the instruction. However, the editing samples in the dataset are formatted as follows:

head -n 30 results_replace_laion_part4_edit.json
[
  {
    "image": [
      "00024_00036_000367969/original.png",
      "00024_00036_000367969/result.png"
    ],
    "conversations": [
      {
        "from": "human",
        "value": "\nreplace car located towards the upper-right corner of the image with a red motorcycle"
      },
      {
        "from": "gpt",
        "value": "<gen_image>"
      }
    ]
  },

Here the input places the image before the instruction, which seems to contradict the statement above. What am I misunderstanding? I would appreciate your guidance.
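
To make the two orderings concrete, here is a minimal sketch of what I mean. It assumes a LLaVA-style `<image>` placeholder marks where the image tokens are spliced into the prompt. `IMAGE_TOKEN` and `human_prompt` are hypothetical names of my own, not taken from the repository, so the actual preprocessing may differ.

```python
IMAGE_TOKEN = "<image>"  # assumption: LLaVA-style placeholder, not confirmed by the repo


def human_prompt(sample, image_first):
    """Build the human-turn text with the image placeholder before or after the instruction."""
    instruction = next(
        turn["value"] for turn in sample["conversations"] if turn["from"] == "human"
    )
    # Drop any placeholder already present so the order depends only on image_first.
    instruction = instruction.replace(IMAGE_TOKEN, "").strip()
    parts = [IMAGE_TOKEN, instruction] if image_first else [instruction, IMAGE_TOKEN]
    return "\n".join(parts)


# The sample record from the dataset file quoted above.
sample = {
    "image": [
        "00024_00036_000367969/original.png",
        "00024_00036_000367969/result.png",
    ],
    "conversations": [
        {
            "from": "human",
            "value": "\nreplace car located towards the upper-right corner "
                     "of the image with a red motorcycle",
        },
        {"from": "gpt", "value": "<gen_image>"},
    ],
}

# Ordering I read from the dataset sample: image, then instruction.
print(human_prompt(sample, image_first=True))
# Ordering described in the quoted statement: instruction, then image.
print(human_prompt(sample, image_first=False))
```

With causal attention each token attends only to earlier positions, so in the second ordering the image tokens can attend to the instruction, while in the first they cannot. That is why the order in the dataset sample confuses me.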
