Thank you for your excellent work! I have a question that I hope you can answer.
You write:
We find that multimodal features encoded by VLMs can interpret instructions while retaining image priors. Due to causal attention, the format <instruction><image> is particularly important.
So for image editing, the input should place the image after the instruction. However, the editing samples in the dataset are formatted as follows:
```
head -n 30 results_replace_laion_part4_edit.json
[
  {
    "image": [
      "00024_00036_000367969/original.png",
      "00024_00036_000367969/result.png"
    ],
    "conversations": [
      {
        "from": "human",
        "value": "\nreplace car located towards the upper-right corner of the image with a red motorcycle"
      },
      {
        "from": "gpt",
        "value": "<gen_image>"
      }
    ]
  },
```
In this sample, the image appears to be placed before the instruction, which seems to contradict the statement above. Am I misunderstanding something? I would appreciate your guidance.
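
To check the ordering across a whole sample file rather than a single entry, a minimal sketch follows. It assumes a LLaVA-style `<image>` placeholder inside the human turn, which the sample pasted above does not confirm (the placeholder may have been lost when pasting). `IMAGE_TOKEN` and `image_position` are hypothetical names, not part of the repository.

```python
import json

# Assumed placeholder string; the actual token used by the dataset is not confirmed.
IMAGE_TOKEN = "<image>"


def image_position(sample, token=IMAGE_TOKEN):
    """Report where the image placeholder sits in the first human turn.

    Returns "image_first", "instruction_first", "mixed", or "absent".
    """
    for turn in sample.get("conversations", []):
        if turn.get("from") != "human":
            continue
        text = turn.get("value", "")
        idx = text.find(token)
        if idx == -1:
            return "absent"
        before = text[:idx].strip()
        after = text[idx + len(token):].strip()
        if after and not before:
            return "image_first"        # <image> then instruction
        if before and not after:
            return "instruction_first"  # instruction then <image>
        return "mixed"                  # text on both sides, or token only
    return "absent"


def summarize(path):
    """Count placeholder positions over every sample in a JSON list file."""
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    counts = {}
    for sample in samples:
        key = image_position(sample)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Calling `summarize("results_replace_laion_part4_edit.json")` would give a count per ordering. If every sample reports `absent`, the placeholder is probably inserted by the data loader rather than stored in the JSON, and the ordering would then have to be checked in the loading code instead.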