Labels: mieb (The image extension of MTEB)
Description
- The dataset has many duplicates, e.g. https://huggingface.co/datasets/Ahren09/MMSoc_HatefulMemes/viewer/default/train?q=meanwhile+at+the+isis+strip+club. These are currently not accounted for, so I think we waste compute encoding the same images and texts. Can we just dedup based on exact image equality? (There is also code at https://github.com/Muennighoff/vilio/blob/50eb7cc9c901795c394070f7705fc0c5a7667bd2/utils/pandas_scripts.py#L61 that dedups using image hashes.) We could reupload a clean version under mteb/ on HF.
- IIUC the current code assumes there is one match per image or text, but this is not the case because of (a) the duplicates mentioned above, and (b) HatefulMemes containing the same caption and the same image multiple times, in different image/caption combinations. This adversarial construction is one of the main selling points of the dataset's difficulty. Below is a simple example of the same caption paired with two different images (from here).
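The exact-duplicate dedup suggested above could be sketched roughly as follows. This is a minimal illustration, not the vilio script: it assumes each example exposes its raw image bytes under a hypothetical `image_bytes` key and its caption under `text`, and drops repeats of the same (image, caption) pair by hashing the image content.

```python
import hashlib

def digest(data: bytes) -> str:
    """Content hash of the raw image bytes."""
    return hashlib.sha256(data).hexdigest()

def dedup_exact(examples):
    """Keep only the first occurrence of each (image, caption) pair.

    `examples` is an iterable of dicts with hypothetical keys
    "image_bytes" (raw bytes) and "text" (the caption).
    """
    seen, keep = set(), []
    for ex in examples:
        key = (digest(ex["image_bytes"]), ex["text"])
        if key not in seen:
            seen.add(key)
            keep.append(ex)
    return keep
```

Note this only removes byte-identical images; the perceptual image hashes used in the linked vilio script would additionally catch re-encoded near-duplicates.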
So I guess two options are:
a - Dedup captions only (this will leave some duplicate images paired with different captions, but should remove all images that are complete duplicates)
b - Allow multiple labels, i.e. mark every image a caption appears with as relevant (in this case it is probably worth deduping exact-duplicate images as well)
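Option (b) amounts to building a one-to-many relevance mapping instead of assuming one match per query. A rough sketch, under the same hypothetical `image_bytes`/`text` keys as above: collect, for each unique caption, the set of distinct images it co-occurs with, so the evaluation can score any of them as a correct retrieval.

```python
import hashlib
from collections import defaultdict

def build_multilabel_qrels(examples):
    """Map each unique caption to the set of distinct images it pairs with.

    Image identity is taken to be a content hash of the raw bytes, so
    exact-duplicate images collapse to one relevant document.
    """
    qrels = defaultdict(set)
    for ex in examples:
        image_id = hashlib.sha256(ex["image_bytes"]).hexdigest()
        qrels[ex["text"]].add(image_id)
    return qrels
```

Captions mapped to more than one image id are exactly the adversarial cases described above, where the same text appears with different images.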