
Problems with HatefulMemesRetrieval #2510

@Muennighoff

  1. The dataset contains many duplicates, e.g. https://huggingface.co/datasets/Ahren09/MMSoc_HatefulMemes/viewer/default/train?q=meanwhile+at+the+isis+strip+club. These are currently not accounted for, so I think we waste compute encoding the same images and texts. Can we just dedup based on exact image matches? (Otherwise there is also code at https://github.com/Muennighoff/vilio/blob/50eb7cc9c901795c394070f7705fc0c5a7667bd2/utils/pandas_scripts.py#L61 that uses image hashes.) We could then re-upload a clean version under mteb/ on HF.
  2. IIUC the current code assumes there is exactly one match per image or text, but this is not the case because of (a) the duplicates mentioned above and (b) HatefulMemes containing the same caption and the same image multiple times in different image/caption combinations. This adversarial construction is one of the main reasons the dataset is difficult. Below is a simple example of the same caption with two different images (from here)
[Image: the same caption paired with two different images]

So I guess the two options are:
a - Dedup on captions only (this will leave some duplicate images, each with a different caption, but should remove all complete duplicates)
b - Keep multiple labels per caption, covering every image the caption appears with (in this case it is probably still worth dedupping duplicate images)
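
For illustration, here is a minimal sketch of exact-image dedup plus the two options, not the actual mteb task code. It assumes each row is a dict with hypothetical `image` (raw bytes) and `text` (caption) fields, and it uses a SHA-256 of the bytes as the image key, so it only catches byte-identical files; re-encoded copies would need something like the image hashes mentioned above.

```python
import hashlib
from collections import defaultdict


def image_key(image_bytes: bytes) -> str:
    """Exact-duplicate key: SHA-256 of the raw image bytes (assumption)."""
    return hashlib.sha256(image_bytes).hexdigest()


def drop_exact_duplicates(rows):
    """Keep the first row for each distinct (image, caption) pair."""
    seen, kept = set(), []
    for row in rows:
        key = (image_key(row["image"]), row["text"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def dedup_captions(rows):
    """Option a: keep only the first row for each caption."""
    seen, kept = set(), []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            kept.append(row)
    return kept


def multi_label_qrels(rows):
    """Option b: map each caption to every distinct image it appears with."""
    qrels = defaultdict(set)
    for row in drop_exact_duplicates(rows):
        qrels[row["text"]].add(image_key(row["image"]))
    return {text: sorted(keys) for text, keys in qrels.items()}


# Toy rows standing in for the dataset (field names are hypothetical).
rows = [
    {"image": b"img-A", "text": "caption one"},
    {"image": b"img-A", "text": "caption one"},  # complete duplicate
    {"image": b"img-B", "text": "caption one"},  # same caption, different image
    {"image": b"img-B", "text": "caption two"},  # same image, different caption
]

option_a = dedup_captions(rows)
option_b = multi_label_qrels(rows)
```

On the toy rows, option a keeps one row per caption and so drops the adversarial pairing of "caption one" with the second image, while option b keeps both images as valid matches for that caption.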

Labels: mieb (The image extension of MTEB)
