Labels: mieb (The image extension of MTEB)
Description
- The dataset has many duplicates, e.g. https://huggingface.co/datasets/Ahren09/MMSoc_HatefulMemes/viewer/default/train?q=meanwhile+at+the+isis+strip+club. These are currently not accounted for, so I think we waste compute encoding the same images and texts. Can we just dedup based on exact image equality? (There is also code at https://github.com/Muennighoff/vilio/blob/50eb7cc9c901795c394070f7705fc0c5a7667bd2/utils/pandas_scripts.py#L61 that dedups using image hashes.) We could reupload a clean version under mteb/ on HF.
- IIUC the current code assumes there is one match per image or text, but this is not the case because of (a) the duplicates mentioned above, and (b) HatefulMemes containing the same caption and the same image multiple times, in different image/caption combinations. This adversarial construction is one of the main selling points of the dataset's difficulty. Below is a simple example of the same caption paired with two different images (from here).
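The exact-duplicate dedup suggested above could be sketched roughly as follows. This is a minimal illustration, not the vilio script: it assumes each example exposes its raw image bytes under a hypothetical `image_bytes` key and its caption under `text`, and drops repeats of the same (image, caption) pair by hashing the image content.

```python
import hashlib

def digest(data: bytes) -> str:
    """Content hash of the raw image bytes."""
    return hashlib.sha256(data).hexdigest()

def dedup_exact(examples):
    """Keep only the first occurrence of each (image, caption) pair.

    `examples` is an iterable of dicts with hypothetical keys
    "image_bytes" (raw bytes) and "text" (the caption).
    """
    seen, keep = set(), []
    for ex in examples:
        key = (digest(ex["image_bytes"]), ex["text"])
        if key not in seen:
            seen.add(key)
            keep.append(ex)
    return keep
```

Note this only removes byte-identical images; the perceptual image hashes used in the linked vilio script would additionally catch re-encoded near-duplicates.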
So I guess two options are:
a - Dedup captions only (this will leave some duplicate images paired with different captions, but should remove all images that are complete duplicates)
b - Allow multiple labels, i.e. mark every image a caption appears with as relevant (in this case it is probably worth deduping exact-duplicate images as well)
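Option (b) amounts to building a one-to-many relevance mapping instead of assuming one match per query. A rough sketch, under the same hypothetical `image_bytes`/`text` keys as above: collect, for each unique caption, the set of distinct images it co-occurs with, so the evaluation can score any of them as a correct retrieval.

```python
import hashlib
from collections import defaultdict

def build_multilabel_qrels(examples):
    """Map each unique caption to the set of distinct images it pairs with.

    Image identity is taken to be a content hash of the raw bytes, so
    exact-duplicate images collapse to one relevant document.
    """
    qrels = defaultdict(set)
    for ex in examples:
        image_id = hashlib.sha256(ex["image_bytes"]).hexdigest()
        qrels[ex["text"]].add(image_id)
    return qrels
```

Captions mapped to more than one image id are exactly the adversarial cases described above, where the same text appears with different images.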