Add support for the LongEmbed benchmark #393
Conversation
Could you also add results from running mteb run -m {model_name} -t {task_name}, please?
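For reference, a rough Python equivalent of that CLI command (a minimal sketch; the task and model names below are only examples, with LEMBNarrativeQARetrieval assumed as one of the LongEmbed task names in this PR):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# One of the small reference models from the checklist; any SentenceTransformer-compatible model works.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# "LEMBNarrativeQARetrieval" is an assumed task name for illustration.
evaluation = MTEB(tasks=["LEMBNarrativeQARetrieval"])
evaluation.run(model, output_folder="results")
```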
@isaac-chung, if you want to take charge of some of these reviews that would be great. If so I can add you to the list of reviewers for MMTEB (see overview issue).
Wonderful addition! Seems like we are still missing a bit of metadata as well as model scores.
@KennethEnevoldsen Yes please, feel free to add me.
Thanks to @KennethEnevoldsen and @isaac-chung for your timely review! I have updated the PR, adding some model scores and metadata. Specifically, I have included results on all six datasets except for Passkey and Needle, which require an additional parameter of context_length, and results for different context_length values just overwrite each other.
Tried to give reasonable directions on the metadata. Let us know if the documentation is lacking.
Specifically, I have included results on all six datasets except for Passkey and Needle, which require an additional parameter of context_length, and results for different context_length values just overwrite each other.
Can't you just use the default?
eval_splits=[_EVAL_SPLIT],
eval_langs=["eng-Latn"],
main_score="ndcg_at_10",
date=None,
When the text was created (from, to), or a best guess.
Take Narrative for example: do you mean when the original dataset was created (which was back in 2020), or when it was curated for retrieval (which could be today)?
Text creation time, so e.g. the range 1800-2000 if it includes books written in that period. These can be approximate guesses. Given that the dataset was created in 2020, that at least sets the upper bound.
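For example, something along these lines would be a reasonable value here (the range itself is only a guess, assuming the (start, end) date-string format used elsewhere in TaskMetadata):

```python
# Approximate guess: texts span roughly 1800-2000, with the 2020 dataset release as the upper bound.
date=("1800-01-01", "2020-12-31"),
```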
I see, thank you!
form=["written"], | ||
domains=["Fiction", "Non-fiction"], | ||
task_subtypes=["Article retrieval"], | ||
license=None, |
This field is required; if the license is not specified, use "Not specified".
ok
domains=["Fiction", "Non-fiction"], | ||
task_subtypes=["Article retrieval"], | ||
license=None, | ||
socioeconomic_status=None, |
The socioeconomic status of the text creators (lawyers would be high, social media is mixed, news articles would be medium, but it depends on the outlet).
It is just a rough estimate.
OK, I see now. So the documentation here is referring to the text creators, rather than the data itself, right? https://github.com/embeddings-benchmark/mteb/blob/main/mteb/abstasks/TaskMetadata.py#L139
Ahh right, yea that is not very clear. Feel free to suggest a reformulation and add a point to (bug)fixes for yourself.
text_creation=None,
bibtex_citation=None,
n_samples={_EVAL_SPLIT: 10449},
avg_character_length=None,
You can use the calculate_metadata_metrics from the retrieval abstask for this.
I see. Actually, what confuses me is the documentation saying that for retrieval tasks 'this should be the average character length of the query-document pairs'. What is the average character length of a pair? Should I report a pair of averages, or the average length of the concatenation of query and document?
I agree it isn't too clear, which is why we specified the function (to at least make it consistent).
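Roughly speaking, the kind of statistic in question is sketched below (a hypothetical helper for intuition only, not the actual calculate_metadata_metrics implementation):

```python
def avg_character_length(corpus: dict, queries: dict) -> float:
    """Hypothetical sketch: mean character length over all documents and queries combined."""
    doc_lens = [len(doc["text"]) for doc in corpus.values()]
    query_lens = [len(query) for query in queries.values()]
    return sum(doc_lens + query_lens) / (len(doc_lens) + len(query_lens))
```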
annotations_creators="derived", | ||
dialect=None, | ||
text_creation=None, | ||
bibtex_citation=None, |
It seems like it is from a paper, so it should be specified.
ok
socioeconomic_status=None,
annotations_creators="derived",
dialect=None,
text_creation=None,
How was the text created? E.g. "found" online.
I think it partly overlaps with annotations_creators. Isn't it enough to set annotations_creators to derived, if I have just transformed an existing dataset for retrieval?
The reason why it is different is that the text_creation (e.g. tweets, "found") isn't necessarily the same as the annotations_creators (e.g. annotating them for sentiment, "human/expert annotated").
Isn't it enough to set annotations_creators to derived, if I have just transformed an existing dataset for retrieval?
We definitely prefer explicit annotations.
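So for this dataset the combination being discussed would look roughly like the following (the values match what ends up in the final metadata further down; the comments are just one reading of them):

```python
text_creation="found",           # the long source documents were found/collected, not written for this task
annotations_creators="derived",  # relevance labels are derived from the existing QA/summarization data
```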
license=None,
socioeconomic_status=None,
annotations_creators="derived",
dialect=None,
assuming there are no dialects:
dialect=None, → dialect=[],
great
Thanks a lot for your elaboration! @KennethEnevoldsen
Many thanks to @KennethEnevoldsen for the detailed elaboration on each metadata parameter. I have filled them all in, and added results on Passkey and Needle using a 512 context length.
Thanks! Seems like there are a few odd scores we should probably look into, but other than that the PR looks good.
"mrr_at_3": 0.86667, | ||
"mrr_at_5": 0.87067, | ||
"ndcg_at_1": 0.8, | ||
"ndcg_at_10": 0.90614, |
Seems like performance here is quite high. Does that correspond with your expectations? Not too familiar with the dataset, but it might be worth either using ndcg_at_1 or whatever they use in the paper.
see below.
"ndcg_at_1": 1.0, | ||
"ndcg_at_10": 1.0, | ||
"ndcg_at_100": 1.0, | ||
"ndcg_at_1000": 1.0, | ||
"ndcg_at_20": 1.0, | ||
"ndcg_at_3": 1.0, | ||
"ndcg_at_5": 1.0, |
Perfect scores?
Yes, it is expected. The Passkey and Needle tests have various evaluation lengths: {256, 512, 1024, 2048, 4096, 8192, 16384, 32768} (that's why we need an additional context_length parameter). This is designed to measure the effective context window length of embedding models. The reported scores are from the 512 context length. Since embedding models typically have a context length of 512 and above, a perfect score here is not strange; after all, passkey retrieval itself is very easy. Needle retrieval is a little bit harder, so we did not observe 100% accuracy there.
Ahh, thanks for clarifying. It seems like these should have their own task subtypes.
A potential solution could simply be to include an equal percentage of each size; then models with a higher context length would perform better. Another option would be to use the splits to differentiate, e.g. "test_256", "test_512", etc.?
Actually, this also bothers me in the implementation. The first solution of including an equal percentage of each size may not work, as I do not want the candidate documents for each length to be mixed together. As for the second option, I have used split to specify corpus, queries and qrels. As a result, I cannot figure out a better way than passing the context_length parameter to do the filtering.
Well the "test" split seems to be used in the data structure (not needed on HF), so splitting it up during the the dataset_transform() seems like a valid approach.
Makes sense, I'll try it. Really appreciate your help.
No worries, sorry for the long review period, but this seems like a good thing to get right early on in the process.
Hi, I have changed the code to use test_256, test_512, ... instead of the context length. Looks much better now, really appreciate your reminder! Results are also updated. Would you like to take a look? By the way, we have made our repo publicly available here: https://github.com/dwzhu-pku/LongEmbed. Feel free to check it out!
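For reference, the length-specific splits can be enumerated along these lines (an illustrative sketch only; the exact list is in the PR's task files):

```python
# Illustrative only; one split per evaluation length.
eval_splits=[f"test_{n}" for n in (256, 512, 1024, 2048, 4096, 8192, 16384, 32768)],
```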
discovered a few prints which I would remove.
print("Example Query") | ||
print(list(queries.values())[5]) | ||
print("Example Passage (truncate at 200 characters)") | ||
print(list(corpus.values())[5]["text"][:200]) |
print("Example Query") | |
print(list(queries.values())[5]) | |
print("Example Passage (truncate at 200 characters)") | |
print(list(corpus.values())[5]["text"][:200]) |
print("Example Query") | ||
print(list(queries.values())[5]) | ||
print("Example Passage (truncate at 200 characters)") | ||
print(list(corpus.values())[5]["text"][:200]) |
delete
print("Example Query") | ||
print(list(queries.values())[5]) | ||
print("Example Passage (truncate at 200 characters)") | ||
print(list(corpus.values())[5]["text"][:200]) |
delete
print("Example Query") | ||
print(list(queries.values())[5]) | ||
print("Example Passage (truncate at 200 characters)") | ||
print(list(corpus.values())[5]["text"][:200]) |
delete
print("Example Query") | ||
print(list(queries.values())[5]) | ||
print("Example Passage (truncate at 200 characters)") | ||
print(list(corpus.values())[5]["text"][:200]) |
delete
print("Example Query") | ||
print(list(queries.values())[5]) | ||
print("Example Passage (truncate at 200 characters)") | ||
print(list(corpus.values())[5]["text"][:200]) |
delete
ok, let me do the removal
This looks amazing! If you also want a leaderboard tab for your benchmark feel free to open a PR here: https://huggingface.co/datasets/mteb/results with the result files. We could add it under Retrieval similar to https://huggingface.co/spaces/mteb/leaderboard?task=retrieval&language=law
Thanks!
This looks really good, very happy with how it has ended up. @Muennighoff as you suggest, let us add a page on the leaderboard.
- I found two missing citations.
- Will you also add the points as well?
- We can create the benchmark in a separate PR (this one has already gone on for a while now).
annotations_creators="derived", | ||
dialect=[], | ||
text_creation="found", | ||
bibtex_citation=None, |
missing
Merging into branch to add points and then merge.
* add the longembed benchmark
* add longembed bench & make lint
* add meta data and model scores
* add all metadata and passkey&needle scores
* remove prints
* replace context length with test_256, test_512, ...

Co-authored-by: Dawei Zhu <52273452+dwzhu-pku@users.noreply.github.com>
Excellent work!
Hi, I'm ready to add points for this submission. May I ask what kind of information I should provide? I think I'm still a little bit confused about what to fill in these fields: @KennethEnevoldsen
Checklist for adding MMTEB dataset

Reason for dataset addition:

- I have tested that the dataset runs with the mteb package.
- I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- Run tests locally to make sure nothing is broken using make test.
- Run the formatter to format the code using make lint.