Add support for the LongEmbed benchmark #393
Conversation
Could you also add results from running mteb run -m {model_name} -t {task_name}, please?
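For reference, a rough Python equivalent of that CLI command (a minimal sketch; the task and model names below are only examples, with LEMBNarrativeQARetrieval assumed as one of the LongEmbed task names in this PR):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# One of the small reference models from the checklist; any SentenceTransformer-compatible model works.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# "LEMBNarrativeQARetrieval" is an assumed task name for illustration.
evaluation = MTEB(tasks=["LEMBNarrativeQARetrieval"])
evaluation.run(model, output_folder="results")
```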
@isaac-chung, if you want to take charge of some of these reviews that would be great. If so I can add you to the list of reviewers for MMTEB (see overview issue).
Wonderful addition! Seems like we are still missing a bit of metadata as well as model scores.
@KennethEnevoldsen Yes please, feel free to add me.
Thanks to @KennethEnevoldsen and @isaac-chung for your timely review! I have updated the PR, adding some model scores and metadata. Specifically, I have included results on all six datasets except for Passkey and Needle, which require an additional parameter of context_length, and results for different context_length values just overwrite each other.
Tried to give reasonable directions on the metadata. Let us know if the documentation is lacking.
Specifically, I have included results on all six datasets except for Passkey and Needle, which require an additional parameter of context_length, and results for different context_length values just overwrite each other.
Can't you just use the default?
eval_splits=[_EVAL_SPLIT],
eval_langs=["eng-Latn"],
main_score="ndcg_at_10",
date=None,
When the text was created (from, to), or a best guess.
Take Narrative for example: do you mean when the original dataset was created (which was back in 2020), or when it was curated for retrieval (which could be today)?
Text creation time, so e.g. the range 1800-2000 if it includes books written in that period. These can be approximate guesses. Given that the dataset was created in 2020, that at least sets the upper bound.
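For example, something along these lines would be a reasonable value here (the range itself is only a guess, assuming the (start, end) date-string format used elsewhere in TaskMetadata):

```python
# Approximate guess: texts span roughly 1800-2000, with the 2020 dataset release as the upper bound.
date=("1800-01-01", "2020-12-31"),
```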
I see, thank you!
form=["written"], | ||
domains=["Fiction", "Non-fiction"], | ||
task_subtypes=["Article retrieval"], | ||
license=None, |
This field is required; if the license is not specified, use "Not specified".
ok
domains=["Fiction", "Non-fiction"], | ||
task_subtypes=["Article retrieval"], | ||
license=None, | ||
socioeconomic_status=None, |
The socioeconomic status of the text creators (lawyers would be high, social media is mixed, news articles would be medium, but it depends on the outlet).
It is just a rough estimate.
OK, I see now. So the documentation here is referring to the text creators, rather than the data itself, right? https://github.com/embeddings-benchmark/mteb/blob/main/mteb/abstasks/TaskMetadata.py#L139
Ahh right, yea that is not very clear. Feel free to suggest a reformulation and add a point to (bug)fixes for yourself.
text_creation=None,
bibtex_citation=None,
n_samples={_EVAL_SPLIT: 10449},
avg_character_length=None,
You can use the calculate_metadata_metrics from the retrieval abstask for this.
I see. Actually, what confuses me is the documentation saying that for retrieval tasks 'this should be the average character length of the query-document pairs'. What is the average character length of a pair? Should I report a pair of averages, or the average length of the concatenation of query and document?
I agree it isn't too clear, which is why we specified the function (to at least make it consistent).
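Roughly speaking, the kind of statistic in question is sketched below (a hypothetical helper for intuition only, not the actual calculate_metadata_metrics implementation):

```python
def avg_character_length(corpus: dict, queries: dict) -> float:
    """Hypothetical sketch: mean character length over all documents and queries combined."""
    doc_lens = [len(doc["text"]) for doc in corpus.values()]
    query_lens = [len(query) for query in queries.values()]
    return sum(doc_lens + query_lens) / (len(doc_lens) + len(query_lens))
```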
annotations_creators="derived", | ||
dialect=None, | ||
text_creation=None, | ||
bibtex_citation=None, |
It seems like it is from a paper, so it should be specified.
ok
socioeconomic_status=None,
annotations_creators="derived",
dialect=None,
text_creation=None,
How was the text created? E.g. "found" online.
I think it partly overlaps with annotations_creators. Isn't it enough to set annotations_creators to derived, if I have just transformed an existing dataset for retrieval?
The reason why it is different is that the text_creation (e.g. tweets, "found") isn't necessarily the same as the annotations_creators (e.g. annotating them for sentiment, "human/expert annotated").
Isn't it enough to set annotations_creators to derived, if I have just transformed an existing dataset for retrieval?
We definitely prefer explicit annotations.
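So for this dataset the combination being discussed would look roughly like the following (the values match what ends up in the final metadata further down; the comments are just one reading of them):

```python
text_creation="found",           # the long source documents were found/collected, not written for this task
annotations_creators="derived",  # relevance labels are derived from the existing QA/summarization data
```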
license=None,
socioeconomic_status=None,
annotations_creators="derived",
dialect=None,
assuming there are no dialects:
dialect=None, → dialect=[],
great
Thanks a lot for your elaboration! @KennethEnevoldsen
Many thanks to @KennethEnevoldsen for the detailed elaboration on each metadata parameter. I have filled them all in, and added results on Passkey and Needle using a 512 context length.
Thanks! Seems like there are a few odd scores we should probably look into, but other than that the PR looks good.
"mrr_at_3": 0.86667, | ||
"mrr_at_5": 0.87067, | ||
"ndcg_at_1": 0.8, | ||
"ndcg_at_10": 0.90614, |
Seems like performance here is quite high. Does that correspond with your expectations? Not too familiar with the dataset, but it might be worth either using ndcg_at_1 or whatever they use in the paper.
see below.
"ndcg_at_1": 1.0, | ||
"ndcg_at_10": 1.0, | ||
"ndcg_at_100": 1.0, | ||
"ndcg_at_1000": 1.0, | ||
"ndcg_at_20": 1.0, | ||
"ndcg_at_3": 1.0, | ||
"ndcg_at_5": 1.0, |
Perfect scores?
Yes, it is expected. The Passkey and Needle tests have various evaluation lengths: {256, 512, 1024, 2048, 4096, 8192, 16384, 32768} (that's why we need an additional context_length parameter). This is designed to measure the effective context window length of embedding models. The reported scores are from the 512 context length. Since embedding models typically have a context length of 512 and above, a perfect score here is not strange; after all, passkey retrieval itself is very easy. Needle retrieval is a little bit harder, so we did not observe 100% accuracy there.
Ahh, thanks for clarifying. It seems like these should have their own task subtypes.
A potential solution could simply be to include an equal percentage of each size; then models with a higher context length would perform better. Another option would be to use the splits to differentiate, e.g. "test_256", "test_512", etc.?
Actually, this also bothers me in the implementation. The first solution of including an equal percentage of each size may not work, as I do not want the candidate documents for each length to be mixed together. As for the second option, I have used split to specify corpus, queries and qrels. As a result, I cannot figure out a better way than passing the context_length parameter to do the filtering.
Well the "test" split seems to be used in the data structure (not needed on HF), so splitting it up during the the dataset_transform() seems like a valid approach.
Makes sense, I'll try it. Really appreciate your help.
No worries, sorry for the long review period, but this seems like a good thing to get right early on in the process.
Hi, I have changed the code to use test_256, test_512, ... instead of the context length. Looks much better now, really appreciate your reminder! Results are also updated. Would you like to take a look? By the way, we have made our repo publicly available here: https://github.com/dwzhu-pku/LongEmbed. Feel free to check it out!
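For reference, the length-specific splits can be enumerated along these lines (an illustrative sketch only; the exact list is in the PR's task files):

```python
# Illustrative only; one split per evaluation length.
eval_splits=[f"test_{n}" for n in (256, 512, 1024, 2048, 4096, 8192, 16384, 32768)],
```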
discovered a few prints which I would remove.
print("Example Query") | ||
print(list(queries.values())[5]) | ||
print("Example Passage (truncate at 200 characters)") | ||
print(list(corpus.values())[5]["text"][:200]) |
print("Example Query") | |
print(list(queries.values())[5]) | |
print("Example Passage (truncate at 200 characters)") | |
print(list(corpus.values())[5]["text"][:200]) |
print("Example Query") | ||
print(list(queries.values())[5]) | ||
print("Example Passage (truncate at 200 characters)") | ||
print(list(corpus.values())[5]["text"][:200]) |
delete
print("Example Query") | ||
print(list(queries.values())[5]) | ||
print("Example Passage (truncate at 200 characters)") | ||
print(list(corpus.values())[5]["text"][:200]) |
delete
print("Example Query") | ||
print(list(queries.values())[5]) | ||
print("Example Passage (truncate at 200 characters)") | ||
print(list(corpus.values())[5]["text"][:200]) |
delete
print("Example Query") | ||
print(list(queries.values())[5]) | ||
print("Example Passage (truncate at 200 characters)") | ||
print(list(corpus.values())[5]["text"][:200]) |
delete
print("Example Query") | ||
print(list(queries.values())[5]) | ||
print("Example Passage (truncate at 200 characters)") | ||
print(list(corpus.values())[5]["text"][:200]) |
delete
ok, let me do the removal
This looks amazing! If you also want a leaderboard tab for your benchmark feel free to open a PR here: https://huggingface.co/datasets/mteb/results with the result files. We could add it under Retrieval similar to https://huggingface.co/spaces/mteb/leaderboard?task=retrieval&language=law
Thanks!
This looks really good, very happy with how it has ended up. @Muennighoff as you suggest, let us add a page on the leaderboard.
- I found two missing citations.
- Will you also add the points as well?
- We can create the benchmark in a separate PR (this one has already gone on for a while now).
annotations_creators="derived", | ||
dialect=[], | ||
text_creation="found", | ||
bibtex_citation=None, |
missing
Merging into branch to add points and then merge.
* add the longembed benchmark
* add longembed bench & make lint
* add meta data and model scores
* add all metadata and passkey&needle scores
* remove prints
* replace context length with test_256, test_512, ...

Co-authored-by: Dawei Zhu <52273452+dwzhu-pku@users.noreply.github.com>
Excellent work!
Hi, I'm ready to add points for this submission. May I ask what kind of information I should provide? I think I'm still a little bit confused about what to fill in these fields: @KennethEnevoldsen
Checklist for adding MMTEB dataset

Reason for dataset addition:

- I have tested that the dataset runs with the mteb package.
- I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- Run tests locally to make sure nothing is broken using make test.
- Run the formatter to format the code using make lint.