Conversation

@dwzhu-pku (Contributor) commented Apr 17, 2024

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command (see the sketch after this list).
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the POINTS.md file. (I will add this information when the arXiv version of our paper is out.)
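
A minimal sketch of running one of the checklist models on a new task via the Python API (the task name below is an assumption, standing in for one of the classes added in this PR):

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Evaluate one of the required checklist models on a single LongEmbed task.
model = SentenceTransformer("intfloat/multilingual-e5-small")
evaluation = MTEB(tasks=["LEMBNarrativeQARetrieval"])  # task name assumed
evaluation.run(model, output_folder="results/multilingual-e5-small")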

@isaac-chung (Collaborator) left a comment

Could you also add results from running mteb run -m {model_name} -t {task_name} please?

@KennethEnevoldsen (Contributor) commented:

@isaac-chung, if you want to take charge of some of these reviews that would be great. If so I can add you to the list of reviewers for MMTEB (see overview issue)

@KennethEnevoldsen (Contributor) left a comment

Wonderful addition! Seems like we are still missing a bit of metadata as well as model scores.

@isaac-chung (Collaborator) commented:

> @isaac-chung, if you want to take charge of some of these reviews that would be great. If so I can add you to the list of reviewers for MMTEB (see overview issue)

@KennethEnevoldsen Yes please, feel free to add me.

@dwzhu-pku (Contributor Author) commented:

Thanks to @KennethEnevoldsen and @isaac-chung for your timely review! I have updated the PR, adding some model scores and metadata. Specifically, I have included results for all of the datasets except Passkey and Needle, which require an additional context_length parameter; results for different context_length values just overwrite each other. For the metadata, I have added some key information such as domains and forms. However, I also find some fields very confusing. For example, what do you mean by the socioeconomic_status of a dataset?

@KennethEnevoldsen (Contributor) left a comment

Tried to give reasonable directions on the metadata. Let us know if the documentation is lacking.

> Specifically, I have included results for all of the datasets except Passkey and Needle, which require an additional context_length parameter; results for different context_length values just overwrite each other.

Can't you just use the default?

eval_splits=[_EVAL_SPLIT],
eval_langs=["eng-Latn"],
main_score="ndcg_at_10",
date=None,
@KennethEnevoldsen (Contributor):

When the text was created (from, to), or your best guess.

@dwzhu-pku (Contributor Author):

Take Narrative, for example: do you mean when the original dataset was created (which was back in 2020), or when it was curated for retrieval (which could be today)?

@KennethEnevoldsen (Contributor):

Text creation time, so if it e.g. includes books written from 1800 to 2000, use that range. These can be approximate guesses. Given that the dataset was created in 2020, that at least sets the upper bound.

@dwzhu-pku (Contributor Author):

I see, thank you!
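
(For illustration only: such a guess can be recorded as a tuple of ISO dates, e.g. date=("1800-01-01", "2020-12-31"), where the exact bounds here are placeholders rather than values from this PR.)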

form=["written"],
domains=["Fiction", "Non-fiction"],
task_subtypes=["Article retrieval"],
license=None,
@KennethEnevoldsen (Contributor):

Required; if not specified, then use "Not specified".

@dwzhu-pku (Contributor Author):

ok

domains=["Fiction", "Non-fiction"],
task_subtypes=["Article retrieval"],
license=None,
socioeconomic_status=None,
@KennethEnevoldsen (Contributor):

The socioeconomic status of the text creators (lawyers is high, social media is mixed, news articles would be medium, but depends on the outlet)

It is just a rough estimate

@dwzhu-pku (Contributor Author):

OK, I see now. So the documentation here is referring to the text creators, rather than the data itself, right? https://github.com/embeddings-benchmark/mteb/blob/main/mteb/abstasks/TaskMetadata.py#L139

@KennethEnevoldsen (Contributor):

Ahh right, yea that is not very clear. Feel free to suggest a reformulation and add a point to (bug)fixes for yourself.

text_creation=None,
bibtex_citation=None,
n_samples={_EVAL_SPLIT: 10449},
avg_character_length=None,
@KennethEnevoldsen (Contributor):

You can use calculate_metadata_metrics from the retrieval AbsTask for this.

@dwzhu-pku (Contributor Author):

I see. Actually, what confuses me is the documentation saying that for retrieval tasks, 'this should be the average character length of the query-document pairs'. What is the average character length of a pair? Should I report a pair of average lengths, or the average length of the concatenation of query and document?

@KennethEnevoldsen (Contributor):

I agree it isn't too clear, which is why we specified the function (to at least make it consistent).
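A rough sketch of how that function might be invoked (the import path and class name are assumptions, and the method is assumed to take no arguments once the task data is loaded):

from mteb.tasks.Retrieval import LEMBNarrativeQARetrieval  # import path and class name assumed

task = LEMBNarrativeQARetrieval()
task.load_data()                   # populate corpus, queries, and qrels
task.calculate_metadata_metrics()  # reports n_samples and avg_character_length per split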

annotations_creators="derived",
dialect=None,
text_creation=None,
bibtex_citation=None,
@KennethEnevoldsen (Contributor):

It seems like it is from a paper, so it should be specified.

@dwzhu-pku (Contributor Author):

ok

socioeconomic_status=None,
annotations_creators="derived",
dialect=None,
text_creation=None,
@KennethEnevoldsen (Contributor):

How was the text created? E.g. "Found" online

@dwzhu-pku (Contributor Author):

I think it partly overlaps with annotations_creators. Isn't it enough to set annotations_creators to derived, if I have just transformed an existing dataset for retrieval?

@KennethEnevoldsen (Contributor):

The reason it is different is that the text_creation (e.g. tweets, "found") isn't necessarily the same as the annotations_creators (e.g. annotating them for sentiment, "human/expert annotated").

> Isn't it enough to set annotations_creators to derived, if I have just transformed an existing dataset for retrieval?

We definitely prefer explicit annotations.

license=None,
socioeconomic_status=None,
annotations_creators="derived",
dialect=None,
@KennethEnevoldsen (Contributor):

assuming there are no dialects:

Suggested change
dialect=None,
dialect=[],

@dwzhu-pku (Contributor Author):

great

@dwzhu-pku (Contributor Author):

Thanks a lot for your elaboration! @KennethEnevoldsen

@dwzhu-pku (Contributor Author) commented:

Many thanks to @KennethEnevoldsen for the detailed elaboration on each metadata parameter. I have filled them all in, and added results on Passkey and Needle using a 512 context length.
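
For reference, a filled-in metadata block along the lines discussed above might look roughly like this (the date range, socioeconomic status, citation, and average character length are illustrative placeholders, not the exact values in the PR; _EVAL_SPLIT is the task module's split constant):

from mteb.abstasks.TaskMetadata import TaskMetadata

metadata = TaskMetadata(
    # name, description, reference, dataset, type, and category omitted for brevity
    eval_splits=[_EVAL_SPLIT],
    eval_langs=["eng-Latn"],
    main_score="ndcg_at_10",
    date=("2020-01-01", "2023-12-31"),  # placeholder guess at text creation time
    form=["written"],
    domains=["Fiction", "Non-fiction"],
    task_subtypes=["Article retrieval"],
    license="Not specified",
    socioeconomic_status="medium",  # rough estimate of the text creators
    annotations_creators="derived",
    dialect=[],
    text_creation="found",
    bibtex_citation="""@article{placeholder2024, title={...}}""",  # cite the source paper
    n_samples={_EVAL_SPLIT: 10449},
    avg_character_length={_EVAL_SPLIT: 1000.0},  # placeholder; compute with calculate_metadata_metrics
)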

@KennethEnevoldsen (Contributor) left a comment

Thanks! Seems like there are a few odd scores we should probably look into, but other than that the PR looks good.

"mrr_at_3": 0.86667,
"mrr_at_5": 0.87067,
"ndcg_at_1": 0.8,
"ndcg_at_10": 0.90614,
@KennethEnevoldsen (Contributor):

Seems like performance here is quite high. Does that correspond with your expectations? Not too familiar with the dataset, but it might be worth either using ndcg_at_1 or whatever they use in the paper.

@dwzhu-pku (Contributor Author):

see below.

Comment on lines +21 to +27
"ndcg_at_1": 1.0,
"ndcg_at_10": 1.0,
"ndcg_at_100": 1.0,
"ndcg_at_1000": 1.0,
"ndcg_at_20": 1.0,
"ndcg_at_3": 1.0,
"ndcg_at_5": 1.0,
@KennethEnevoldsen (Contributor):

Perfect scores?

@dwzhu-pku (Contributor Author):

Yes, it is expected. The passkey and needle tests have various evaluation lengths: {256, 512, 1024, 2048, 4096, 8192, 16384, 32768}. (That's why we need an additional context_length parameter.) This is designed to measure the effective context window length of embedding models. The reported scores are for the 512 context length. Since embedding models typically have a context length of 512 or above, the perfect score here is not strange; after all, passkey retrieval itself is very easy. Needle retrieval is a bit harder, so we did not observe 100% accuracy there.

@KennethEnevoldsen (Contributor):

Ahh thanks for clarifying. This seems like they should have their own task subtypes.

A potential solution could simply be to include an equal percentage of each size; then models with a higher context length would perform better. Another option would be to use the splits to differentiate, e.g. "test_256", "test_512", etc.?

@dwzhu-pku (Contributor Author):

Actually, this also bothers me in the implementation. The first solution, including an equal percentage of each size, may not work, as I do not want the candidate documents for each length to be mixed together. As for the second option, I have used the split to specify the corpus, queries, and qrels. As a result, I cannot figure out a better way than passing the context_length parameter to do the filtering.

@KennethEnevoldsen (Contributor):

Well the "test" split seems to be used in the data structure (not needed on HF), so splitting it up during the the dataset_transform() seems like a valid approach.

@dwzhu-pku (Contributor Author):

Makes sense, I'll try it. Really appreciate your help.

@KennethEnevoldsen (Contributor):

No worries, sorry for the long review period, but this seems like a good thing to get right early on in the process.

@dwzhu-pku (Contributor Author):

Hi, I have changed the code to use test_256, test_512, ... instead of the context_length parameter. Looks much better now, really appreciate your reminder! Results are also updated. Would you like to take a look? By the way, we have made our repo publicly available here: https://github.com/dwzhu-pku/LongEmbed. Feel free to take a look!

@KennethEnevoldsen (Contributor) left a comment

Discovered a few prints, which I would remove.

Comment on lines 67 to 70
print("Example Query")
print(list(queries.values())[5])
print("Example Passage (truncate at 200 characters)")
print(list(corpus.values())[5]["text"][:200])
@KennethEnevoldsen (Contributor):

Suggested change
print("Example Query")
print(list(queries.values())[5])
print("Example Passage (truncate at 200 characters)")
print(list(corpus.values())[5]["text"][:200])

Comment on lines 78 to 81
print("Example Query")
print(list(queries.values())[5])
print("Example Passage (truncate at 200 characters)")
print(list(corpus.values())[5]["text"][:200])
@KennethEnevoldsen (Contributor):

delete

Comment on lines 91 to 94
print("Example Query")
print(list(queries.values())[5])
print("Example Passage (truncate at 200 characters)")
print(list(corpus.values())[5]["text"][:200])
@KennethEnevoldsen (Contributor):

delete

Comment on lines 67 to 70
print("Example Query")
print(list(queries.values())[5])
print("Example Passage (truncate at 200 characters)")
print(list(corpus.values())[5]["text"][:200])
@KennethEnevoldsen (Contributor):

delete

Comment on lines 67 to 70
print("Example Query")
print(list(queries.values())[5])
print("Example Passage (truncate at 200 characters)")
print(list(corpus.values())[5]["text"][:200])
@KennethEnevoldsen (Contributor):

delete

Comment on lines 82 to 85
print("Example Query")
print(list(queries.values())[5])
print("Example Passage (truncate at 200 characters)")
print(list(corpus.values())[5]["text"][:200])
@KennethEnevoldsen (Contributor):

delete

@dwzhu-pku (Contributor Author):

ok, let me do the removal

@Muennighoff (Contributor) commented:

This looks amazing! If you also want a leaderboard tab for your benchmark feel free to open a PR here: https://huggingface.co/datasets/mteb/results with the result files. We could add it under Retrieval similar to https://huggingface.co/spaces/mteb/leaderboard?task=retrieval&language=law

@dwzhu-pku (Contributor Author) commented:

> This looks amazing! If you also want a leaderboard tab for your benchmark feel free to open a PR here: https://huggingface.co/datasets/mteb/results with the result files. We could add it under Retrieval similar to https://huggingface.co/spaces/mteb/leaderboard?task=retrieval&language=law

Thanks!

@KennethEnevoldsen (Contributor) left a comment

This looks really good, very happy with how it has ended up. @Muennighoff, as you suggest, let us add a page to the leaderboard.

  • I found two missing citations.
  • Will you also add the points?
  • We can create the benchmark in a separate PR (this one has already gone on for a while now).

annotations_creators="derived",
dialect=[],
text_creation="found",
bibtex_citation=None,
@KennethEnevoldsen (Contributor):

missing

@KennethEnevoldsen changed the base branch from main to merge_393 on April 18, 2024 at 10:50
@KennethEnevoldsen (Contributor) commented:

Merging into a branch to add points, and then merge.

@KennethEnevoldsen merged commit 69f821f into embeddings-benchmark:merge_393 on Apr 18, 2024
KennethEnevoldsen added a commit that referenced this pull request Apr 18, 2024
* add the longembed benchmark

* add longembed bench & make lint

* add meta data and model scores

* add all metadata and passkey&needle scores

* remove prints

* replace context length with test_256, test_512, ...

Co-authored-by: Dawei Zhu <52273452+dwzhu-pku@users.noreply.github.com>
@tomaarsen (Member) commented:

Excellent work!

  • Tom Aarsen

@dwzhu-pku (Contributor Author) commented:

Hi, I'm ready to add points for this submission. May I ask what kind of information I should provide? I'm still a little bit confused about what to fill in for these fields: @KennethEnevoldsen

{
    "GitHub": "GitHubUser1",
    "New dataset": 2-6,  # 2 points for the dataset and 4 points for the task
    "New task": 2, # e.g. a new style of task (e.g. classification, or retrieval)
    "Dataset annotations": 1, # 1 point for each full dataset annotation
    "Bug fixes": 2-10, # depends on the complexity of the fix
    "Running Models": 1, # pr model run
    "Review PR": 2, # two points pr. reviewer, can be given to multiple reviewers
    "Paper Writing": NA, 
    "Ideation": NA,
    "Coordination": NA
}
