[Feature] Added Bradley-Terry subjective evaluation method to wildbench, alpacaeval, and compassarena datasets #1791

acylam · 2024-12-27T11:36:31Z

Motivation

Added Bradley-Terry subjective evaluation method to wildbench, alpacaeval, and compassarena datasets.

Modification

Added Bradley-Terry subjective evaluation method to wildbench, alpacaeval, and compassarena datasets.
Added base_models_abbrs (from the dataset config) to references during subjective evaluation (passed to references in LMEvaluator). Post-processor functions can use it to determine the base models for pairwise subjective comparisons.
Added support for base_models in the regular (non-Bradley-Terry) subjective evaluation post-processors for the wildbench, alpacaeval, and compassarena datasets. Previously, the base model for comparison was assumed to be answer1 in the first record of references (references[0]["answer1"]), which does not necessarily hold when infer_order is set to "random". Using the base_models passed from the dataset configuration ensures that the correct base models are used.
Added all_scores summary files for reference in CompassArenaBradleyTerrySummarizer.
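
The base-model lookup described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper name resolve_base_models, the record key "base_models", and the model names are hypothetical, and the actual layout of references in OpenCompass may differ.

```python
def resolve_base_models(references, base_key="base_models"):
    """Return the base models for a pairwise comparison.

    Prefers the value carried in the references (hypothetical key
    `base_key`); falls back to the old assumption that the base model
    is `answer1` of the first record.
    """
    for record in references:
        configured = record.get(base_key)
        if configured:
            # Accept either a single abbreviation or a list of them.
            if isinstance(configured, str):
                return [configured]
            return list(configured)
    # Legacy behaviour: only reliable when the answer order is fixed.
    return [references[0]["answer1"]]


# With a random answer order, the first record may list the candidate
# model as answer1, so the legacy guess picks the wrong model.
references = [
    {"answer1": "candidate-model", "answer2": "baseline-model",
     "base_models": ["baseline-model"]},
    {"answer1": "baseline-model", "answer2": "candidate-model",
     "base_models": ["baseline-model"]},
]

legacy_guess = references[0]["answer1"]
resolved = resolve_base_models(references)
```

Here legacy_guess names the candidate model, while resolved names the configured baseline, which is the mismatch this change is meant to avoid.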

BC-breaking (Optional)

No breaking changes.

Use cases (Optional)

Perform subjective evaluation using the Bradley-Terry method with the following command:

opencompass configs/eval_subjective_bradleyterry.py -r latest --mode=all

More details about the Bradley-Terry evaluation method are in: opencompass/configs/datasets/subjective/compass_arena_subjective_bench/README_pairwise_bt.md
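
For readers unfamiliar with the method, the sketch below fits Bradley-Terry strengths from pairwise outcomes using the classic iterative (minorization-maximization) update. It is an illustration under stated assumptions only: the function name fit_bradley_terry and the (winner, loser) input format are hypothetical, ties are ignored, and the fitting procedure used in this PR may well differ (see the README above).

```python
from collections import defaultdict


def fit_bradley_terry(matches, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping each model to a strength, normalized so the
    strengths sum to 1. Under the model, P(i beats j) = p_i / (p_i + p_j).
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    models = set()
    for winner, loser in matches:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    models = sorted(models)
    strengths = {m: 1.0 / len(models) for m in models}

    for _ in range(n_iter):
        updated = {}
        for i in models:
            # Sum n_ij / (p_i + p_j) over every opponent j that i has faced.
            denom = 0.0
            for j in models:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            updated[i] = wins[i] / denom if denom else strengths[i]

        total = sum(updated.values())
        updated = {m: v / total for m, v in updated.items()}

        delta = max(abs(updated[m] - strengths[m]) for m in models)
        strengths = updated
        if delta < tol:
            break
    return strengths


# Toy data: model A beats model B in 3 of 4 comparisons.
two_model = fit_bradley_terry([("A", "B")] * 3 + [("B", "A")])

# Toy data: A usually beats B, and B usually beats C.
three_model = fit_bradley_terry(
    [("A", "B")] * 3 + [("B", "A")] + [("B", "C")] * 3 + [("C", "B")]
)
```

In the two-model case the fitted strength of A should be about 0.75, matching its observed win rate; in the three-model case the strengths should rank A above B above C even though A and C never meet directly. A model with no wins receives a strength of zero in this simple version, so real data usually calls for regularization or a different estimator.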

MaiziXiao

Verified by @bittersweet1999

acylam added 2 commits December 27, 2024 11:15

added base_models_abbrs to references (passed from LMEvaluator); added bradleyterry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets; added all_scores output files for reference in CompassArenaBradleyTerrySummarizer;

20ddf96

Merge branch 'main' into subj_bradleyterry

d9f7c80

acylam added the Enhancement New feature or request label Dec 27, 2024

acylam requested review from tonysy and bittersweet1999 December 27, 2024 11:36

mm-assistant bot assigned tonysy Dec 27, 2024

acylam temporarily deployed to prod December 27, 2024 11:36 — with GitHub Actions Inactive

acylam assigned MaiziXiao Dec 27, 2024

MaiziXiao approved these changes Dec 31, 2024

MaiziXiao merged commit dc6035c into open-compass:main Dec 31, 2024
8 checks passed
