Conversation

Collaborator

@acylam acylam commented Dec 27, 2024

Motivation

Added Bradley-Terry subjective evaluation method to wildbench, alpacaeval, and compassarena datasets.

Modification

  • Added Bradley-Terry subjective evaluation method to wildbench, alpacaeval, and compassarena datasets.
  • Added base_models_abbrs (from the dataset config) to the references passed to LMEvaluator during subjective evaluation. Post-processor functions can use it to identify the base models in pairwise subjective comparisons.
  • Added support for base_models in the regular (non-Bradley-Terry) subjective evaluation post-processors for the following datasets: wildbench, alpacaeval, compassarena. Previously, the base model for comparison was assumed to be answer1 in the first record of references (references[0]["answer1"]), which does not necessarily hold when infer_order is set to "random". Using the base_models passed from the dataset configuration guarantees that the correct base models are used.
  • Added all_scores summary files for reference in CompassArenaBradleyTerrySummarizer.
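The base-model lookup described in the list above can be sketched roughly as follows. This is an illustration only, not the code added in this PR: the helper name split_base_and_candidate is hypothetical, and the record layout (an "answer2" key alongside the "answer1" key mentioned above, each holding a model abbreviation) is an assumption.

```python
def split_base_and_candidate(reference, base_models):
    """Return (base_model, candidate_model) for one pairwise record.

    Hypothetical helper. Assumes `reference` holds model abbreviations
    under "answer1" and "answer2", and that `base_models` is the list of
    base model abbreviations taken from the dataset configuration.
    """
    first, second = reference["answer1"], reference["answer2"]
    if first in base_models and second not in base_models:
        return first, second
    if second in base_models and first not in base_models:
        # With a random inference order the base model can land in answer2.
        return second, first
    raise ValueError("expected exactly one base model in the pair")


# The result does not depend on which slot the base model occupies.
record_a = {"answer1": "base-model", "answer2": "candidate-model"}
record_b = {"answer1": "candidate-model", "answer2": "base-model"}
print(split_base_and_candidate(record_a, ["base-model"]))
print(split_base_and_candidate(record_b, ["base-model"]))
```

Looking the base model up by name, rather than by position, is what makes the result independent of the inference order.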

BC-breaking (Optional)

No breaking changes.

Use cases (Optional)

Perform subjective evaluation using the Bradley-Terry method with the following command:

opencompass configs/eval_subjective_bradleyterry.py -r latest --mode=all

More details about the Bradley-Terry evaluation method are available in: opencompass/configs/datasets/subjective/compass_arena_subjective_bench/README_pairwise_bt.md
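For readers unfamiliar with the model itself, here is a minimal, self-contained sketch of fitting Bradley-Terry strengths from pairwise outcomes, where the probability that model i beats model j is p_i / (p_i + p_j). It uses the classic iterative (minorization-maximization) update and is not necessarily how the summarizer in this PR estimates its ratings; the function name fit_bradley_terry and the (winner, loser) input format are assumptions made for illustration.

```python
from collections import defaultdict


def fit_bradley_terry(matches, iters=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Illustrative sketch: assumes every pair involves two distinct models
    and ignores ties. Returned strengths are normalized to sum to 1.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for winner, loser in matches:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strengths = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            # Sum of n_ij / (p_i + p_j) over every opponent j that m has faced.
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0.0)
                if n:
                    denom += n / (strengths[m] + strengths[other])
            updated[m] = wins[m] / denom if denom else strengths[m]
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}
        delta = max(abs(updated[m] - strengths[m]) for m in models)
        strengths = updated
        if delta < tol:
            break
    return strengths


# Model "A" beats model "B" in 3 of 4 comparisons.
matches = [("A", "B")] * 3 + [("B", "A")]
strengths = fit_bradley_terry(matches)
print(strengths)
```

With only two models, the fitted win probability strengths["A"] / (strengths["A"] + strengths["B"]) matches the observed win rate of 0.75.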

…d bradleyterry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets; added all_scores output files for reference in CompassArenaBradleyTerrySummarizer;
Collaborator

@MaiziXiao MaiziXiao left a comment

Verified by @bittersweet1999

@MaiziXiao MaiziXiao merged commit dc6035c into open-compass:main Dec 31, 2024
8 checks passed
Labels
Enhancement New feature or request
3 participants