[Feature] Add Bradley-Terry Subjective Evaluation method to Arena Hard dataset #1802

Merged
merged 4 commits into from
Jan 3, 2025

Conversation

acylam
Collaborator

@acylam acylam commented Jan 3, 2025

Motivation

Added Bradley-Terry subjective evaluation method to arena_hard dataset.

Modification

  • Added Bradley-Terry subjective evaluation method to arena_hard dataset

BC-breaking (Optional)

No breaking changes.

Use cases (Optional)

Perform subjective evaluation using the Bradley-Terry method with the following command:

opencompass configs/eval_subjective_bradleyterry.py -r latest --mode=all

More details about the Bradley-Terry evaluation method are available in: opencompass/configs/datasets/subjective/compass_arena_subjective_bench/README_pairwise_bt.md
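For readers unfamiliar with the method, here is a rough sketch of how Bradley-Terry strengths can be estimated from pairwise win/loss outcomes. It is not the code added in this PR: the function name `fit_bradley_terry`, the `(winner, loser)` input format, and the iterative update used are assumptions chosen for illustration, and the actual summarizer may handle ties, covariates, and output formatting differently.

```python
from collections import defaultdict


def fit_bradley_terry(matches, n_iters=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper for illustration only. Assumes winner != loser
    in every match. Returns a dict mapping model name to a strength,
    normalized so that all strengths sum to 1.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # number of games per unordered pair
    models = set()
    for winner, loser in matches:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strengths = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iters):
        updated = {}
        for m in models:
            # Minorization-maximization update:
            # p_m <- wins_m / sum_j n_mj / (p_m + p_j)
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strengths[m] + strengths[other])
            updated[m] = wins[m] / denom
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}
        delta = max(abs(updated[m] - strengths[m]) for m in models)
        strengths = updated
        if delta < tol:
            break
    return strengths


def win_probability(strengths, a, b):
    """Probability that model a beats model b under the fitted strengths."""
    return strengths[a] / (strengths[a] + strengths[b])


# Toy data: model_a beats model_b in 3 of 4 judged comparisons.
toy_matches = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
toy_strengths = fit_bradley_terry(toy_matches)
print(win_probability(toy_strengths, "model_a", "model_b"))
```

With only two models, the fitted win probability simply reproduces the observed win rate; the model becomes more informative when many models are compared against each other unevenly.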

…d bradleyterry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets; added all_scores output files for reference in CompassArenaBradleyTerrySummarizer;
@acylam acylam merged commit f871e80 into open-compass:main Jan 3, 2025
8 checks passed
stephen-nju pushed a commit to stephen-nju/opencompass that referenced this pull request May 14, 2025
…d dataset (open-compass#1802)

* added base_models_abbrs to references (passed from LMEvaluator); added bradleyterry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets; added all_scores output files for reference in CompassArenaBradleyTerrySummarizer;

* added bradleyterry subjective evaluation method to arena_hard dataset