MuSR Dataset Evaluation #1689
Conversation
opencompass/datasets/musr/musr.py
Outdated
exclude_contrastive_examples (bool): Whether to exclude contrastive examples.
reverse_contrastive_sample (bool): Whether to reverse the selection of contrastive samples.
skip_ablated (bool): Whether to skip ablated samples.
randomize (bool): Whether to randomly shuffle the dataset.
offset (int): Starting offset into the dataset.
sample_size (int): Number of samples to use; None means use all data.
Use English for docstrings and comments
opencompass/datasets/musr/musr.py
Outdated
offset=0,
sample_size=None,
**kwargs):
"""Load the dataset and flatten its fields, constructing prompts while taking self_consistency_n and ablations into account."""
Use English for docstrings and comments
Add an assertion and a Readme.md
LGTM
LGTM
* MuSR Dataset Evaluation
* MuSR Dataset Evaluation: add an assertion and a README.md
MuSR: Multistep Soft Reasoning Dataset
MuSR (Multistep Soft Reasoning) is a dataset designed to evaluate large language models (LLMs) on complex reasoning tasks embedded in natural-language narratives. Created to challenge state-of-the-art models such as GPT-4, MuSR emphasizes nuanced reasoning across different domains, including social and physical reasoning, commonsense reasoning, and planning, with tasks framed in realistic scenarios such as murder mysteries, object placements, and team allocations.
Overview
Purpose
Current large language models can perform complex tasks through prompting techniques like chain-of-thought reasoning, but robust multistep reasoning remains challenging. MuSR addresses this limitation by evaluating LLM performance on tasks that require multistep reasoning in three domains:

- Murder mysteries (social and physical deductive reasoning)
- Object placements (observational and commonsense reasoning)
- Team allocation (social reasoning and planning)
Dataset Construction
MuSR instances are generated using a neurosymbolic synthetic-to-natural narrative generation algorithm. This approach allows for the creation of complex reasoning instances that combine structured reasoning trees with natural language narratives, challenging both direct and nuanced inference capabilities in LLMs.
MuSR's dataset consists of natural-language narratives in the three domains above, each paired with a question, answer choices, and the structured reasoning tree from which the narrative was generated.
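For illustration, a single MuSR-style instance can be thought of as a record like the one below. The field names are hypothetical, chosen to mirror the narrative/question/choices structure described above, not the dataset's actual schema:

```python
# Hypothetical sketch of one MuSR-style instance (field names are
# illustrative, not the dataset's actual schema).
example = {
    'narrative': 'On a quiet evening at the manor, the inspector noticed ...',
    'question': 'Who is the most likely murderer?',
    'choices': ['The butler', 'The gardener', 'The heiress'],
    'answer': 1,  # index into `choices`
}
```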
Dataset Access
The MuSR dataset is publicly available, with instructions provided on the project's GitHub page. You can download the dataset and use the pre-defined prompts or create your own configurations.
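A minimal loading sketch: the keyword arguments mirror the loader docstring reviewed in this PR, while the import path, class name, and `path` value are assumptions and may differ in your checkout.

```python
# Sketch only: the import path, class name, and `path` value are assumptions;
# the keyword arguments mirror the loader docstring reviewed in this PR.
from opencompass.datasets.musr import MuSRDataset  # assumed import path

dataset = MuSRDataset.load(
    path='musr/murder_mysteries',        # hypothetical subset identifier
    exclude_contrastive_examples=False,  # keep contrastive examples
    reverse_contrastive_sample=False,    # do not reverse contrastive selection
    skip_ablated=True,                   # skip ablated samples
    randomize=False,                     # keep the original order
    offset=0,                            # start at the first example
    sample_size=None,                    # None -> use the full dataset
)
```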
Evaluation
Run `opencompass configs/eval_musr.py` to assess LLM performance.
Example Command
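A minimal invocation, assuming you run from the OpenCompass repository root and that `configs/eval_musr.py` is the config added in this PR:

```bash
# Evaluate the models configured in configs/eval_musr.py on MuSR.
opencompass configs/eval_musr.py
```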
Baselines and Results
MuSR includes baseline results for multiple LLMs evaluated with chain-of-thought and advanced reasoning strategies. These benchmarks assess model accuracy on reasoning tasks across the three domains.
Citation
If you use MuSR in your research, please cite:
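```bibtex
@inproceedings{sprague2024musr,
  title={Mu{SR}: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning},
  author={Zayne Sprague and Xi Ye and Kaj Bostrom and Swarat Chaudhuri and Greg Durrett},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}
```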
Details
For further details, please refer to the MuSR paper (arXiv:2310.16049).