MuSR Dataset Evaluation #1689


Merged
merged 4 commits into from Nov 14, 2024

Conversation

abrohamLee
Contributor

MuSR: Multistep Soft Reasoning Dataset

MuSR (Multistep Soft Reasoning) is a dataset designed to evaluate large language models (LLMs) on complex reasoning tasks embedded in natural language narratives. Built to challenge state-of-the-art models such as GPT-4, MuSR emphasizes nuanced reasoning across domains including social and physical reasoning, commonsense reasoning, and planning, with tasks framed in realistic scenarios such as murder mysteries, object placements, and team allocations.

Overview

Purpose

Current large language models can perform complex tasks through prompting techniques such as chain-of-thought reasoning, but robust multistep reasoning remains challenging for them. MuSR probes these limitations by evaluating LLM performance on multistep reasoning tasks in three domains:

  • Murder Mysteries: Requires social and physical deductive reasoning.
  • Object Placements: Tests observational and theory-of-mind reasoning.
  • Team Allocations: Focuses on social reasoning and constraint satisfaction.

Dataset Construction

MuSR instances are generated using a neurosymbolic synthetic-to-natural narrative generation algorithm. This approach combines structured reasoning trees with natural language narratives, producing complex reasoning instances that test both direct and nuanced inference in LLMs.

MuSR's dataset consists of:

  • Murder Mysteries: Scenarios with suspects, motives, and opportunities requiring deductive inference.
  • Object Placements: Scenarios where individuals' observations inform reasoning about object locations.
  • Team Allocations: Scenarios that simulate social relationships and teamwork for optimal task assignments.

Dataset Access

The MuSR dataset is publicly available, with download instructions provided on the project's GitHub page. You can use the pre-defined prompts or create your own configurations. A quick-start loading sketch is shown below.
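
As a quick start, the data can also be pulled from the Hugging Face Hub with the datasets library. The sketch below assumes the TAUR-Lab/MuSR dataset ID and the split/field names of the public release; these are assumptions for illustration, not details confirmed by this PR:

```python
# Minimal sketch: inspect MuSR via the Hugging Face `datasets` library.
# The dataset ID "TAUR-Lab/MuSR" and the split/field names below are
# assumptions; check the project page for the canonical ones.
from datasets import load_dataset

musr = load_dataset("TAUR-Lab/MuSR")

example = musr["murder_mysteries"][0]
print(example["narrative"][:300])  # the natural-language story
print(example["question"])         # the reasoning question posed about it
print(example["choices"])          # candidate answers to choose from
```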

Evaluation

  1. Install dependencies and configure the environment.
  2. Run evaluations using opencompass configs/eval_musr.py to assess LLM performance.
  3. Analyze results against human performance benchmarks.

Example Command

opencompass configs/eval_musr.py
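
For reference, here is a minimal sketch of what such a config might contain, following the usual OpenCompass pattern of composing dataset and model configs via read_base. The exact import paths below are assumptions; check the repository for the real module names:

```python
# Sketch of an OpenCompass eval config for MuSR; import paths are assumed.
from mmengine.config import read_base

with read_base():
    # Hypothetical module paths, for illustration only.
    from .datasets.musr.musr_gen import musr_datasets
    from .models.qwen.lmdeploy_qwen2_5_7b_instruct import models

datasets = musr_datasets
# OpenCompass evaluates every (model, dataset) pair and reports
# per-split accuracy plus the naive average shown in the tables below.
```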

Baselines and Results

MuSR includes baseline results for multiple LLMs evaluated with chain-of-thought and advanced reasoning strategies. These benchmarks assess model accuracy on reasoning tasks across the three domains.

| Domain           | Baseline Accuracy (GPT-4) | Human Performance |
|------------------|---------------------------|-------------------|
| Murder Mystery   | 80.4%                     | 94.1%             |
| Object Placement | 60.9%                     | 95.0%             |
| Team Allocation  | 68.4%                     | 100%              |

Results measured with OpenCompass:

| dataset | version | metric | mode | internlm2_5-7b-chat-turbomind | qwen2.5-7b-instruct-turbomind | qwen2.5-14b-instruct-turbomind | yi-1.5-9b-chat-turbomind | qwen2.5-32b-instruct-turbomind | glm-4-9b-chat-turbomind | llama-3_1-8b-instruct-turbomind | ministral-8B-instruct-2410-turbomind | gemma-2-9b-it-turbomind | gemma-2-27b-it-turbomind |
|---------|---------|--------|------|------|------|------|------|------|------|------|------|------|------|
| musr_murder_mysteries | a5ce30 | accuracy | gen | 59.20 | 63.20 | 76.00 | 68.80 | 78.80 | 71.20 | 73.60 | 73.60 | 74.80 | 77.20 |
| musr_object_placements | a5ce30 | accuracy | gen | 54.69 | 56.25 | 57.42 | 52.73 | 66.02 | 49.22 | 57.42 | 60.94 | 60.94 | 62.11 |
| musr_team_allocation | a5ce30 | accuracy | gen | 39.20 | 32.40 | 55.60 | 40.00 | 67.60 | 50.40 | 46.00 | 36.40 | 40.80 | 41.20 |
| musr_average | - | naive_average | gen | 51.03 | 50.62 | 63.01 | 53.84 | 70.81 | 56.94 | 59.01 | 56.98 | 58.85 | 60.17 |

Citation

If you use MuSR in your research, please cite:

@misc{sprague2024musrtestinglimitschainofthought,
      title={MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning}, 
      author={Zayne Sprague and Xi Ye and Kaj Bostrom and Swarat Chaudhuri and Greg Durrett},
      year={2024},
      eprint={2310.16049},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2310.16049}, 
}

Details

For further details, please refer to the MuSR paper: https://arxiv.org/abs/2310.16049.

Comment on lines 98 to 103
exclude_contrastive_examples (bool): Whether to exclude contrastive examples.
reverse_contrastive_sample (bool): Whether to reverse the selection of contrastive samples.
skip_ablated (bool): Whether to skip ablated samples.
randomize (bool): Whether to randomly shuffle the dataset.
offset (int): Starting offset into the dataset.
sample_size (int): Number of samples to draw; None means use the full dataset.
Collaborator
Use english for Docstrings and comments

offset=0,
sample_size=None,
**kwargs):
    """Load the dataset and flatten its fields, constructing prompts while accounting for self_consistency_n and ablations."""
Collaborator

Use english for Docstrings and comments

Collaborator

@MaiziXiao MaiziXiao left a comment

LGTM

@MaiziXiao MaiziXiao merged commit e9e4b69 into open-compass:main Nov 14, 2024
8 checks passed
stephen-nju pushed a commit to stephen-nju/opencompass that referenced this pull request May 14, 2025
* MuSR Dataset Evaluation

* MuSR Dataset Evaluation

Add an assertion and a Readme.md