Benchmark | Real User Query | Self-awareness Evaluation | Proverb Reasoning | Generative Task & LLM-as-Judge | Hungarian Language | Comprehensive Hungarian-specific |
---|---|---|---|---|---|---|
WildBench | ✔ | ✘ | ✘ | ✔ | ✘ | ✘ |
SimpleQA, ChineseSimpleQA | ✘ | ✔ | ✘ | ✔ | ✘ | ✘ |
MAPS | ✘ | ✘ | ✔ | ✘ | ✘ | ✘ |
MARC, MMMLU, etc. | ✘ | ✘ | ✘ | ✘ | ✔ | ✘ |
BenchMAX | ✘ | ✘ | ✘ | ✔ | ✔ | ✘ |
MILQA | ✘ | ✘ | ✘ | ✘ | ✔ | ✘ |
HuLU | ✘ | ✘ | ✘ | ✘ | ✔ | ✘ |
OpenHuEval (ours) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Below are the steps for quickly downloading OpenHuEval and running the evaluation with OpenCompass.
First, follow the installation steps for OpenCompass. Note that the evaluation of OpenHuEval requires this forked repo of OpenCompass.
```bash
# Download the OpenHuEval data and link it into OpenCompass's data directory.
git clone https://github.com/opendatalab/OpenHuEval.git ${path_to_OpenHuEval_repo}
cd ${path_to_opencompass}
mkdir -p data
ln -snf ${path_to_OpenHuEval_repo} ./data/OpenHuEval
```
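Before running an evaluation, it is worth confirming that the symlink resolves. A minimal sanity check, run from `${path_to_opencompass}` (a hypothetical helper, not part of the repo):

```python
from pathlib import Path

# Hypothetical sanity check (not part of the repo): confirm that the data
# symlink created above resolves to the OpenHuEval checkout.
data_link = Path("data/OpenHuEval")
assert data_link.exists(), "data/OpenHuEval missing: re-run the ln -snf step"
print("OpenHuEval data resolves to:", data_link.resolve())
```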
```bash
# Use the HuSimpleQA task as an example.
cd ${path_to_opencompass}
# Modify the config file `examples/eval_OpenHuEval_HuSimpleQA.py`:
# uncomment or add the models you want to evaluate.
python run.py examples/eval_OpenHuEval_HuSimpleQA.py -r --dump-eval-details
```
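The model list in the config follows OpenCompass's usual pattern of importing predefined model configs under `read_base()`. A minimal sketch of what the edited section might look like; the exact import path is an assumption, so copy the real entries from the commented-out model list already present in the forked repo's config:

```python
# Hypothetical excerpt of examples/eval_OpenHuEval_HuSimpleQA.py.
from mmengine.config import read_base

with read_base():
    # Each imported module provides a `models` list preconfigured for a backend
    # such as lmdeploy; the module path below is an assumption.
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b_instruct import (
        models as qwen2_5_72b_instruct_models,
    )

models = [*qwen2_5_72b_instruct_models]
```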
The inference and evaluation results will be written to `${path_to_opencompass}/outputs`, like this:
```
outputs
└── eval_OpenHuEval_HuSimpleQA
    └── 20250312_150000
        ├── predictions # model predictions
        │   ├── llama-3_1-8b-instruct-lmdeploy
        │   ├── ...
        │   └── qwen2.5-72b-instruct-lmdeploy
        ├── results # judge evaluations
        │   ├── llama-3_1-8b-instruct-lmdeploy_judged-by--GPT-4o
        │   ├── ...
        │   └── qwen2.5-72b-instruct-lmdeploy_judged-by--GPT-4o
        └── summary # evaluation summary
            ├── judged-by--GPT-4o-capability_en.csv
            └── judged-by--GPT-4o-capability_hu.csv
```
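The summary files are plain CSV, so they can be inspected with any spreadsheet or dataframe tool. A minimal sketch with pandas; the timestamp is just the placeholder from the example tree above, and the column layout depends on OpenCompass's summarizer:

```python
import pandas as pd

# Hypothetical path: the timestamped run directory depends on your run.
summary_csv = (
    "outputs/eval_OpenHuEval_HuSimpleQA/20250312_150000/"
    "summary/judged-by--GPT-4o-capability_en.csv"
)
df = pd.read_csv(summary_csv)
print(df.head())
```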
```bash
cd ${path_to_OpenHuEval_repo}

# Generate the statistics behind Figures 5 and 6 in https://arxiv.org/abs/2503.21500.
# Configure the related parameters in tools/HuSimpleQA/lrm_reasoning_process_analysis.py
# according to tools/HuSimpleQA/README.md before running.
python tools/HuSimpleQA/lrm_reasoning_process_analysis.py

# Generate the statistics behind Figure 9 in https://arxiv.org/abs/2503.21500.
# Configure the related parameters in tools/HuMatchingFIB/main_process.py
# according to tools/HuMatchingFIB/README.md before running.
python tools/HuMatchingFIB/main_process.py
```
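These analysis scripts consume the evaluation outputs produced above. If you want to inspect those files yourself first, here is a minimal sketch; the run timestamp, model name, and JSON layout are assumptions based on OpenCompass's usual dump format, not guaranteed by the repo:

```python
import json
from pathlib import Path

# Hypothetical inspection of a prediction dump under outputs/.
pred_dir = Path(
    "outputs/eval_OpenHuEval_HuSimpleQA/20250312_150000/"
    "predictions/qwen2.5-72b-instruct-lmdeploy"
)
for pred_file in sorted(pred_dir.glob("*.json")):
    with pred_file.open(encoding="utf-8") as f:
        preds = json.load(f)
    print(f"{pred_file.name}: {len(preds)} records")
```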
If you find OpenHuEval useful in your research, please cite:

```bibtex
@inproceedings{yang-etal-2025-openhueval,
  title     = "{O}pen{H}u{E}val: Evaluating Large Language Model on {H}ungarian Specifics",
  author    = "Yang, Haote and Wei, Xingjian and Wu, Jiang and Ligeti-Nagy, No{\'e}mi and Sun, Jiaxing and Wang, Yinfan and Yang, Gy{\H{o}}z{\H{o}} Zijian and Gao, Junyuan and Wang, Jingchao and Jiang, Bowen and Wang, Shasha and Yu, Nanjun and Zhang, Zihao and Hong, Shixin and Liu, Hongwei and Li, Wei and Zhang, Songyang and Lin, Dahua and Wu, Lijun and Pr{\'o}sz{\'e}ky, G{\'a}bor and He, Conghui",
  editor    = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
  booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
  month     = jul,
  year      = "2025",
  address   = "Vienna, Austria",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2025.findings-acl.390/",
  doi       = "10.18653/v1/2025.findings-acl.390",
  pages     = "7464--7520",
  ISBN      = "979-8-89176-256-5",
}
```
This project is released under the Apache 2.0 license.