Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of large language models (LLMs) and for exploring the boundaries and limits of generative AI.
- News
- Tools
- Datasets / Benchmark
- Demos
- Leaderboards
- Papers
- LLM-List
- LLMOps
- Frameworks for Training
- Courses
- Others
- Other Awesome Lists
- Licenses
- Citation
- [2025/08/20] We added the Anthropomorphic-Taxonomy section.
- [2024/04/26] We added the Inference-Speed section.
- [2024/02/26] We added the Coding-Evaluation section.
- [2024/02/08] We added the lighteval tool from Huggingface.
- [2024/01/15] We added CRUXEval, DebugBench, OpenFinData, and LAiW.
- [2023/12/20] We added the RAG-Evaluation section.
- [2023/11/15] We added Instruction-Following-Evaluation and LLMBar for evaluating the instruction following capabilities of LLMs.
- [2023/10/20] We added SuperCLUE-Agent for LLM agent evaluation.
- [2023/09/25] We added ColossalEval from Colossal-AI.
- [2023/09/22] We added the LeaderboardFinder chapter.
- [2023/09/20] We added DeepEval, FinEval, and SuperCLUE-Safety from CLUEbenchmark.
- [2023/09/18] We added OpenCompass from Shanghai AI Lab.
- [2023/08/03] We added new Chinese LLMs: Baichuan and Qwen.
- [2023/06/28] We added AlpacaEval and multiple tools.
- [2023/04/26] We released the V0.1 evaluation list with multiple benchmarks.
Name | Year | Task Type | Institution | Evaluation Focus | Datasets | Url |
---|---|---|---|---|---|---|
MMLU-Pro | 2024 | Multi-Choice Knowledge | TIGER-AI-Lab | Subtle Reasoning, Less Noise | MMLU-Pro | link |
DyVal | 2024 | Dynamic Evaluation | Microsoft | Data Contamination, Complexity Control | DyVal | link |
PertEval | 2024 | General | USTC | Knowledge Capacity | PertEval | link |
LV-Eval | 2024 | Long Text QA | Infinigence-AI | Length Variability, Factuality | 11 Subsets | link |
LLM-Uncertainty-Bench | 2024 | NLP Tasks | Tencent | Uncertainty Quantification | 5 NLP Tasks | link |
CommonGen-Eval | 2024 | Generation | AI2 | Common Sense | CommonGen-lite | link |
MathBench | 2024 | Math | Shanghai AI Lab | Theoretical and practical problem-solving | Various | link |
AIME | 2024 | Math | MAA | American Invitational Mathematics Examination | Various | link |
FrontierMath | 2024 | Math | Epoch AI | Original, challenging mathematics problems | Various | link |
FELM | 2023 | Factuality | HKUST | Factuality | 847 Questions | link |
Just-Eval-Instruct | 2023 | General | AI2 Mosaic | Helpfulness, Explainability | Various | link |
MLAgentBench | 2023 | ML Research | snap-stanford | End-to-End ML Tasks | 15 Tasks | link |
UltraEval | 2023 | General | OpenBMB | Lightweight, Flexible, Fast | Various | link |
FMTI | 2023 | Transparency | Stanford | Model Transparency | 100 Metrics | link |
BAMBOO | 2023 | Long Text | RUCAIBox | Long Text Modeling | 10 Datasets | link |
TRACE | 2023 | Continuous Learning | Fudan University | Continuous Learning | 8 Datasets | link |
ColossalEval | 2023 | General | Colossal-AI | Unified Evaluation | Various | link |
LLMEval² | 2023 | General | AlibabaResearch | Wide and Deep Evaluation | 2,553 Samples | link |
BigBench | 2023 | General | Google | Knowledge, Language, Reasoning | Various | link |
LucyEval | 2023 | General | Oracle | Maturity Assessment | Various | link |
Zhujiu | 2023 | General | IACAS | Comprehensive Evaluation | 51 Tasks | link |
ChatEval | 2023 | Chat | THU-NLP | Human-like Evaluation | Various | link |
FlagEval | 2023 | General | THU | Subjective and Objective Scoring | Various | link |
AlpacaEval | 2023 | General | tatsu-lab | Automatic Evaluation | Various | link |
GPQA | 2023 | General | NYU | Graduate-Level Google-Proof QA | Various | link |
MuSR | 2023 | Reasoning | Zayne Sprague | Narrative-Based Reasoning | 756 | link |
FreshQA | 2023 | Knowledge | FreshLLMs | Current World Knowledge | 599 | link |
AGIEval | 2023 | General | Microsoft | Human-Centric Reasoning | NA | link |
SummEdits | 2023 | General | Salesforce | Inconsistency Detection | 6,348 | link |
ScienceQA | 2022 | Reasoning | UCLA | Science Reasoning | 21,208 | link |
e-CARE | 2022 | Reasoning | HIT | Explainable Causality | 21,000 | link |
BigBench Hard | 2022 | Reasoning | BigBench | Challenging Subtasks | 6,500 | link |
PlanBench | 2022 | Reasoning | ASU | Action Planning | 11,113 | link |
MGSM | 2022 | Math | Google | Grade-school math problems in 10 languages | Various | link |
MATH | 2021 | Math | UC Berkeley | Mathematical Problem Solving | Various | link |
GSM8K | 2021 | Math | OpenAI | Diverse grade school math word problems | Various | link |
SVAMP | 2021 | Math | Microsoft | Arithmetic Reasoning | 1,000 | link |
SpartQA | 2021 | Reasoning | MSU | Textual Spatial QA | 510 | link |
MLSUM | 2020 | General | Thomas Scialom | News Summarization | 535,062 | link |
Natural Questions | 2019 | Language, Reasoning | Google | Search-Based QA | 300,000 | link |
ANLI | 2019 | Language, Reasoning | Facebook AI | Adversarial Reasoning | 169,265 | link |
BoolQ | 2019 | Language, Reasoning | Google | Binary QA | 16,000 | link |
SuperGLUE | 2019 | Language, Reasoning | NYU | Advanced GLUE Tasks | NA | link |
DROP | 2019 | Language, Reasoning | UCI NLP | Paragraph-Level Reasoning | 96,000 | link |
HellaSwag | 2019 | Language, Reasoning | AI2 | Commonsense Inference | 59,950 | link |
Winogrande | 2019 | Language, Reasoning | AI2 | Pronoun Disambiguation | 44,000 | link |
PIQA | 2019 | Language, Reasoning | AI2 | Physical Interaction QA | 18,000 | link |
HotpotQA | 2018 | Language, Reasoning | HotpotQA | Explainable QA | 113,000 | link |
GLUE | 2018 | Language, Reasoning | NYU | Foundational NLU Tasks | NA | link |
OpenBookQA | 2018 | Language, Reasoning | AI2 | Open Book Exams | 12,000 | link |
SQuAD2.0 | 2018 | Language, Reasoning | Stanford University | Unanswerable Questions | 150,000 | link |
ARC | 2018 | Language, Reasoning | AI2 | AI2 Reasoning Challenge | 7,787 | link |
SWAG | 2018 | Language, Reasoning | AI2 | Adversarial Commonsense | 113,000 | link |
CommonsenseQA | 2018 | Language, Reasoning | AI2 | Commonsense Reasoning | 12,102 | link |
RACE | 2017 | Language, Reasoning | CMU | Exam-Style QA | 100,000 | link |
SciQ | 2017 | Language, Reasoning | AI2 | Crowd-Sourced Science | 13,700 | link |
TriviaQA | 2017 | Language, Reasoning | AI2 | Distant Supervision | 650,000 | link |
MultiNLI | 2017 | Language, Reasoning | NYU | Cross-Genre Entailment | 433,000 | link |
SQuAD | 2016 | Language, Reasoning | Stanford University | Wikipedia-Based QA | 100,000 | link |
LAMBADA | 2016 | Language, Reasoning | CIMEC | Discourse Context | 12,684 | link |
MS MARCO | 2016 | Language, Reasoning | Microsoft | Search-Based QA | 1,112,939 | link |
Domain | Name | Institution | Scope of Tasks | Unique Contributions | Url |
---|---|---|---|---|---|
Healthcare | BLURB | Mindrank AI | Six diverse NLP tasks, thirteen datasets | A macro-average score across all tasks | link
Healthcare | Seismometer | Epic | Using local data and workflows | Patient demographics, clinical interventions, and outcomes | link
Healthcare | Medbench | OpenMEDLab | Emphasizes scientific rigor and fairness | 40,041 questions from medical exams and reports | link |
Healthcare | GenMedicalEval | SJTU | 16 majors, 3 training stages, 6 clinical scenarios | Open-ended metrics and automated assessment models | link
Healthcare | PsyEval | SJTU | Six subtasks covering three dimensions | Customized benchmark for mental health LLMs | link
Finance | Fin-Eva | Ant Group | Wealth management, insurance, investment research | Both industrial and academic financial evaluations | link
Finance | FinEval | SUFE-AIFLM-Lab | Multiple-choice QA on finance, economics, accounting | Focuses on high-quality evaluation questions | link |
Finance | OpenFinData | Shanghai AI Lab | Multi-scenario financial tasks | First comprehensive finance evaluation dataset | link
Finance | FinBen | FinAI | 35 datasets across 23 financial tasks | Inductive reasoning, quantitative reasoning | link
Legal | LAiW | Sichuan University | 13 fundamental legal NLP tasks | Divides legal NLP capabilities into three major abilities | link
Legal | LawBench | Nanjing University | Legal entity recognition, reading comprehension | Real-world tasks, "abstention rate" metric | link |
Legal | LegalBench | Stanford University | 162 tasks covering six types of legal reasoning | Enables interdisciplinary conversations | link
Legal | LexEval | Tsinghua University | Uses legal cognitive abilities to organize different tasks | A larger legal evaluation dataset, examining ethical issues | link
Telecom | SPEC5G | Purdue University | Security-related text classification and summarization | 5G protocol analysis automation | link
Telecom | TeleQnA | Huawei(Paris) | General telecom inquiries | Proficiency in telecom-related questions | link |
Telecom | OpsEval | Tsinghua University | Wired network ops, 5G, database ops | Focus on AIOps, evaluates proficiency | link
Telecom | TelBench | SK Telecom | Math modeling, open-ended QA, code generation | Holistic evaluation in telecom | link
Telecom | TelecomGPT | UAE | Telecom Math Modeling, Open QnA and Code Tasks | Holistic evaluation in telecom | link
Telecom | Linguistic | Queen's University | Multiple language-centric tasks | Zero-shot evaluation | link
Telecom | TelcoLM | Orange | Multiple-choice questionnaires | Domain-specific data (800M tokens, 80K instructions) | link
Telecom | ORAN-Bench-13K | GMU | Multiple-choice questions | Open Radio Access Networks (O-RAN) | link
Telecom | Open-Telco Benchmarks | GSMA | Multiple language-centric tasks | Zero-shot evaluation | link
Coding | FullStackBench | ByteDance | Code writing, debugging, code review | Featuring the most recent Stack Overflow QA | link
Coding | StackEval | Prosus AI | 11 real-world scenarios, 16 languages | Evaluation across diverse & practical coding environments | link |
Coding | CodeBenchGen | Various Institutions | Execution-based code generation tasks | Benchmarks scaling with the size and complexity | link
Coding | HumanEval | University of Washington | Rigorous testing | Stricter protocol for assessing correctness of generated code | link
Coding | APPS | University of California | Coding challenges from competitive platforms | Checking problem-solving of generated code on test cases | link
Coding | MBPP | Google Research | Programming problems sourced from various origins | Diverse programming tasks | link
Coding | ClassEval | Tsinghua University | Class-level code generation | Manually crafted, object-oriented programming concepts | link
Coding | CoderEval | Peking University | Pragmatic code generation | Proficiency to generate functional code patches for described issues | link
Coding | MultiPL-E | Princeton University | Neural code generation | Benchmarking neural code generation models | link
Coding | CodeXGLUE | Microsoft | Code intelligence | Wide tasks covering: code-code, text-code, code-text and text-text | link
Coding | EvoCodeBench | Peking University | Evolving code generation benchmark | Aligned with real-world code repositories, evolving over time | link
Name | Year | Task Type | Institution | Category | Datasets | Url |
---|---|---|---|---|---|---|
DiffAware | 2025 | Bias | Stanford | General Bias | 8 datasets | link |
CASE-Bench | 2025 | Safety | Cambridge | Context-Aware Safety | CASE-Bench | link |
Fairness | 2025 | Fairness | PSU | Distributive Fairness | - | - |
HarmBench | 2024 | Safety | UIUC | Adversarial Behaviors | 510 | link |
SimpleQA | 2024 | Safety | OpenAI | Factuality | 4,326 | link |
AgentHarm | 2024 | Safety | BEIS | Malicious Agent Tasks | 110 | link |
StrongReject | 2024 | Safety | dsbowen | Attack Resistance | n/a | link |
LLMBar | 2024 | Instruction | Princeton | Instruction Following | 419 Instances | link |
AIR-Bench | 2024 | Safety | Stanford | Regulatory Alignment | 5,694 | link |
TrustLLM | 2024 | General | TrustLLM | Trustworthiness | 30+ | link |
RewardBench | 2024 | Alignment | AllenAI | Human preference | RewardBench | link |
EQ-Bench | 2024 | Emotion | Paech | Emotional intelligence | 171 Questions | link |
Forbidden | 2023 | Safety | CISPA | Jailbreak Detection | 15,140 | link |
MaliciousInstruct | 2023 | Safety | Princeton | Malicious Intentions | 100 | link |
SycophancyEval | 2023 | Safety | Anthropic | Opinion Alignment | n/a | link |
DecodingTrust | 2023 | Safety | UIUC | Trustworthiness | 243,877 | link |
AdvBench | 2023 | Safety | CMU | Adversarial Attacks | 1,000 | link |
XSTest | 2023 | Safety | Bocconi | Safety Overreach | 450 | link |
OpinionQA | 2023 | Safety | tatsu-lab | Demographic Alignment | 1,498 | link |
SafetyBench | 2023 | Safety | THU | Content Safety | 11,435 | link |
HarmfulQA | 2023 | Safety | declare-lab | Harmful Topics | 1,960 | link |
QHarm | 2023 | Safety | vinid | Safety Sampling | 100 | link |
BeaverTails | 2023 | Safety | PKU | Red Teaming | 334,000 | link |
DoNotAnswer | 2023 | Safety | Libr-AI | Safety Mechanisms | 939 | link |
AlignBench | 2023 | Alignment | THUDM | Alignment, Reliability | Various | link |
IFEval | 2023 | Instruction | Google | Instruction Following | 500 Prompts | link |
ToxiGen | 2022 | Safety | Microsoft | Toxicity Detection | 274,000 | link |
HHH | 2022 | Safety | Anthropic | Human Preferences | 44,849 | link |
RedTeam | 2022 | Safety | Anthropic | Red Teaming | 38,921 | link |
BOLD | 2021 | Bias | Amazon | Bias in Generation | 23,679 | link |
BBQ | 2021 | Bias | NYU | Social Bias | 58,492 | link |
StereoSet | 2020 | Bias | McGill | Stereotype Detection | 4,229 | link |
ETHICS | 2020 | Ethics | Berkeley | Moral Judgement | 134,400 | link |
ToxicityPrompt | 2020 | Safety | AllenAI | Toxicity Assessment | 99,442 | link |
CrowS-Pairs | 2020 | Bias | NYU | Stereotype Measurement | 1,508 | link |
SEAT | 2019 | Bias | Princeton | Encoder Bias | n/a | link |
WinoGender | 2018 | Bias | UMass | Gender Bias | 720 | link |
Name | Organization | Website | Description |
---|---|---|---|
prometheus-eval | prometheus-eval | prometheus-eval | PROMETHEUS 2 is an open-source language model dedicated to evaluation and more powerful than its predecessor. It can closely imitate the judgments of humans and GPT-4, handle both direct evaluation and pairwise ranking formats, and be used with user-defined evaluation criteria. On four direct evaluation benchmarks and four pairwise ranking benchmarks, PROMETHEUS 2 achieves the highest correlation and consistency with human evaluators and proprietary language models among all tested open-source evaluation language models (2024-05-04). |
athina-evals | athina-ai | athina-ai | Athina-ai is an open-source library that provides plug-and-play preset evaluations and a modular, extensible framework for writing and running evaluations. It helps engineers systematically improve the reliability and performance of their large language models through evaluation-driven development. Athina-ai offers a system for evaluation-driven development, overcoming the limitations of traditional workflows, enabling rapid experimentation, and providing customizable evaluators with consistent metrics. |
LeaderboardFinder | Huggingface | LeaderboardFinder | LeaderboardFinder helps you find suitable leaderboards for specific scenarios, a leaderboard of leaderboards (2024-04-02). |
LightEval | Huggingface | lighteval | LightEval is a lightweight framework developed by Hugging Face for evaluating large language models (LLMs). Originally designed as an internal tool for assessing Hugging Face's recently released LLM data processing library datatrove and LLM training library nanotron, it is now open-sourced for community use and improvement. Key features of LightEval include: (1) lightweight design, making it easy to use and integrate; (2) an evaluation suite supporting multiple tasks and models; (3) compatibility with evaluation on CPUs or GPUs, and integration with Hugging Face's acceleration library (Accelerate) and frameworks like Nanotron; (4) support for distributed evaluation, which is particularly useful for evaluating large models; (5) applicability to all benchmarks on the Open LLM Leaderboard; and (6) customizability, allowing users to add new metrics and tasks to meet specific evaluation needs (2024-02-08). |
LLM Comparator | Google | LLM Comparator | A visual analytical tool for comparing and evaluating large language models (LLMs). Compared to traditional human evaluation methods, this tool offers a scalable automated approach to comparative evaluation. It leverages another LLM as an evaluator to demonstrate quality differences between models and provide reasons for these differences. Through interactive tables and summary visualizations, the LLM Comparator helps users understand why models perform well or poorly in specific contexts, as well as the qualitative differences between model responses. Developed in collaboration with Google researchers and engineers, this tool has been widely used internally at Google, attracting over 400 users and evaluating more than 1,000 experiments within three months (2024-02-16). |
Arthur Bench | Arthur-AI | Arthur Bench | Arthur Bench is an open-source evaluation tool designed to compare and analyze the performance of large language models (LLMs). It supports various evaluation tasks, including question answering, summarization, translation, and code generation, and provides detailed reports on LLM performance across these tasks. Key features and advantages of Arthur Bench include: (1) model comparison, enabling the evaluation of different suppliers, versions, and training datasets of LLMs; (2) prompt and hyperparameter evaluation, assessing the impact of different prompts on LLM performance and testing the control of model behavior through various hyperparameter settings; (3) task definition and model selection, allowing users to define specific evaluation tasks and select evaluation targets from a range of supported LLM models; (4) parameter configuration, enabling users to adjust prompts and hyperparameters to finely control LLM behavior; (5) automated evaluation workflows, simplifying the execution of evaluation tasks; and (6) application scenarios such as model selection and validation, budget and privacy optimization, and the transformation of academic benchmarks into real-world performance evaluations. Additionally, it offers comprehensive scoring metrics, supports both local and cloud versions, and encourages community collaboration and project development (2023-10-06). |
llm-benchmarker-suite | FormulaMonks | llm-benchmarker-suite | This open-source initiative aims to address fragmentation and ambiguity in LLM benchmarking. The suite provides a structured methodology, a collection of diverse benchmarks, and toolkits to streamline the assessment of LLM performance. By offering a common platform, this project seeks to promote collaboration, transparency, and high-quality research in NLP. |
autoevals | braintrust | autoevals | AutoEvals is an AI model output evaluation tool that leverages best practices to quickly and easily assess AI model outputs. It integrates multiple automatic evaluation methods, supports customizable evaluation prompts and custom scorers, and simplifies the evaluation process of model outputs. Autoevals incorporates model-graded evaluation for various subjective tasks, including fact-checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented in a flexible way to allow users to tweak prompts and debug outputs. |
EVAL | OpenAI | EVAL | Evals is OpenAI's framework for evaluating large language models (LLMs) and LLM systems; it can be used to test the performance and generalization capabilities of models across different tasks and datasets. |
lm-evaluation-harness | EleutherAI | lm-evaluation-harness | lm-evaluation-harness is a tool developed by EleutherAI for evaluating large language models (LLMs). It can test the performance and generalization capabilities of models across different tasks and datasets. |
lm-evaluation | AI21Labs | lm-evaluation | Evaluation code for reproducing the results from the Jurassic-1 Technical Paper, with current support for running tasks through both the AI21 Studio API and OpenAI's GPT-3 API. |
OpenCompass | Shanghai AI Lab | OpenCompass | OpenCompass is a one-stop platform for evaluating large models. Its main features include: open-source and reproducible evaluation schemes; comprehensive capability dimensions covering five major areas with over 50 datasets and approximately 300,000 questions to assess model capabilities; support for over 20 Hugging Face and API models; distributed and efficient evaluation with one-line task splitting and distributed evaluation, enabling full evaluation of trillion-parameter models within hours; diverse evaluation paradigms supporting zero-shot, few-shot, and chain-of-thought evaluations with standard or conversational prompt templates to easily elicit peak model performance. |
Large language model evaluation and workflow framework from Phase AI | wgryc | phasellm | A framework provided by Phase AI for evaluating and managing LLMs, helping users select appropriate models, datasets, and metrics, as well as visualize and analyze results. |
Evaluation benchmark for LLM | FreedomIntelligence | LLMZoo | LLMZoo is an evaluation benchmark for LLMs developed by FreedomIntelligence, featuring multiple domain and task datasets, metrics, and pre-trained models with results. |
Holistic Evaluation of Language Models (HELM) | Stanford | HELM | HELM is a comprehensive evaluation method for LLMs proposed by the Stanford research team, considering multiple aspects such as model language ability, knowledge, reasoning, fairness, and safety. |
A lightweight evaluation tool for question-answering | Langchain | auto-evaluator | auto-evaluator is a lightweight tool developed by Langchain for evaluating question-answering systems. It can automatically generate questions and answers and calculate metrics such as model accuracy, recall, and F1 score. |
PandaLM | WeOpenML | PandaLM | PandaLM is an LLM assessment tool developed by WeOpenML for automated and reproducible evaluation. It allows users to select appropriate datasets, metrics, and models based on their needs and preferences, and generates reports and charts. |
FlagEval | Tsinghua University | FlagEval | FlagEval is an evaluation platform for LLMs developed by Tsinghua University, offering multiple tasks and datasets, as well as online testing, leaderboards, and analysis functions. |
AlpacaEval | tatsu-lab | alpaca_eval | AlpacaEval is an evaluation tool for LLMs developed by tatsu-lab, capable of testing models across various languages, domains, and tasks, and providing explainability, robustness, and credibility metrics. |
Prompt flow | Microsoft | promptflow | A set of development tools designed by Microsoft to simplify the end-to-end development cycle of AI applications based on LLMs, from conception, prototyping, testing, and evaluation to production deployment and monitoring. It makes prompt engineering easier and enables the development of product-level LLM applications. |
DeepEval | mr-gpt | DeepEval | DeepEval is a simple-to-use, open-source LLM evaluation framework. Similar to Pytest but specialized for unit testing LLM outputs, it incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevance, and RAGAS, utilizing LLMs and various other NLP models that run locally on your machine for evaluation. A minimal pytest-style usage sketch appears after this table. |
CONNER | Tencent AI Lab | CONNER | CONNER is a comprehensive large model knowledge evaluation framework designed to systematically and automatically assess the information generated from six critical perspectives: factuality, relevance, coherence, informativeness, usefulness, and validity. |
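The DeepEval entry above describes a pytest-style workflow for unit testing LLM outputs. The sketch below illustrates that workflow; it assumes the `deepeval` package is installed and an LLM judge backend is configured (e.g., an OpenAI API key), and the class and parameter names follow DeepEval's documented API but may differ across versions.

```python
# Minimal sketch of a DeepEval-style unit test for an LLM output.
# Assumes `pip install deepeval` and a configured judge model (e.g., OPENAI_API_KEY);
# names follow DeepEval's documented pytest-style API and may differ across versions.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace with the output produced by the LLM application under test.
        actual_output="You can return them within 30 days for a full refund.",
        retrieval_context=["All customers are eligible for a 30-day full refund at no extra cost."],
    )
    metric = AnswerRelevancyMetric(threshold=0.7)  # the test fails below this score
    assert_test(test_case, [metric])
```

Such tests are typically collected and run like ordinary pytest tests (or via DeepEval's own test runner).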
Name | Organization | Website | Description |
---|---|---|---|
MMLU-Pro | TIGER-AI-Lab | MMLU-Pro | MMLU-Pro is an improved version of the MMLU dataset. MMLU has long been a reference for multiple-choice knowledge datasets. However, recent studies have shown that it contains noise (some questions are unanswerable) and is too easy (due to the evolution of model capabilities and increased contamination). MMLU-Pro provides ten options instead of four, requires reasoning on more questions, and has undergone expert review to reduce noise. It is of higher quality and more challenging than the original. MMLU-Pro reduces the impact of prompt variations on model performance, a common issue with its predecessor MMLU. Research indicates that models using "Chain of Thought" reasoning perform better on this new benchmark, suggesting that MMLU-Pro is better suited for evaluating the subtle reasoning abilities of AI. (2024-05-20) |
TrustLLM Benchmark | TrustLLM | TrustLLM | TrustLLM is a benchmark for evaluating the trustworthiness of large language models. It covers six dimensions of trustworthiness and includes over 30 datasets to comprehensively assess the functional capabilities of LLMs, ranging from simple classification tasks to complex generative tasks. Each dataset presents unique challenges and has benchmarked 16 mainstream LLMs (including commercial and open-source models). |
DyVal | Microsoft | DyVal | Concerns have been raised about the potential data contamination in the vast training corpora of LLMs. Additionally, the static nature and fixed complexity of current benchmarks may not adequately measure the evolving capabilities of LLMs. DyVal is a general and flexible protocol for dynamically evaluating LLMs. Leveraging the advantages of directed acyclic graphs, DyVal dynamically generates evaluation samples with controllable complexity. It has created challenging evaluation sets for reasoning tasks such as mathematics, logical reasoning, and algorithmic problems. Various LLMs, from Flan-T5-large to GPT-3.5-Turbo and GPT-4, have been evaluated. Experiments show that LLMs perform worse on DyVal-generated samples of different complexities, highlighting the importance of dynamic evaluation. The authors also analyze failure cases and results of different prompting methods. Furthermore, DyVal-generated samples not only serve as evaluation sets but also aid in fine-tuning to enhance LLM performance on existing benchmarks. A toy sketch of this dynamic-generation idea appears after this table. (2024-04-20) |
RewardBench | AllenAI | RewardBench | RewardBench is an evaluation benchmark for language model reward models, assessing the strengths and weaknesses of various models. It reveals that existing models still exhibit significant shortcomings in reasoning and instruction following. It includes a Leaderboard, Code, and Dataset (2024-03-20). |
LV-Eval | Infinigence-AI | LVEval | LV-Eval is a long-text evaluation benchmark featuring five length tiers (16k, 32k, 64k, 128k, and 256k), with a maximum text test length of 256k. The average text length of LV-Eval is 102,380 characters, with a minimum/maximum text length of 11,896/387,406 characters. LV-Eval primarily consists of two types of evaluation tasks: single-hop QA and multi-hop QA, encompassing 11 sub-datasets in Chinese and English. During its design, LV-Eval introduced three key technologies: Confusion Facts Insertion (CFI) to enhance challenge, Keyword and Phrase Replacement (KPR) to reduce information leakage, and Answer Keywords (AK) based evaluation metrics (combining answer keywords and word blacklists) to improve the objectivity of evaluation results (2024-02-06). |
LLM-Uncertainty-Bench | Tencent | LLM-Uncertainty-Bench | A new benchmark method for LLMs has been introduced, incorporating uncertainty quantification. Based on nine LLMs tested across five representative NLP tasks, it was found that: I) More accurate LLMs may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty than smaller models; III) Instruction fine-tuning tends to increase the uncertainty of LLMs. These findings underscore the importance of including uncertainty in LLM evaluations (2024-01-22). |
Psychometrics Eval | Microsoft Research Asia | Psychometrics Eval | Microsoft Research Asia has proposed a generalized evaluation method for AI based on psychometrics, aiming to address limitations in traditional evaluation methods concerning predictive power, information volume, and test tool quality. This approach draws on psychometric theories to identify key psychological constructs of AI, design targeted tests, and apply Item Response Theory for precise scoring. It also introduces concepts of reliability and validity to ensure evaluation reliability and accuracy. This framework extends psychometric methods to assess AI performance in handling unknown complex tasks but also faces open questions such as distinguishing between AI "individuals" and "populations," addressing prompt sensitivity, and evaluating differences between human and AI constructs (2023-10-19). |
CommonGen-Eval | AllenAI | CommonGen-Eval | A study using the CommonGen-lite dataset to evaluate LLMs, employing GPT-4 for assessment and comparing the performance of different models, with results listed on the leaderboard (2024-01-04). |
felm | HKUST | felm | FELM is a meta-benchmark for evaluating the factual assessment of large language models. The benchmark comprises 847 questions spanning five distinct domains: world knowledge, science/technology, writing/recommendation, reasoning, and mathematics. Prompts corresponding to each domain are gathered from various sources, including standard datasets like TruthfulQA, online platforms like GitHub repositories, ChatGPT-generated prompts, or those drafted by authors. For each response, fine-grained annotation at the segment level is employed, including reference links, identified error types, and reasons behind these errors as provided by annotators (2023-10-03). |
just-eval | AI2 Mosaic | just-eval | A GPT-based evaluation tool for multi-faceted and explainable assessment of LLMs, capable of evaluating aspects such as helpfulness, clarity, factuality, depth, and engagement (2023-12-05). |
EQ-Bench | EQ-Bench | EQ-Bench | A benchmark for evaluating the emotional intelligence of language models, featuring 171 questions (compared to 60 in v1) and a new scoring system that better distinguishes performance differences among models (2023-12-20). |
CRUXEval | MIT CSAIL | CRUXEval | CRUXEval is a benchmark for evaluating code reasoning, understanding, and execution. It includes 800 Python functions and their input-output pairs, testing input prediction and output prediction tasks. Many models that perform well on HumanEval underperform on CRUXEval, highlighting the need for improved code reasoning capabilities. The best model, GPT-4 with chain-of-thought (CoT), achieved pass@1 rates of 75% and 81% for input prediction and output prediction, respectively. The benchmark exposes gaps between open-source and closed-source models. GPT-4 failed to fully pass CRUXEval, providing insights into its limitations and directions for improvement (2024-01-05). |
MLAgentBench | snap-stanford | MLAgentBench | MLAgentBench is a suite of end-to-end machine learning (ML) research tasks for benchmarking AI research agents. These agents aim to autonomously develop or improve an ML model based on a given dataset and ML task description. Each task represents an interactive environment that directly reflects what human researchers encounter. Agents can read available files, run multiple experiments on compute clusters, and analyze results to achieve the specified research objectives. Specifically, it includes 15 diverse ML engineering tasks that can be accomplished by attempting different ML methods, data processing, architectures, and training processes (2023-10-05). |
AlignBench | THUDM | AlignBench | AlignBench is a comprehensive and multi-dimensional benchmark for evaluating the alignment performance of Chinese large language models. It constructs a human-in-the-loop data creation process to ensure dynamic data updates. AlignBench employs a multi-dimensional, rule-based model evaluation method (LLM-as-Judge) and combines chain-of-thought (CoT) to generate multi-dimensional analyses and final comprehensive scores for model responses, enhancing the reliability and explainability of evaluations (2023-12-01). |
UltraEval | OpenBMB | UltraEval | UltraEval is an open-source foundational model capability evaluation framework offering a lightweight and easy-to-use evaluation system that supports mainstream large model performance assessments. Its key features include: (1) a lightweight and user-friendly evaluation framework with intuitive design, minimal dependencies, easy deployment, and good scalability for various evaluation scenarios; (2) flexible and diverse evaluation methods with unified prompt templates and rich evaluation metrics, supporting customization; (3) efficient and rapid inference deployment supporting multiple model deployment solutions, including torch and vLLM, and enabling multi-instance deployment to accelerate the evaluation process; (4) a transparent and open leaderboard with publicly accessible, traceable, and reproducible evaluation results driven by the community to ensure transparency; and (5) official and authoritative evaluation data using widely recognized official datasets to guarantee evaluation fairness and standardization, ensuring result comparability and reproducibility (2023-11-24). |
IFEval | google-research | Instruction Following Eval | Following natural language instructions is a core capability of large language models. However, the evaluation of this capability lacks standardization: human evaluation is expensive, slow, and lacks objective reproducibility, while automated evaluation based on LLMs may be biased by the evaluator LLM's capabilities or limitations. To address these issues, researchers at Google introduced Instruction Following Evaluation (IFEval), a simple and reproducible benchmark focusing on a set of "verifiable instructions," such as "write over 400 words" and "mention the AI keyword at least 3 times." IFEval identifies 25 such verifiable instructions and constructs approximately 500 prompts, each containing one or more verifiable instructions (2023-11-15). A generic sketch of such verifiable-instruction checkers appears after this table. |
LLMBar | princeton-nlp | LLMBar | LLMBar is a challenging meta-evaluation benchmark designed to test the ability of LLM evaluators to identify instruction-following outputs. It contains 419 instances, each consisting of an instruction and two outputs: one faithfully and correctly following the instruction, and the other deviating from it. Each instance also includes a gold label indicating which output is objectively better (2023-10-29). |
HalluQA | Fudan, Shanghai AI Lab | HalluQA | HalluQA is a Chinese LLM hallucination evaluation benchmark, featuring 450 data points including 175 misleading entries, 69 hard misleading entries, and 206 knowledge-based entries. Each question has an average of 2.8 correct and incorrect answers annotated. To enhance the usability of HalluQA, the authors designed a GPT-4-based evaluation method. Specifically, hallucination criteria and correct answers are input as instructions to GPT-4, which evaluates whether the model's response contains hallucinations. |
FMTI | Stanford | FMTI | The Foundation Model Transparency Index (FMTI) evaluates the transparency of developers in model training and deployment across 100 indicators, including data, computational resources, and labor. Evaluations of flagship models from 10 companies reveal an average transparency score of only 37/100, indicating significant room for improvement. |
ColossalEval | Colossal-AI | ColossalEval | A project by Colossal-AI offering a unified evaluation workflow for assessing language models on public datasets or custom datasets using traditional metrics and GPT-assisted evaluations. |
LLMEval²-WideDeep | Alibaba Research | LLMEval² | Constructed as the largest and most diverse English evaluation benchmark for LLM evaluators, featuring 15 tasks, 8 capabilities, and 2,553 samples. Experimental results indicate that a wider network (involving many reviewers) with two layers (one round of discussion) performs best, improving the Kappa correlation coefficient from 0.28 to 0.34. WideDeep is also utilized to assist in evaluating Chinese LLMs, accelerating the evaluation process by 4.6 times and reducing costs by 60%. A short example of computing such a Kappa agreement score appears after this table. |
Aviary | Ray Project | Aviary | Enables interaction with various large language models (LLMs) in one place. Direct comparison of different model outputs, ranking by quality, and obtaining cost and latency estimates are supported. It particularly supports models hosted on Hugging Face and in many cases, also supports DeepSpeed inference acceleration. |
Do-Not-Answer | Libr-AI | Do-Not-Answer | An open-source dataset designed to evaluate the safety mechanisms of LLMs at a low cost. It consists of prompts that responsible language models should not respond to. In addition to human annotations, it implements model-based evaluation, where a fine-tuned 600M-parameter BERT-like evaluator achieves results comparable to human and GPT-4 evaluation. |
LucyEval | Oracle | LucyEval | Chinese LLM maturity evaluation—LucyEval can objectively test various aspects of model capabilities, identify model shortcomings, and help designers and engineers more accurately adjust and train models, aiding LLMs in advancing toward greater intelligence. |
Zhujiu | Institute of Automation, CAS | Zhujiu | Covers seven capability dimensions and 51 tasks; employs three complementary evaluation methods; offers comprehensive Chinese benchmarking with English evaluation capabilities. |
ChatEval | THU-NLP | ChatEval | ChatEval aims to simplify the human evaluation process of generated text. Given different text fragments, the roles in ChatEval (played by LLM agents) can autonomously discuss nuances and differences, providing judgments based on their designated roles. |
FlagEval | Zhiyuan/Tsinghua | FlagEval | Produced by Zhiyuan, combining subjective and objective scoring to offer LLM score rankings. |
InfoQ Comprehensive LLM Evaluation | InfoQ | InfoQ Evaluation | Chinese-oriented ranking: ChatGPT > Wenxin Yiyan > Claude > Xinghuo. |
Chain-of-Thought Evaluation | Yao Fu | COT Evaluation | Includes rankings for GSM8k and MATH complex problems. |
Z-Bench | ZhenFund | Z-Bench | Indicates that domestic Chinese models still have relatively low programmability, with minimal performance differences between models; the two versions of ChatGLM show significant improvement. |
CMU Chatbot Evaluation | CMU | zeno-build | In conversational scenarios, rankings show ChatGPT > Vicuna > others. |
lmsys-arena | Berkeley | lmsys Ranking | Utilizes an Elo rating mechanism, with rankings showing GPT-4 > Claude > GPT-3.5 > Vicuna > others. A minimal sketch of the Elo update appears after this table. |
Huggingface Open LLM Leaderboard | Huggingface | HF Open LLM Leaderboard | Organized by Huggingface, this leaderboard evaluates multiple mainstream open-source LLMs. Evaluations focus on four datasets: AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA, primarily in English. |
AlpacaEval | tatsu-lab | AlpacaEval | Open-source model leader where Vicuna, OpenChat, and WizardLM lead based on LLM-based automatic evaluations. |
Chinese-LLM-Benchmark | jeinlee1991 | llm-benchmark | Chinese LLM capability evaluation rankings covering Baidu Ernie Bot, ChatGPT, Alibaba Tongyi Qianwen, iFLYTEK Xinghuo, and open-source models like Belle and ChatGLM6B. It provides capability score rankings and original model output results. |
Stanford Question Answering Dataset (SQuAD) | Stanford NLP Group | SQuAD | Evaluates model performance on reading comprehension tasks. |
Multi-Genre Natural Language Inference (MultiNLI) | New York University, DeepMind, Facebook AI Research, Allen Institute for AI, Google AI Language | MultiNLI | Evaluates the model's ability to understand sentence relationships across different text genres. |
LogiQA | Tsinghua University and Microsoft Research Asia | LogiQA | Evaluates the model's logical reasoning capabilities. |
HellaSwag | University of Washington and Allen Institute for AI | HellaSwag | Evaluates the model's reasoning capabilities. |
The LAMBADA Dataset | University of Trento and Fondazione Bruno Kessler | LAMBADA | Evaluates the model's ability to predict the last word of a paragraph, reflecting long-term understanding capabilities. |
CoQA | Stanford NLP Group | CoQA | Evaluates the model's ability to understand text paragraphs and answer a series of interrelated questions in conversational settings. |
ParlAI | Facebook AI Research | ParlAI | Evaluates model performance in accuracy, F1 score, perplexity (the model's ability to predict the next word in a sequence), human evaluation (relevance, fluency, and coherence), speed and resource utilization, robustness (model performance under varying conditions such as noisy inputs, adversarial attacks, or changes in data quality), and generalization capabilities. |
Language Interpretability Tool (LIT) | Google | LIT | Provides a platform for evaluating models based on user-defined metrics, analyzing model strengths, weaknesses, and potential biases. |
Adversarial NLI (ANLI) | Facebook AI Research, New York University, Johns Hopkins University, University of Maryland, Allen Institute for AI | Adversarial NLI (ANLI) | Evaluates the model's robustness, generalization capabilities, reasoning explanation abilities, consistency, and resource efficiency (memory usage, inference time, and training time). |
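The DyVal entry above generates evaluation samples with controllable complexity so that test items cannot simply have leaked into training data. The toy sketch below illustrates the idea with a chain-structured arithmetic generator; it is only a simplified illustration of dynamic sample generation, not Microsoft's DyVal implementation.

```python
# Toy illustration of DyVal-style dynamic sample generation: compose a random
# arithmetic expression of controllable depth and derive a question/answer pair.
# This is a simplified sketch of the idea, not the official DyVal code.
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def generate_sample(depth: int, seed: int = 0):
    """Deeper expressions yield harder, freshly generated evaluation samples."""
    rng = random.Random(seed)
    value = rng.randint(1, 9)
    expr = str(value)
    for _ in range(depth):
        op = rng.choice(list(OPS))
        operand = rng.randint(1, 9)
        expr = f"({expr} {op} {operand})"
        value = OPS[op](value, operand)
    return f"What is the value of {expr}?", value


question, answer = generate_sample(depth=4, seed=42)
print(question, "->", answer)
```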
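The IFEval entry above is built around "verifiable instructions" whose satisfaction can be checked with simple programs rather than an LLM judge. The sketch below is a generic illustration of such checkers; the function names and the aggregation are hypothetical and are not the official google-research IFEval code.

```python
import re

# Generic illustration of IFEval-style "verifiable instruction" checkers.
# These checkers are hypothetical examples, not the official IFEval implementation.


def check_min_words(response: str, min_words: int = 400) -> bool:
    """Instruction: 'write over 400 words'."""
    return len(response.split()) > min_words


def check_keyword_count(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
    """Instruction: 'mention the AI keyword at least 3 times'."""
    return len(re.findall(rf"\b{re.escape(keyword)}\b", response)) >= min_count


def instruction_accuracy(responses, checkers) -> float:
    """Fraction of responses that satisfy their paired verifiable instruction."""
    results = [checker(response) for response, checker in zip(responses, checkers)]
    return sum(results) / len(results)
```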
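The LLMEval² entry above reports judge quality as a Kappa correlation coefficient against human annotators. The snippet below shows how such an agreement score can be computed with scikit-learn; the verdict labels are made up for illustration, and this is not the benchmark's own evaluation pipeline.

```python
# Illustrative computation of judge/human agreement with Cohen's kappa.
# The verdicts below are invented; LLMEval²/WideDeep's real pipeline is more involved.
from sklearn.metrics import cohen_kappa_score

human_verdicts = ["A", "B", "A", "tie", "B", "A", "A", "B"]
judge_verdicts = ["A", "B", "B", "tie", "B", "A", "tie", "B"]

kappa = cohen_kappa_score(human_verdicts, judge_verdicts)
print(f"Cohen's kappa between the LLM judge and human annotators: {kappa:.2f}")
```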
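The lmsys-arena entry above ranks models with an Elo-style rating computed from pairwise human votes. Below is a minimal sketch of the standard Elo update; the K-factor and starting rating are illustrative defaults, and the arena's production ranking uses more elaborate statistics, so this is only a conceptual example.

```python
# Minimal Elo update for pairwise model battles, in the spirit of lmsys-arena.
# K-factor and initial ratings are illustrative, not the arena's exact settings.


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """outcome_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - e_a)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - e_a))
    return new_a, new_b


# Example: two models start at 1000 and model A wins one battle.
r_a, r_b = elo_update(1000.0, 1000.0, outcome_a=1.0)
print(round(r_a), round(r_b))  # 1016 984
```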
Name | Institution | Field | URL | Introduction |
---|---|---|---|---|
Seismometer | Epic | Healthcare | seismometer | Seismometer is an AI model performance evaluation tool for the healthcare field, providing standardized evaluation criteria to help make decisions based on local data and workflows. It supports continuous monitoring of model performance. Although it can be used for models in any field, it was designed with a focus on validation for healthcare AI models where local validation requires cross-referencing data about patients (such as demographics, clinical interventions, and patient outcomes) and model performance. (2024-05-22) |
Medbench | OpenMEDLab | Healthcare | medbench | MedBench is committed to creating a scientific, fair, and rigorous evaluation system and open platform for Chinese medical large models. Based on authoritative medical standards, it continuously updates and maintains high-quality medical datasets to comprehensively and multi-dimensionally quantify the capabilities of models across various medical dimensions. MedBench comprises 40,041 questions sourced from authentic examination exercises and medical reports of diverse branches of medicine. It is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases encompassing examinations, diagnoses, and treatments. (2023-12-20) |
Fin-Eva | Ant Group, Shanghai University of Finance and Economics | Finance | Fin-Eva | Fin-Eva Version 1.0, jointly launched by Ant Group and Shanghai University of Finance and Economics, covers multiple financial scenarios such as wealth management, insurance, and investment research, as well as financial specialty disciplines, with a total of over 13,000 evaluation questions. Ant’s data sources include data from various business fields and publicly available internet data. After processes such as data desensitization, text clustering, corpus screening, and data rewriting, it is combined with reviews from financial experts to construct the dataset. Shanghai University of Finance and Economics’ data sources are primarily based on real questions and simulated questions from authoritative exams in relevant fields, following the requirements of knowledge outlines. Ant’s section covers five major capabilities in finance cognition, financial knowledge, financial logic, content generation, and safety compliance, with 33 sub-dimensions and 8,445 evaluation questions; Shanghai University of Finance and Economics’ section covers four major areas: finance, economics, accounting, and certificates, including 4,661 questions across 34 different disciplines. Fin-Eva Version 1.0 adopts multiple-choice questions with fixed answers, accompanied by corresponding instructions to enable models to output in a standard format (2023-12-20) |
GenMedicalEval | SJTU | Healthcare | GenMedicalEval | 1. Large-scale comprehensive performance evaluation: GenMedicalEval constructs a total of over 100,000 medical evaluation data covering 16 major departments, 3 stages of physician training, 6 medical clinical application scenarios, based on over 40,000 medical examination real questions and over 55,000 patient medical records from top-tier hospitals. This dataset comprehensively evaluates the overall performance of large models in real medical complex scenarios from aspects such as medical basic knowledge, clinical application, and safety standards, addressing the shortcomings of existing evaluation benchmarks that fail to cover many practical challenges in medical practice. 2. In-depth multi-dimensional scenario evaluation: GenMedicalEval integrates physicians’ clinical notes and medical imaging materials, building a series of diverse and theme-rich generative evaluation questions around key medical scenarios such as examination, diagnosis, and treatment. This provides a strong supplement to existing question-and-answer based evaluations that simulate real clinical environments for open diagnostic processes. 3. Innovative open evaluation metrics and automated evaluation models: To address the challenge of lacking effective evaluation metrics for open generative tasks, GenMedicalEval employs advanced structured extraction and terminology alignment techniques to build an innovative generative evaluation metric system. This system accurately measures the medical knowledge accuracy of generated answers. Furthermore, it trains a medical automatic evaluation model based on its self-built knowledge base, which has a high correlation with human evaluations. The model provides multi-dimensional medical scores and evaluation reasons. Its features include no data leakage and being controllable, giving it unique advantages compared to other models like GPT-4 (2023-12-08) |
OpenFinData | Shanghai Artificial Intelligence Laboratory | Finance | OpenFinData | OpenFinData, the first full-scenario financial evaluation dataset based on the "OpenCompass" framework, released by the Shanghai Artificial Intelligence Laboratory, comprises six modules and nineteen financial task dimensions, covering multi-level data types and diverse financial scenarios. Each piece of data originates from actual financial business scenarios (2024-01-04) |
LAiW | Sichuan University | Legal | LAiW | From a legal perspective and for feasibility, LAiW divides legal NLP capabilities into three major abilities comprising 13 basic tasks: (1) basic legal NLP abilities, which evaluate basic legal tasks, basic NLP tasks, and legal information extraction through five basic tasks: legal clause recommendation, element recognition, named entity recognition, judicial point summarization, and case identification; (2) basic legal application abilities, which evaluate the basic application capabilities of large models in the legal field through five basic tasks: dispute focus mining, case matching, criminal judgment prediction, civil judgment prediction, and legal Q&A; (3) complex legal application abilities, which evaluate the complex application capabilities of large models in the legal field through three basic tasks: judicial reasoning generation, case understanding, and legal consultation (2023-10-08) |
LawBench | Nanjing University | Legal | LawBench | LawBench is meticulously designed to precisely evaluate the legal capabilities of large language models. When designing test tasks, it simulates three dimensions of judicial cognition and selects 20 tasks to assess the capabilities of large models. Compared to some existing benchmarks that only have multiple-choice questions, LawBench includes more task types closely related to real-world applications, such as legal entity recognition, reading comprehension, crime amount calculation, and consultation. LawBench recognizes that the current safety strategies of large models may lead to models refusing to respond to certain legal inquiries or encountering difficulties in understanding instructions, resulting in a lack of responses. Therefore, LawBench has developed a separate evaluation metric, the "abstention rate," to measure the frequency of models refusing to provide answers or failing to correctly understand instructions. Researchers have evaluated the performance of 51 large language models on LawBench, including 20 multilingual models, 22 Chinese models, and 9 legal-specific large language models (2023-09-28) |
PsyEval | SJTU | Psychological | PsyEval | In mental health research, the use of large language models (LLMs) is gaining increasing attention, especially their significant capabilities in disease detection. Researchers have custom-designed the first comprehensive benchmark for the mental health field to systematically evaluate the capabilities of LLMs in this domain. This benchmark includes six sub-tasks covering three dimensions to comprehensively assess the capabilities of LLMs in mental health. Corresponding concise prompts have been designed for each sub-task, and eight advanced LLMs have been comprehensively evaluated (2023-11-15) |
PPTC | Microsoft, PKU | Office | PPTC | PPTC is a benchmark for testing the capabilities of large models in PPT generation, comprising 279 multi-turn conversations covering different topics and hundreds of instructions involving multi-modal operations. The research team has also proposed the PPTX-Match evaluation system, which assesses whether large language models have completed instructions based on predicted files rather than label API sequences. Therefore, it supports various LLM-generated API sequences. Currently, PPT generation faces three main challenges: error accumulation in multi-turn conversations, processing long PPT templates, and multi-modal perception issues (2023-11-04) |
LLMRec | Alibaba | Recommendation | LLMRec | Benchmark testing of popular LLMs (such as ChatGPT, LLaMA, ChatGLM, etc.) has been conducted on five recommendation-related tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization. Additionally, the effectiveness of supervised fine-tuning to enhance the instruction-following capabilities of LLMs has been studied (2023-10-08) |
LAiW | Dai-shen | Legal | LAiW | In response to the rapid development of legal large language models, the first Chinese legal large language model benchmark based on legal capabilities has been proposed. Legal capabilities are divided into three levels: basic legal natural language processing capabilities, basic legal application capabilities, and complex legal application capabilities. The first phase of evaluation has been completed, focusing on the assessment of basic legal natural language processing capabilities. The evaluation results show that while some legal large language models perform better than their base models, there is still a gap compared to ChatGPT (2023-10-25) |
OpsEval | Tsinghua University | AIOps | OpsEval | OpsEval is a comprehensive task-oriented AIOps benchmark test for large language models, assessing the proficiency of LLMs in three key scenarios: wired network operations, 5G communication operations, and database operations. These scenarios involve different capability levels, including knowledge recall, analytical thinking, and practical application. The benchmark comprises 7,200 questions in multiple-choice and Q&A formats, supporting both English and Chinese (2023-10-02) |
SWE-bench | princeton-nlp | Software | SWE-bench | SWE-bench is a benchmark for evaluating the performance of large language models on real software issues collected from GitHub. Given a code repository and a problem, the task of the language model is to generate a patch that can solve the described problem |
BLURB | Mindrank AI | Healthcare | BLURB | BLURB includes a comprehensive benchmark test for biomedical natural language processing applications based on PubMed, as well as a leaderboard for tracking community progress. BLURB comprises six diverse tasks and thirteen publicly available datasets. To avoid overemphasizing tasks with many available datasets (e.g., named entity recognition NER), BLURB reports the macro-average across all tasks as the primary score. The BLURB leaderboard is model-agnostic; any system that can generate test predictions using the same training and development data can participate. The primary goal of BLURB is to lower the barrier to participation in biomedical natural language processing and help accelerate progress in this important field that has a positive impact on society and humanity |
SmartPlay | Microsoft | Gaming | SmartPlay | SmartPlay is a large language model (LLM) benchmark designed for ease of use, offering a variety of games for testing |
FinEval | SUFE-AIFLM-Lab | Finance | FinEval | FinEval: A collection of high-quality multiple-choice questions covering fields such as finance, economics, accounting, and certificates |
GSM8K | OpenAI | Mathematics | GSM8K | GSM8K is a dataset of 8.5K high-quality linguistically diverse elementary school math word problems. GSM8K divides them into 7.5K training problems and 1K test problems. These problems require 2 to 8 steps to solve, with solutions primarily involving performing a series of basic arithmetic operations (+ - / *) to reach the final answer |
Name | Institution | URL | Introduction |
---|---|---|---|
BERGEN | NAVER | BERGEN | BERGEN (BEnchmarking Retrieval-augmented GENeration) is a library for benchmarking RAG systems, focusing on question answering (QA). Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in a RAG pipeline; BERGEN is designed to ease reproducibility and the integration of new datasets and models through HuggingFace (2024-05-31) |
CRAG | Meta Reality Labs | CRAG | CRAG is a factual question-answering RAG benchmark of 4,409 question-answer pairs and mock APIs that simulate web and Knowledge Graph (KG) search. It covers a diverse array of questions across five domains and eight question categories, reflecting entity popularity from popular to long-tail and temporal dynamism ranging from years to seconds, and is intended to inspire researchers to improve the reliability and accuracy of QA systems (2024-06-07) |
raga-llm-hub | RAGA-AI | raga-llm-hub | raga-llm-hub is a comprehensive evaluation toolkit for language and learning models (LLMs). With over 100 carefully designed evaluation metrics, it is the most comprehensive platform allowing developers and organizations to effectively evaluate and compare LLMs, and establish basic safeguards for LLM and retrieval-augmented generation (RAG) applications. These tests assess various aspects such as relevance and understanding, content quality, hallucination, safety and bias, context relevance, safeguards, and vulnerability scanning, while providing a series of metric-based tests for quantitative analysis (2024-03-10) |
ARES | Stanford | ARES | ARES is an automatic evaluation framework for retrieval-augmented generation systems, comprising three components: (1) A set of annotated query-document-answer triplets with human preference validations for evaluation criteria such as context relevance, answer faithfulness, and/or answer relevance. There should be at least 50 examples, but preferably several hundred. (2) A small set of examples for scoring context relevance, answer faithfulness, and/or answer relevance in your system. (3) A large number of unannotated query-document-answer triplets generated by your RAG system for scoring. The ARES training process includes three steps: (1) Generating synthetic queries and answers from domain-specific paragraphs. (2) Fine-tuning LLM evaluators for scoring RAG systems by training on the synthetic data. (3) Deploying the prepared LLM evaluators to assess the performance of your RAG system on key metrics (2023-09-27) |
RGB | CAS | RGB | RGB is a corpus/benchmark for evaluating RAG in English and Chinese. It analyzes the performance of LLMs on the four basic capabilities required for RAG: noise robustness, negative rejection, information integration, and counterfactual robustness, and splits the benchmark instances into four independent test sets, one per capability. Six representative LLMs were evaluated on RGB to diagnose the challenges current LLMs face when applying RAG. The evaluation shows that while LLMs exhibit some noise robustness, they still struggle with negative rejection, information integration, and handling false information, indicating there is still a long way to go before RAG can be applied to LLMs effectively (2023-09-04) |
tvalmetrics | TonicAI | tvalmetrics | The metrics in Tonic Validate Metrics use LLM-assisted evaluation: an LLM (e.g., gpt-4) scores different aspects of RAG application outputs. The metrics answer questions such as: (1) Answer similarity: how well does the generated answer match the reference answer? (2) Retrieval precision: is the retrieved context relevant to the question? (3) Augmentation precision: does the answer incorporate the retrieved context that is relevant to the question? (4) Augmentation accuracy: what proportion of the retrieved context appears in the answer? (5) Answer consistency (binary): does the answer contain any information not found in the retrieved context? (6) Retrieval k-recall: among the top-k context vectors, are all of the relevant contexts included in the retrieved context? A minimal scoring sketch follows this table (2023-11-11) |
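
The toolkits above (ARES, raga-llm-hub, Tonic Validate) ultimately reduce RAG evaluation to per-example scores such as retrieval precision and answer faithfulness. The sketch below shows the general shape of such a scorer under simplifying assumptions; the data format is made up, and the keyword-overlap "judge" is a stand-in for the LLM judge these frameworks actually use, not any toolkit's real API.

```python
from dataclasses import dataclass

@dataclass
class RagExample:
    question: str
    retrieved: list[str]   # retrieved context chunks
    relevant: set[int]     # indices of chunks a human marked as relevant
    answer: str            # the generated answer

def retrieval_precision(ex: RagExample) -> float:
    """Fraction of retrieved chunks that are actually relevant to the question."""
    if not ex.retrieved:
        return 0.0
    return sum(i in ex.relevant for i in range(len(ex.retrieved))) / len(ex.retrieved)

def judged_faithfulness(ex: RagExample) -> float:
    """Toy 'judge': a sentence counts as supported if it overlaps a retrieved chunk.
    A real pipeline would replace this with an LLM-judge prompt (ARES-style)."""
    sentences = [s.strip() for s in ex.answer.split(".") if s.strip()]
    supported = sum(
        any(s.lower() in c.lower() or c.lower() in s.lower() for c in ex.retrieved)
        for s in sentences
    )
    return supported / max(len(sentences), 1)

ex = RagExample(
    question="Who wrote Hamlet?",
    retrieved=["Hamlet is a tragedy written by William Shakespeare", "Paris is in France"],
    relevant={0},
    answer="Hamlet is a tragedy written by William Shakespeare.",
)
print(retrieval_precision(ex), judged_faithfulness(ex))  # 0.5 1.0
```
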
Name | Institution | URL | Introduction |
---|---|---|---|
SuperCLUE-Agent | CLUE | SuperCLUE-Agent | SuperCLUE-Agent is a multi-dimensional benchmark focusing on Agent capabilities, covering three core capabilities and ten basic tasks. It evaluates large language models on core Agent capabilities, including tool usage, task planning, and long- and short-term memory. An evaluation of 16 Chinese-supporting large language models found that GPT-4 leads significantly in core Agent capabilities on Chinese tasks, while representative Chinese models, both open-source and closed-source, are approaching the level of GPT-3.5 (2023-10-20) |
AgentBench | Tsinghua University | AgentBench | AgentBench is a systematic benchmark evaluation tool for assessing LLMs as intelligent agents, highlighting the performance gap between commercial LLMs and open-source competitors (2023-08-01) |
AgentBench Reasoning and Decision-making Evaluation Leaderboard | THUDM | AgentBench | Jointly launched by Tsinghua and multiple universities, it covers the reasoning and decision-making capabilities of models in different task environments, such as shopping, home, and operating systems |
ToolBench Tool Invocation Evaluation | BAAI/Tsinghua | ToolBench | Compares tool-finetuned models against ChatGPT and provides evaluation scripts |
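
A core scoring step shared by tool-use benchmarks such as ToolBench and SuperCLUE-Agent is checking whether the model invoked the right tool with the right arguments. The sketch below is an illustrative version of that comparison; the data format and metrics are assumptions for this example, not the benchmarks' actual evaluation scripts.

```python
def score_tool_call(pred: dict, gold: dict) -> dict:
    """Compare a predicted tool invocation against a gold reference.

    Both dicts look like {"tool": "weather", "args": {"city": "Paris"}}.
    Returns tool-name accuracy and an argument-level F1 (illustrative metrics).
    """
    tool_match = float(pred.get("tool") == gold.get("tool"))
    pred_args = set(pred.get("args", {}).items())
    gold_args = set(gold.get("args", {}).items())
    tp = len(pred_args & gold_args)
    precision = tp / len(pred_args) if pred_args else 0.0
    recall = tp / len(gold_args) if gold_args else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tool_match": tool_match, "arg_f1": round(f1, 3)}

pred = {"tool": "weather", "args": {"city": "Paris", "unit": "F"}}
gold = {"tool": "weather", "args": {"city": "Paris", "unit": "C"}}
print(score_tool_call(pred, gold))  # {'tool_match': 1.0, 'arg_f1': 0.5}
```
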
Name | Institution | URL | Introduction |
---|---|---|---|
McEval | Beihang | McEval | To explore the code capabilities of large language models more comprehensively, this work proposes McEval, a massively multilingual, multi-task code evaluation benchmark covering 40 programming languages with 16K test samples. It includes challenging code completion, understanding, and generation tasks, together with the finely curated multilingual instruction corpus McEval-Instruct, substantially pushing the limits of code LLMs in multilingual scenarios. Evaluation results show that open-source models still lag significantly behind GPT-4 in multilingual programming, with most unable to surpass even GPT-3.5, although models such as Codestral, DeepSeek-Coder, CodeQwen, and some of their derivatives show strong multilingual capabilities. The McEval leaderboard can be found here (2024-06-11) |
HumanEval-XL | FloatAI | HumanEval-XL | Existing benchmarks primarily focus on translating English prompts into multilingual code or are limited to a very restricted set of natural languages, overlooking large-scale multilingual NL-to-code generation and leaving an important gap in evaluating multilingual LLMs. To address this, the authors propose HumanEval-XL, a large-scale multilingual code generation benchmark connecting 23 natural languages and 12 programming languages, comprising 22,080 prompts with an average of 8.33 test cases per prompt. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL provides a comprehensive evaluation platform for multilingual LLMs, enabling assessment of their understanding of different NLs. This represents a pioneering step toward closing the gap in NL generalization evaluation for multilingual code generation (2024-02-26) |
DebugBench | Tsinghua University | DebugBench | DebugBench is an LLM debugging benchmark comprising 4,253 instances, covering four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, the authors collected code snippets from the LeetCode community, implanted bugs into the source data using GPT-4, and applied strict quality checks (2024-01-09) |
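
Benchmarks in this family (HumanEval-XL, McEval, DebugBench) mostly score functional correctness: a generated completion passes if it satisfies the task's unit tests, and pass rates are aggregated into metrics like pass@1. The toy example below mimics that loop; the task and tests are invented for illustration and are not items from any of these benchmarks.

```python
# A HumanEval-style task: a prompt (signature + docstring) plus hidden unit tests.
PROMPT = '''
def add_vat(price: float, rate: float = 0.2) -> float:
    """Return the price with value-added tax applied."""
'''

# Pretend this completion came back from a code LLM.
COMPLETION = "    return price * (1 + rate)\n"

def passes_tests(prompt: str, completion: str) -> bool:
    namespace: dict = {}
    exec(prompt + completion, namespace)   # real harnesses sandbox this step
    fn = namespace["add_vat"]
    try:
        assert abs(fn(100.0) - 120.0) < 1e-9
        assert abs(fn(50.0, 0.1) - 55.0) < 1e-9
        return True
    except Exception:   # wrong answers and runtime errors both count as failures
        return False

print(passes_tests(PROMPT, COMPLETION))  # True -> this sample counts toward pass@1
```
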
Name | Institution | URL | Introduction |
---|---|---|---|
ChartVLM | Shanghai AI Lab | ChartVLM | ChartX is a multi-modal evaluation set comprising 18 types of charts, 7 chart tasks, 22 subject themes, and high-quality chart data. Additionally, the authors of this paper have developed ChartVLM, offering a new perspective for handling multi-modal tasks dependent on explainable patterns, such as reasoning tasks in the fields of charts or geometric images (2024-02-19) |
ReForm-Eval | FudanDISC | ReForm-Eval | ReForm-Eval is a benchmark dataset for comprehensively evaluating large visual language models. By reconstructing existing multi-modal benchmark datasets with different task formats, ReForm-Eval constructs a benchmark dataset with a unified format suitable for large model evaluation. The constructed ReForm-Eval has the following features: it spans eight evaluation dimensions, providing sufficient evaluation data for each dimension (averaging over 4,000 entries per dimension); it has a unified evaluation question format (including multiple-choice and text generation questions); it is convenient and easy to use, with reliable and efficient evaluation methods that do not rely on external services like ChatGPT; it efficiently utilizes existing data resources without requiring additional manual annotation and can be further expanded to more datasets (2023-10-24) |
LVLM-eHub | OpenGVLab | LVLM-eHub | "Multi-Modality Arena" is an evaluation platform for large multi-modal models. Following Fastchat, two anonymous models are compared side-by-side on visual question answering tasks. "Multi-Modality Arena" allows side-by-side benchmarking of visual-language models while providing image input. It supports various models such as MiniGPT-4, LLaMA-Adapter V2, LLaVA, and BLIP-2 |
Name | Institution | URL | Introduction |
---|---|---|---|
InfiniteBench | OpenBMB | InfiniteBench | Understanding and processing long text is essential for large models to reach deeper levels of understanding and interaction. While some models claim to handle sequences of 100k+ tokens, standardized benchmarks for this regime have been lacking. InfiniteBench is the first LLM benchmark with an average data length exceeding 100K tokens, targeting five key long-text capabilities: retrieval, mathematics, coding, question answering, and summarization. (1) Long context: the average context length of InfiniteBench test data is 195K tokens, far exceeding existing benchmarks. (2) Multi-domain and multilingual: it includes 12 tasks in both Chinese and English, covering the five domains above. (3) Forward-looking and challenging: the tasks are designed to stress the strongest current models, such as GPT-4 and Claude 2. (4) Realistic and synthetic scenarios: it combines real-world data, which tests the handling of practical problems, with synthetic data, which makes it easy to extend context lengths for testing. The tasks require a thorough understanding of long-range dependencies in the context, so simply retrieving a few passages is not sufficient. (2024-03-19) |
Name | Institution | URL | Introduction |
---|---|---|---|
llmperf | Ray | llmperf | A library for inspecting and benchmarking LLM performance. It measures metrics such as Time to First Token (TTFT), Inter-Token Latency (ITL), and the number of requests that return no data within 3 seconds, and it validates the correctness of LLM outputs, primarily checking for cross-request contamination (e.g., request A receiving the response meant for request B). Input and output token lengths are varied by design to better reflect real-world scenarios. Currently supported endpoints include OpenAI-compatible endpoints (e.g., Anyscale endpoints, private endpoints, OpenAI, Fireworks, etc.), Together, Vertex AI, and SageMaker; a minimal TTFT/ITL timing sketch follows this table. (2023-11-03) |
llm-analysis | Databricks | llm-analysis | Latency and Memory Analysis of Transformer Models for Training and Inference. |
llm-inference-benchmark | Nankai University | llm-inference-benchmark | LLM Inference framework benchmark. |
llm-inference-bench | CentML | llm-inference-bench | This benchmark operates entirely external to any serving framework and can be easily extended and modified. It provides a variety of statistics and profiling modes. Designed as a standalone tool, it enables precise benchmarking with statistically significant results for specific input/output distributions. Each request consists of a single prompt and a single decoding step. |
GPU-Benchmarks-on-LLM-Inference | UIUC | GPU-Benchmarks-on-LLM-Inference | Uses llama.cpp to test the inference speed of LLaMA models on different GPUs, including RunPod, 16-inch M1 Max MacBook Pro, M2 Ultra Mac Studio, 14-inch M3 MacBook Pro, and 16-inch M3 Max MacBook Pro. |
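
The tools above largely report the same few latency statistics. The sketch below shows how Time to First Token (TTFT), Inter-Token Latency (ITL), and throughput can be computed from any streaming generator; the simulated stream stands in for an OpenAI-compatible streaming response and is not llmperf's implementation.

```python
import random
import time
from statistics import median

def simulated_stream(n_tokens: int = 50):
    """Stand-in for a streaming LLM response (e.g., an OpenAI-compatible SSE stream)."""
    time.sleep(0.2)                             # pretend prompt processing before the first token
    for _ in range(n_tokens):
        time.sleep(random.uniform(0.01, 0.03))  # pretend per-token decode time
        yield "tok"

def measure(stream):
    start = time.perf_counter()
    token_times = [time.perf_counter() for _ in stream]
    ttft = token_times[0] - start                                # Time to First Token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # Inter-Token Latencies
    total = token_times[-1] - start
    return {
        "ttft_s": round(ttft, 3),
        "median_itl_s": round(median(itl), 4),
        "tokens_per_s": round(len(token_times) / total, 1),
    }

print(measure(simulated_stream()))
```
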
Name | Institution | URL | Introduction |
---|---|---|---|
LLM-QBench | Beihang/SenseTime | LLM-QBench | LLM-QBench is a benchmark for post-training quantization of large language models and serves as an efficient LLM compression tool with various advanced compression methods. It supports multiple inference backends. (2024-05-09) |
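
As a reminder of what the simplest form of post-training quantization does, the sketch below applies symmetric round-to-nearest INT8 quantization to a toy weight matrix and reports the reconstruction error. This is the naive baseline; the calibrated methods (GPTQ, AWQ, SmoothQuant, etc.) that toolkits like LLM-QBench benchmark are considerably more involved.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric, per-tensor round-to-nearest INT8 quantization."""
    scale = np.abs(w).max() / 127.0                      # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(1024, 1024)).astype(np.float32)  # toy weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"scale={scale:.6f}  relative reconstruction error={rel_err:.4%}")
```
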
- Chat Arena: compare anonymous models side-by-side and vote for the better one. An open-source LLM "anonymous" arena: you act as the judge, score two model responses without knowing their identities, and the models' true identities are revealed only after you vote. Participants include Vicuna, Koala, OpenAssistant (oasst), Dolly, ChatGLM, StableLM, Alpaca, LLaMA, and more.
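
Arena-style leaderboards turn these pairwise, anonymous votes into a ranking using Elo-style ratings (recent Chatbot Arena releases fit a Bradley-Terry model, but the classic online Elo update below captures the idea; the K-factor of 32 is an arbitrary choice for this sketch):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Apply one pairwise vote: score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two models start at 1000; model A wins one head-to-head vote.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```
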
Platform | Access |
---|---|
ACLUE | [Source] |
AgentBench | [Source] |
AlpacaEval | [Source] |
ANGO | [Source] |
BeHonest | [Source] |
Big Code Models Leaderboard | [Source] |
Chatbot Arena | [Source] |
Chinese Large Model Leaderboard | [Source] |
CLEVA | [Source] |
CompassRank | [Source] |
CompMix | [Source] |
C-Eval | [Source] |
DreamBench++ | [Source] |
FELM | [Source] |
FlagEval | [Source] |
Hallucination Leaderboard | [Source] |
HELM | [Source] |
Huggingface Open LLM Leaderboard | [Source] |
Huggingface LLM Perf Leaderboard | [Source] |
Indico LLM Leaderboard | [Source] |
InfiBench | [Source] |
InterCode | [Source] |
LawBench | [Source] |
LLMEval | [Source] |
LLM Rankings | [Source] |
LLM Use Case Leaderboard | [Source] |
LucyEval | [Source] |
M3CoT | [Source] |
MMLU by Task Leaderboard | [Source] |
MMToM-QA | [Source] |
MathEval | [Source] |
OlympicArena | [Source] |
OpenEval | [Source] |
Open Multilingual LLM Eval | [Source] |
PubMedQA | [Source] |
SafetyBench | [Source] |
SciBench | [Source] |
SciKnowEval | [Source] |
SEED-Bench | [Source] |
SuperBench | [Source] |
SuperCLUE | [Source] |
SuperGLUE | [Source] |
SuperLim | [Source] |
TAT-DQA | [Source] |
TAT-QA | [Source] |
TheoremOne LLM Benchmarking Metrics | [Source] |
Toloka | [Source] |
Toolbench | [Source] |
VisualWebArena | [Source] |
We-Math | [Source] |
WHOOPS! | [Source] |
Provider (link to pricing) | OpenAI | OpenAI | Anthropic | Google | Replicate | DeepSeek | Mistral | Anthropic | Google | Mistral | Cohere | Anthropic | Mistral | Replicate | Mistral | OpenAI | Groq | OpenAI | Mistral | Anthropic | Groq | Anthropic | Anthropic | Microsoft | Microsoft | Mistral |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Model name | GPT-4o | GPT-4 Turbo | Claude 3 Opus | Gemini 1.5 Pro | Llama 3 70B | DeepSeek-V2 | Mixtral 8x22B | Claude 3 Sonnet | Gemini 1.5 Flash | Mistral Large | Command R+ | Claude 3 Haiku | Mistral Small | Llama 3 8B | Mixtral 8x7B | GPT-3.5 Turbo | Llama 3 70B (Groq) | GPT-4 | Mistral Medium | Claude 2.0 | Mixtral 8x7B (Groq) | Claude 2.1 | Claude Instant | Phi-Medium 4k | Phi-3-Small 8k | Mistral 7B | |
Column Last Updated | 5/14/2024 | 5/14/2024 | 5/14/2024 | 5/14/2024 | 5/20/2024 | 5/20/2024 | 5/20/2024 | 5/14/2024 | 5/14/2024 | 5/20/2024 | 5/20/2024 | 5/14/2024 | 5/20/2024 | 5/21/2024 | 5/14/2024 | 5/14/2024 | 5/14/2024 | 5/14/2024 | 5/20/2024 | 5/14/2024 | 5/14/2024 | 5/14/2024 | 5/14/2024 | 5/21/2024 | 5/21/2024 | 5/22/2024 | |
CAPABILITY | |||||||||||||||||||||||||||
Artificial Analysis Index | 100 | 94 | 94 | 88 | 88 | 82 | 81 | 78 | 76 | 75 | 74 | 72 | 71 | 65 | 65 | 65 | 88 | 83 | 73 | 69 | 65 | 63 | 63 | 39 | |||
LMSys Chatbot Arena ELO | 1310 | 1257 | 1256 | 1249 | 1208 | 1204 | 1158 | 1193 | 1182 | 1154 | 1114 | 1102 | 1208 | 1189 | 1148 | 1126 | 1114 | 1115 | 1104 | 1006 | |||||||
General knowledge: | |||||||||||||||||||||||||||
MMLU | 88.70 | 86.40 | 86.80 | 81.90 | 82.00 | 78.50 | 77.75 | 79.00 | 78.90 | 81.20 | 75.70 | 75.20 | 72.20 | 68.40 | 70.60 | 70.00 | 82.00 | 86.40 | 75.30 | 78.50 | 70.60 | 73.40 | 78.00 | 75.70 | 62.50 | ||
Math: | |||||||||||||||||||||||||||
MATH | 76.60 | 73.40 | 60.10 | 58.50 | 50.40 | 43.10 | 54.90 | 45.00 | 38.90 | 30.00 | 34.10 | 50.40 | 52.90 | ||||||||||||||
MGSM / GSM8K | 90.50 | 88.60 | 95.00 | 93.00 | 92.30 | 88.90 | 79.60 | 57.10 | 93.00 | 92.00 | 91.00 | 89.60 | |||||||||||||||
Reasoning: | |||||||||||||||||||||||||||
GPQA | 53.60 | 49.10 | 50.40 | 41.50 | 39.50 | 40.40 | 39.50 | 33.30 | 34.20 | 28.10 | 39.50 | 35.70 | |||||||||||||||
BIG-BENCH-HARD | 86.80 | 84.00 | 82.90 | 85.50 | 73.70 | 66.60 | 83.10 | ||||||||||||||||||||
DROP, F1 Score | 83.40 | 85.40 | 83.10 | 78.90 | 78.40 | 64.10 | 80.90 | ||||||||||||||||||||
HellaSwag | 95.40 | 89.00 | 89.20 | 85.90 | 86.90 | 86.70 | 85.50 | 95.30 | 88.00 | 86.70 | 82.40 | 77.00 | |||||||||||||||
Code: | |||||||||||||||||||||||||||
HumanEval | 90.20 | 87.60 | 84.90 | 71.90 | 81.70 | 73.00 | 75.90 | 62.20 | 48.10 | 81.70 | 67.00 | 62.20 | 61.00 | ||||||||||||||
Natural2Code | 77.70 | 77.20 | |||||||||||||||||||||||||
Conversational: | |||||||||||||||||||||||||||
MT Bench | 93.20 | 83.00 | 83.90 | 86.10 | 80.60 | 83.00 | 81.80 | 78.50 | 68.40 | ||||||||||||||||||
Benchmark Avg (not useful: selection bias has a significant impact) | 80.50 | 80.53 | 80.31 | 69.25 | 69.32 | 78.50 | 77.75 | 72.33 | 67.20 | 71.80 | 75.70 | 68.78 | 79.55 | 54.88 | 80.10 | 59.72 | 69.32 | 74.16 | 83.13 | 79.55 | 80.10 | 81.80 | 75.95 | 78.40 | 75.83 | 65.45 |
THROUGHPUT | |||||||||||||||||||||||||||
Throughput (median tokens/sec) | 90.40 | 21.10 | 25.60 | 46.20 | 26.30 | 15.60 | 76.50 | 61.30 | 161.70 | 32.30 | 42.20 | 116.70 | 80.70 | 75.40 | 60.00 | 58.60 | 305.30 | 28.30 | 18.30 | 37.20 | 477.10 | 42.10 | 85.70 | 62.20 | |||
Throughput (median seconds per 1K tokens) | 11.06 | 47.39 | 39.06 | 21.65 | 38.02 | 64.10 | 13.07 | 16.31 | 6.18 | 30.96 | 23.70 | 8.57 | 12.39 | 13.26 | 16.67 | 17.06 | 3.28 | 35.34 | 54.64 | 26.88 | 2.10 | 23.75 | 11.67 | 16.08 | |||
COST | |||||||||||||||||||||||||||
Cost Input (1M tokens) aka "context window tokens" | $5.00 | $10.00 | $15.00 | $7.00 | $0.65 | $0.14 | $2.00 | $3.00 | $0.35 | $4.00 | $3.00 | $0.25 | $1.00 | $0.05 | $0.70 | $0.50 | $0.59 | $30.00 | $2.70 | $8.00 | $0.27 | $8.00 | $0.80 | $0.25 | |||
Cost Output (1M tokens) | $15.00 | $30.00 | $75.00 | $21.00 | $2.75 | $0.28 | $6.00 | $15.00 | $0.53 | $12.00 | $15.00 | $1.25 | $3.00 | $0.25 | $0.70 | $1.50 | $0.79 | $60.00 | $8.10 | $24.00 | $0.27 | $24.00 | $2.40 | $0.25 | |||
Cost 1M Input + 1M Output tokens | $20.00 | $40.00 | $90.00 | $28.00 | $3.40 | $0.42 | $8.00 | $18.00 | $0.88 | $16.00 | $18.00 | $1.50 | $4.00 | $0.30 | $1.40 | $2.00 | $1.38 | $90.00 | $10.80 | $32.00 | $0.54 | $32.00 | $3.20 | $0.50 | |||
COST VS PERFORMANCE | |||||||||||||||||||||||||||
Cost 1M+1M IO tokens per AA Index point | $0.20 | $0.43 | $0.96 | $0.32 | $0.04 | $0.01 | $0.10 | $0.23 | $0.01 | $0.21 | $0.24 | $0.02 | $0.06 | $0.00 | $0.02 | $0.03 | $0.02 | $1.08 | $0.15 | $0.46 | $0.01 | $0.51 | $0.05 | $0.01 | |||
Cost 1M+1M IO tokens per Chatbot ELO point | $0.02 | $0.03 | $0.07 | $0.02 | $0.00 | - | - | $0.01 | - | $0.01 | $0.02 | $0.00 | - | $0.00 | $0.00 | $0.00 | $0.00 | $0.08 | $0.01 | $0.03 | $0.00 | $0.03 | $0.00 | $0.00 | |||
Cost 1M+1M IO tokens per Throughput (tokens/sec) | $0.22 | $1.90 | $3.52 | $0.61 | $0.13 | $0.03 | $0.10 | $0.29 | $0.01 | $0.50 | $0.43 | $0.01 | $0.05 | $0.00 | $0.02 | $0.03 | $0.00 | $3.18 | $0.59 | $0.86 | $0.00 | $0.76 | $0.04 | $0.01 | |||
SPECS | |||||||||||||||||||||||||||
Context Window (k) | 128 | 128 | 200 | 1,000 | 8 | 32 | 65 | 200 | 1,000 | 32 | 128 | 200 | 32 | 8 | 32 | 16 | 8 | 32 | 100 | 32 | 200 | 100 | 4 | 8 | 33 | ||
Max Output Tokens (k) | 4 | 4 | 4 | 8 | 4 | 8 | 4 | 4 | 4 | 4 | 4 | ||||||||||||||||
Rate Limit (requests / minute) | tiered | tiered | tiered | 5 | 600 | tiered | 360 | tiered | 600 | tiered | 30 | tiered | tiered | 30 | tiered | tiered | |||||||||||
Rate Limit (requests / day) | tiered | tiered | tiered | 2,000 | tiered | 10,000 | tiered | tiered | 14,400 | tiered | tiered | 14,400 | tiered | tiered | |||||||||||||
Rate Limit (tokens / minute) | tiered | tiered | tiered | 10,000,000 | tiered | 10,000,000 | tiered | tiered | 6,000 | tiered | tiered | 5,000 | tiered | tiered |
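
The derived rows in the table above are simple ratios of the raw rows. Using the GPT-4o column as listed (prices and scores change frequently, so treat these as the table's snapshot, not current values):

```python
# GPT-4o column from the table above (snapshot values).
input_price = 5.00     # $ per 1M input tokens
output_price = 15.00   # $ per 1M output tokens
aa_index = 100         # Artificial Analysis Index
tokens_per_sec = 90.4  # median throughput

blended = input_price + output_price        # "Cost 1M Input + 1M Output tokens"
cost_per_index_point = blended / aa_index   # "Cost 1M+1M IO tokens per AA Index point"
sec_per_1k_tokens = 1000 / tokens_per_sec   # reciprocal of the median throughput

print(f"${blended:.2f} blended, ${cost_per_index_point:.2f} per AA Index point, "
      f"{sec_per_1k_tokens:.2f}s per 1K tokens")
# $20.00 blended, $0.20 per AA Index point, 11.06s per 1K tokens
```
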
- Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators, by Liang Chen, Yang Deng, Yatao Bian, Zeyu Qin, Bingzhe Wu, Tat-Seng Chua and Kam-Fai Wong
- A Closer Look into Automatic Evaluation Using Large Language Models, by Cheng-Han Chiang and Hung-yi Lee
- A Survey on Evaluation of Large Language Models, by Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi et al.
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, by Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu and Chenguang Zhu
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity, by Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji et al.
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver?, by Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga and Diyi Yang
- ChatGPT versus Traditional Question Answering for Knowledge Graphs: Current Status and Future Directions Towards Knowledge Graph Chatbots, by Reham Omar, Omij Mangukiya, Panos Kalnis and Essam Mansour
- Mathematical Capabilities of ChatGPT, by Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier and Julius Berner
- Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization, by Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen and Wei Cheng
- On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective, by Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, Haojun Huang et al.
- ChatGPT is not all you need. A State of the Art Review of large Generative AI models, by Roberto Gozalo-Brizuela and Eduardo C. Garrido-Merchán
- Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT, by Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du and Dacheng Tao
- Evaluation of ChatGPT as a Question Answering System for Answering Complex Questions, by Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Yongrui Chen and Guilin Qi
- ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models, by Ning Bian, Xianpei Han, Le Sun, Hongyu Lin, Yaojie Lu and Ben He
- Holistic Evaluation of Language Models, by Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan et al.
- Evaluating the Text-to-SQL Capabilities of Large Language Models, by Nitarshan Rajkumar, Raymond Li and Dzmitry Bahdanau
- Are Visual-Linguistic Models Commonsense Knowledge Bases?, by Hsiu-Yu Yang and Carina Silberer
- Is GPT-3 a Psychopath? Evaluating Large Language Models from a Psychological Perspective, by Xingxuan Li, Yutong Li, Linlin Liu, Lidong Bing and Shafiq R. Joty
- GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models, by Da Yin, Hritik Bansal, Masoud Monajatipoor, Liunian Harold Li and Kai-Wei Chang
- RobustLR: A Diagnostic Benchmark for Evaluating Logical Robustness of Deductive Reasoners, by Soumya Sanyal, Zeyi Liao and Xiang Ren
- A Systematic Evaluation of Large Language Models of Code, by Frank F. Xu, Uri Alon, Graham Neubig and Vincent J. Hellendoorn
- Evaluating Large Language Models Trained on Code, by Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda et al.
- GLGE: A New General Language Generation Evaluation Benchmark, by Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu et al.
- Evaluating Pre-Trained Models for User Feedback Analysis in Software Engineering: A Study on Classification of App-Reviews, by Mohammad Abdul Hadi and Fatemeh H. Fard
- Do Language Models Perform Generalizable Commonsense Inference?, by Peifeng Wang, Filip Ilievski, Muhao Chen and Xiang Ren
- RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms, by Pei Zhou, Rahul Khanna, Seyeon Lee, Bill Yuchen Lin, Daniel Ho, Jay Pujara and Xiang Ren
- Evaluation of Text Generation: A Survey, by Asli Celikyilmaz, Elizabeth Clark and Jianfeng Gao
- Neural Language Generation: Formulation, Methods, and Evaluation, by Cristina Garbacea and Qiaozhu Mei
- BERTScore: Evaluating Text Generation with BERT, by Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger and Yoav Artzi
Model | Parameters | Layers | Attention heads | Dimension | Learning rate | Batch size | Train tokens |
---|---|---|---|---|---|---|---|
LLaMA | 6.7B | 32 | 32 | 4096 | 3.00E-04 | 4M | 1.0T |
LLaMA | 13.0B | 40 | 40 | 5120 | 3.00E-04 | 4M | 1.0T |
LLaMA | 32.5B | 60 | 52 | 6656 | 1.50E-04 | 4M | 1.4T |
LLaMA | 65.2B | 80 | 64 | 8192 | 1.50E-04 | 4M | 1.4T |
nano-GPT | 85,584 | 3 | 3 | 768 | 3.00E-04 | | |
GPT2-small | 0.12B | 12 | 12 | 768 | 2.50E-04 | | |
GPT2-XL | 1.5B | 48 | 25 | 1600 | 1.50E-04 | | |
GPT3 | 175B | 96 | 96 | 12288 | 1.50E-04 | | 0.5T |
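
The layer count, head count, and hidden dimension above roughly determine model size: for a standard decoder-only Transformer, the non-embedding parameters come to about 12 × layers × dimension², and the per-head dimension is dimension ÷ heads. A quick sanity check against two rows of the table (a rule of thumb only; architectures with SwiGLU MLPs or grouped heads deviate from it):

```python
def approx_params(n_layers: int, d_model: int) -> float:
    """Rough non-embedding parameter count for a decoder-only Transformer:
    ~4*d^2 for attention plus ~8*d^2 for the MLP, per layer."""
    return 12 * n_layers * d_model ** 2

for name, layers, heads, d in [("6.7B", 32, 32, 4096), ("65.2B", 80, 64, 8192)]:
    print(f"{name}: ~{approx_params(layers, d) / 1e9:.1f}B params, head_dim={d // heads}")
# 6.7B: ~6.4B params, head_dim=128
# 65.2B: ~64.4B params, head_dim=128
```
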
Model | Size | Architecture | Access | Date | Origin |
---|---|---|---|---|---|
Switch Transformer | 1.6T | Decoder(MOE) | - | 2021-01 | Paper |
GLaM | 1.2T | Decoder(MOE) | - | 2021-12 | Paper |
PaLM | 540B | Decoder | - | 2022-04 | Paper |
MT-NLG | 530B | Decoder | - | 2022-01 | Paper |
J1-Jumbo | 178B | Decoder | api | 2021-08 | Paper |
OPT | 175B | Decoder | api|ckpt | 2022-05 | Paper
BLOOM | 176B | Decoder | api|ckpt | 2022-11 | Paper
GPT 3.0 | 175B | Decoder | api | 2020-05 | Paper |
LaMDA | 137B | Decoder | - | 2022-01 | Paper |
GLM | 130B | Decoder | ckpt | 2022-10 | Paper |
YaLM | 100B | Decoder | ckpt | 2022-06 | Blog |
LLaMA | 65B | Decoder | ckpt | 2022-09 | Paper |
GPT-NeoX | 20B | Decoder | ckpt | 2022-04 | Paper |
UL2 | 20B | agnostic | ckpt | 2022-05 | Paper |
T5 | 11B | Encoder-Decoder | ckpt | 2019-10 | Paper |
CPM-Bee | 10B | Decoder | api | 2022-10 | Paper |
rwkv-4 | 7B | RWKV | ckpt | 2022-09 | Github |
GPT-J | 6B | Decoder | ckpt | 2022-09 | Github |
GPT-Neo | 2.7B | Decoder | ckpt | 2021-03 | Github |
GPT-Neo | 1.3B | Decoder | ckpt | 2021-03 | Github |
Model | Size | Architecture | Access | Date | Origin |
---|---|---|---|---|---|
Flan-PaLM | 540B | Decoder | - | 2022-10 | Paper |
BLOOMZ | 176B | Decoder | ckpt | 2022-11 | Paper |
InstructGPT | 175B | Decoder | api | 2022-03 | Paper |
Galactica | 120B | Decoder | ckpt | 2022-11 | Paper |
OpenChatKit | 20B | - | ckpt | 2023-03 | -
Flan-UL2 | 20B | Decoder | ckpt | 2023-03 | Blog |
Gopher | - | - | - | - | - |
Chinchilla | - | - | - | - | - |
Flan-T5 | 11B | Encoder-Decoder | ckpt | 2022-10 | Paper |
T0 | 11B | Encoder-Decoder | ckpt | 2021-10 | Paper |
Alpaca | 7B | Decoder | demo | 2023-03 | Github |
Model | Size | Architecture | Access | Date | Origin |
---|---|---|---|---|---|
GPT 4 | - | - | - | 2023-03 | Blog |
ChatGPT | - | Decoder | demo|api | 2022-11 | Blog |
Sparrow | 70B | - | - | 2022-09 | Paper |
Claude | - | - | demo|api | 2023-03 | Blog |
- LLaMA - A foundational, 65-billion-parameter large language model. LLaMA.cpp Lit-LLaMA
  - Alpaca - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca.cpp Alpaca-LoRA
  - Flan-Alpaca - Instruction Tuning from Humans and Machines.
  - Baize - Baize is an open-source chat model trained with LoRA. It uses 100k dialogs generated by letting ChatGPT chat with itself.
  - Cabrita - A Portuguese-finetuned instruction-following LLaMA.
  - Vicuna - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
  - Llama-X - Open Academic Research on Improving LLaMA to SOTA LLM.
  - Chinese-Vicuna - A Chinese Instruction-following LLaMA-based Model.
  - GPTQ-for-LLaMA - 4-bit quantization of LLaMA using GPTQ.
  - GPT4All - Demo, data, and code to train an open-source assistant-style large language model based on GPT-J and LLaMA.
  - Koala - A Dialogue Model for Academic Research.
  - BELLE - Be Everyone's Large Language model Engine.
  - StackLLaMA - A hands-on guide to training LLaMA with RLHF.
  - RedPajama - An Open Source Recipe to Reproduce the LLaMA training dataset.
  - Chimera - Latin Phoenix.
- BLOOM - BigScience Large Open-science Open-access Multilingual Language Model. BLOOM-LoRA
  - BLOOMZ&mT0 - A family of models capable of following human instructions in dozens of languages zero-shot.
  - Phoenix
- T5 - Text-to-Text Transfer Transformer.
  - T0 - Multitask Prompted Training Enables Zero-Shot Task Generalization.
- OPT - Open Pre-trained Transformer Language Models.
- UL2 - A unified framework for pretraining models that are universally effective across datasets and setups.
- GLM - GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks.
- RWKV - Parallelizable RNN with Transformer-level LLM Performance.
  - ChatRWKV - ChatRWKV is like ChatGPT but powered by the RWKV (100% RNN) language model.
- StableLM - Stability AI Language Models.
- YaLM - A GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.
- GPT-Neo - An implementation of model & data parallel GPT3-like models using the mesh-tensorflow library.
- GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile.
  - Dolly - A cheap-to-build LLM that exhibits a surprising degree of the instruction-following capabilities of ChatGPT.
- Pythia - Interpreting Autoregressive Transformers Across Time and Scale.
  - Dolly 2.0 - The first open-source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
- OpenFlamingo - An open-source reproduction of DeepMind's Flamingo model.
- Cerebras-GPT - A Family of Open, Compute-efficient, Large Language Models.
- GALACTICA - The GALACTICA models are trained on a large-scale scientific corpus.
  - GALPACA - GALACTICA 30B fine-tuned on the Alpaca dataset.
- Palmyra - Palmyra Base was primarily pre-trained with English text.
- Camel - A state-of-the-art instruction-following large language model designed to deliver exceptional performance and versatility.
- PanGu-α - PanGu-α is a 200B-parameter autoregressive pretrained Chinese language model developed by Huawei Noah's Ark Lab, the MindSpore Team, and Peng Cheng Laboratory.
- Open-Assistant - A project meant to give everyone access to a great chat-based large language model.
- HuggingChat - Powered by Open Assistant's latest model, the best open-source chat model right now, and the @huggingface Inference API.
- Baichuan - An open-source, commercially available large-scale language model developed by Baichuan Intelligent Technology following Baichuan-7B, containing 13 billion parameters. (20230715)
- Qwen - Qwen-7B is the 7B-parameter version of the large language model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large volume of data, including web texts, books, code, etc. (20230803)
Model | #Reference | #Link | #Parameter | Base Model | #Layer | #Encoder | #Decoder | #Pretrain Tokens | #IFT Sample | RLHF |
---|---|---|---|---|---|---|---|---|---|---|
GPT3-Ada | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 0.35B | - | 24 | - | 24 | - | - | - |
Pythia-1B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-1b | 1B | - | 16 | - | 16 | 300B tokens | - | - |
GPT3-Babbage | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 1.3B | - | 24 | - | 24 | - | - | - |
GPT2-XL | radford2019language | https://huggingface.co/gpt2-xl | 1.5B | - | 48 | - | 48 | 40B tokens | - | - |
BLOOM-1b7 | scao2022bloom | https://huggingface.co/bigscience/bloom-1b7 | 1.7B | - | 24 | - | 24 | 350B tokens | - | - |
BLOOMZ-1b7 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-1b7 | 1.7B | BLOOM-1b7 | 24 | - | 24 | - | 8.39B tokens | - |
Dolly-v2-3b | 2023dolly | https://huggingface.co/databricks/dolly-v2-3b | 2.8B | Pythia-2.8B | 32 | - | 32 | - | 15K | - |
Pythia-2.8B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-2.8b | 2.8B | - | 32 | - | 32 | 300B tokens | - | - |
BLOOM-3b | scao2022bloom | https://huggingface.co/bigscience/bloom-3b | 3B | - | 30 | - | 30 | 350B tokens | - | - |
BLOOMZ-3b | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-3b | 3B | BLOOM-3b | 30 | - | 30 | - | 8.39B tokens | - |
StableLM-Base-Alpha-3B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-base-alpha-3b | 3B | - | 16 | - | 16 | 800B tokens | - | - |
StableLM-Tuned-Alpha-3B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b | 3B | StableLM-Base-Alpha-3B | 16 | - | 16 | - | 632K | - |
ChatGLM-6B | zeng2023glm-130b,du2022glm | https://huggingface.co/THUDM/chatglm-6b | 6B | - | 28 | 28 | 28 | 1T tokens | ✓ | ✓
DoctorGLM | xiong2023doctorglm | https://github.com/xionghonglin/DoctorGLM | 6B | ChatGLM-6B | 28 | 28 | 28 | - | 6.38M | - |
ChatGLM-Med | ChatGLM-Med | https://github.com/SCIR-HI/Med-ChatGLM | 6B | ChatGLM-6B | 28 | 28 | 28 | - | 8K | - |
GPT3-Curie | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 6.7B | - | 32 | - | 32 | - | - | - |
MPT-7B-Chat | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-chat | 6.7B | MPT-7B | 32 | - | 32 | - | 360K | - |
MPT-7B-Instruct | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-instruct | 6.7B | MPT-7B | 32 | - | 32 | - | 59.3K | - |
MPT-7B-StoryWriter-65k+ | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-storywriter | 6.7B | MPT-7B | 32 | - | 32 | - | ✓ | -
Dolly-v2-7b | 2023dolly | https://huggingface.co/databricks/dolly-v2-7b | 6.9B | Pythia-6.9B | 32 | - | 32 | - | 15K | - |
h2ogpt-oig-oasst1-512-6.9b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oig-oasst1-512-6.9b | 6.9B | Pythia-6.9B | 32 | - | 32 | - | 398K | - |
Pythia-6.9B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-6.9b | 6.9B | - | 32 | - | 32 | 300B tokens | - | - |
Alpaca-7B | alpaca | https://huggingface.co/tatsu-lab/alpaca-7b-wdiff | 7B | LLaMA-7B | 32 | - | 32 | - | 52K | - |
Alpaca-LoRA-7B | 2023alpacalora | https://huggingface.co/tloen/alpaca-lora-7b | 7B | LLaMA-7B | 32 | - | 32 | - | 52K | - |
Baize-7B | xu2023baize | https://huggingface.co/project-baize/baize-lora-7B | 7B | LLaMA-7B | 32 | - | 32 | - | 263K | - |
Baize Healthcare-7B | xu2023baize | https://huggingface.co/project-baize/baize-healthcare-lora-7B | 7B | LLaMA-7B | 32 | - | 32 | - | 201K | - |
ChatDoctor | yunxiang2023chatdoctor | https://github.com/Kent0n-Li/ChatDoctor | 7B | LLaMA-7B | 32 | - | 32 | - | 167K | - |
HuaTuo | wang2023huatuo | https://github.com/scir-hi/huatuo-llama-med-chinese | 7B | LLaMA-7B | 32 | - | 32 | - | 8K | - |
Koala-7B | koala_blogpost_2023 | https://huggingface.co/young-geng/koala | 7B | LLaMA-7B | 32 | - | 32 | - | 472K | - |
LLaMA-7B | touvron2023llama | https://huggingface.co/decapoda-research/llama-7b-hf | 7B | - | 32 | - | 32 | 1T tokens | - | - |
Luotuo-lora-7b-0.3 | luotuo | https://huggingface.co/silk-road/luotuo-lora-7b-0.3 | 7B | LLaMA-7B | 32 | - | 32 | - | 152K | - |
StableLM-Base-Alpha-7B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-base-alpha-7b | 7B | - | 16 | - | 16 | 800B tokens | - | - |
StableLM-Tuned-Alpha-7B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b | 7B | StableLM-Base-Alpha-7B | 16 | - | 16 | - | 632K | - |
Vicuna-7b-delta-v1.1 | vicuna2023 | https://github.com/lm-sys/FastChat#vicuna-weights | 7B | LLaMA-7B | 32 | - | 32 | - | 70K | -
BELLE-7B-0.2M /0.6M /1M /2M | belle2023exploring | https://huggingface.co/BelleGroup/BELLE-7B-2M | 7.1B | Bloomz-7b1-mt | 30 | - | 30 | - | 0.2M/0.6M/1M/2M | - |
BLOOM-7b1 | scao2022bloom | https://huggingface.co/bigscience/bloom-7b1 | 7.1B | - | 30 | - | 30 | 350B tokens | - | - |
BLOOMZ-7b1 /mt /p3 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-7b1-p3 | 7.1B | BLOOM-7b1 | 30 | - | 30 | - | 4.19B tokens | - |
Dolly-v2-12b | 2023dolly | https://huggingface.co/databricks/dolly-v2-12b | 12B | Pythia-12B | 36 | - | 36 | - | 15K | - |
h2ogpt-oasst1-512-12b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oasst1-512-12b | 12B | Pythia-12B | 36 | - | 36 | - | 94.6K | - |
Open-Assistant-SFT-4-12B | 2023openassistant | https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 | 12B | Pythia-12B-deduped | 36 | - | 36 | - | 161K | - |
Pythia-12B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-12b | 12B | - | 36 | - | 36 | 300B tokens | - | - |
Baize-13B | xu2023baize | https://huggingface.co/project-baize/baize-lora-13B | 13B | LLaMA-13B | 40 | - | 40 | - | 263K | - |
Koala-13B | koala_blogpost_2023 | https://huggingface.co/young-geng/koala | 13B | LLaMA-13B | 40 | - | 40 | - | 472K | - |
LLaMA-13B | touvron2023llama | https://huggingface.co/decapoda-research/llama-13b-hf | 13B | - | 40 | - | 40 | 1T tokens | - | - |
StableVicuna-13B | 2023StableLM | https://huggingface.co/CarperAI/stable-vicuna-13b-delta | 13B | Vicuna-13B v0 | 40 | - | 40 | - | 613K | ✓
Vicuna-13b-delta-v1.1 | vicuna2023 | https://github.com/lm-sys/FastChat#vicuna-weights | 13B | LLaMA-13B | 40 | - | 40 | - | 70K | -
moss-moon-003-sft | 2023moss | https://huggingface.co/fnlp/moss-moon-003-sft | 16B | moss-moon-003-base | 34 | - | 34 | - | 1.1M | - |
moss-moon-003-sft-plugin | 2023moss | https://huggingface.co/fnlp/moss-moon-003-sft-plugin | 16B | moss-moon-003-base | 34 | - | 34 | - | 1.4M | - |
GPT-NeoX-20B | gptneox | https://huggingface.co/EleutherAI/gpt-neox-20b | 20B | - | 44 | - | 44 | 825GB | - | - |
h2ogpt-oasst1-512-20b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oasst1-512-20b | 20B | GPT-NeoX-20B | 44 | - | 44 | - | 94.6K | - |
Baize-30B | xu2023baize | https://huggingface.co/project-baize/baize-lora-30B | 33B | LLaMA-30B | 60 | - | 60 | - | 263K | - |
LLaMA-30B | touvron2023llama | https://huggingface.co/decapoda-research/llama-30b-hf | 33B | - | 60 | - | 60 | 1.4T tokens | - | - |
LLaMA-65B | touvron2023llama | https://huggingface.co/decapoda-research/llama-65b-hf | 65B | - | 80 | - | 80 | 1.4T tokens | - | - |
GPT3-Davinci | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 175B | - | 96 | - | 96 | 300B tokens | - | - |
BLOOM | scao2022bloom | https://huggingface.co/bigscience/bloom | 176B | - | 70 | - | 70 | 366B tokens | - | - |
BLOOMZ /mt /p3 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-p3 | 176B | BLOOM | 70 | - | 70 | - | 2.09B tokens | - |
ChatGPT (2023.05.01) | openaichatgpt | https://platform.openai.com/docs/models/gpt-3-5 | - | GPT-3.5 | - | - | - | - | ✓ | ✓
GPT-4 (2023.05.01) | openai2023gpt4 | https://platform.openai.com/docs/models/gpt-4 | - | - | - | - | - | - | ✓ | ✓
- Accelerate - 🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision.
- Apache MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.
- Caffe - A fast open framework for deep learning.
- ColossalAI - An integrated large-scale model training system with efficient parallelization techniques.
- DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
- Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
- Jax - Autograd and XLA for high-performance machine learning research.
- Kedro - Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code.
- Keras - Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow.
- LightGBM - A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
- MegEngine - MegEngine is a fast, scalable and easy-to-use deep learning framework, with auto-differentiation.
- metric-learn - Metric Learning Algorithms in Python.
- MindSpore - MindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.
- Oneflow - OneFlow is a performance-centered and open-source deep learning framework.
- PaddlePaddle - Machine Learning Framework from Industrial Practice.
- PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration.
- PyTorch Lightning - Deep learning framework to train, deploy, and ship AI products Lightning fast.
- XGBoost - Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library.
- scikit-learn - Machine Learning in Python.
- TensorFlow - An Open Source Machine Learning Framework for Everyone.
- VectorFlow - A minimalist neural network library optimized for sparse data and single machine environments.
Name | Description |
---|---|
Byzer-LLM | Byzer-LLM is a comprehensive large model infrastructure that supports capabilities related to large models, such as pre-training, fine-tuning, deployment, and serving. Byzer-Retrieval is a storage infrastructure specifically developed for large models, supporting batch import of various data sources, real-time single-item updates, and full-text, vector, and hybrid searches to facilitate data usage for Byzer-LLM. Byzer-SQL/Python offers user-friendly interactive APIs with a low barrier to entry for utilizing the aforementioned products. | |
agenta | An LLMOps platform for building powerful LLM applications. It allows for easy experimentation and evaluation of different prompts, models, and workflows to construct robust applications. | |
Arize-Phoenix | ML observability for LLMs, vision, language, and tabular models. | |
BudgetML | Deploy ML inference services on a limited budget with less than 10 lines of code. | |
CometLLM | An open-source LLMOps platform for logging, managing, and visualizing LLM prompts and chains. It tracks prompt templates, variables, duration, token usage, and other metadata. It also scores prompt outputs and visualizes chat history in a single UI. | |
deeplake | Stream large multimodal datasets to achieve near 100% GPU utilization. Query, visualize, and version control data. Access data without recalculating embeddings for model fine-tuning. | |
Dify | An open-source framework that enables developers (even non-developers) to quickly build useful applications based on large language models, ensuring they are visible, actionable, and improvable. | |
Dstack | Cost-effective LLM development in any cloud (AWS, GCP, Azure, Lambda, etc.). | |
Embedchain | A framework for creating ChatGPT-like robots on datasets. | |
GPTCache | Create semantic caches to store responses to LLM queries. | |
Haystack | Quickly build applications with LLM agents, semantic search, question answering, and more. | |
langchain | Build LLM applications through composability. | |
LangFlow | A hassle-free way to experiment with and prototype LangChain processes using drag-and-drop components and a chat interface. | |
LangKit | A ready-to-use LLM telemetry collection library that extracts profiles of LLM performance over time, as well as prompts, responses, and metadata, to identify issues at scale. | |
LiteLLM 🚅 | A simple and lightweight 100-line package for standardizing LLM API calls across OpenAI, Azure, Cohere, Anthropic, Replicate, and other API endpoints. | |
LlamaIndex | Provides a central interface to connect your LLMs with external data. | |
LLMApp | LLM App is a Python library that helps you build real-time LLM-enabled data pipelines with just a few lines of code. | |
LLMFlows | LLMFlows is a framework for building simple, clear, and transparent LLM applications, such as chatbots, question-answering systems, and agents. | |
LLMonitor | Observability and monitoring for AI applications and agents. Debug agents with robust tracking and logging. Use analytical tools to delve into request history. Developer-friendly modules that can be easily integrated into LangChain. | |
magentic | Seamlessly integrate LLMs as Python functions. Use type annotations to specify structured outputs. Combine LLM queries and function calls with regular Python code to create complex LLM-driven functionalities. | |
Pezzo 🕹️ | Pezzo is an open-source LLMOps platform built for developers and teams. With just two lines of code, you can easily troubleshoot AI operations, collaborate on and manage your prompts, and deploy changes instantly from one place. | |
promptfoo | An open-source tool for testing and evaluating prompt quality. Create test cases, automatically check output quality, and catch regressions to reduce evaluation costs. | |
prompttools | An open-source tool for testing and trying out prompts. The core idea is to enable developers to evaluate prompts using familiar interfaces such as code and notebooks. With just a few lines of code, you can test prompts and parameters across different models (whether you're using OpenAI, Anthropic, or LLaMA models). You can even evaluate the accuracy of vector database retrievals. | |
TrueFoundry | Deploy LLMOps tools on your own Kubernetes (EKS, AKS, GKE, On-prem) infrastructure, including Vector DBs, embedded servers, etc. This includes open-source LLM models for deployment, fine-tuning, prompt tracking, and providing complete data security and optimal GPU management. Use best software engineering practices to train and launch your LLM applications at production scale. (No GitHub link) |
ReliableGPT 💪 | Handle OpenAI errors for your production LLM applications (overloaded OpenAI servers, rotated keys, or context window errors). | |
Weights & Biases (Prompts) | A set of LLMOps tools in the developer-focused W&B MLOps platform. Use W&B Prompts to visualize and inspect LLM execution flows, track inputs and outputs, view intermediate results, and manage prompts and LLM chain configurations. (No GitHub link) |
xTuring | Build and control your personal LLMs using fast and efficient fine-tuning. | |
ZenML | An open-source framework for orchestrating, experimenting, and deploying production-grade ML solutions, with built-in langchain and llama_index integration. |
- Large Language Model Course - A course with a roadmap and notebooks to get into Large Language Models (LLMs).
- Full Stack LLM Bootcamp - A collection of LLM-related learning and application resources.
- Awesome LLM - A curated list of papers about large language models.
- Awesome-Efficient-LLM - A curated list for Efficient Large Language Models.
- Awesome-production-machine-learning - A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning.
- Awesome-marketing-datascience - Curated list of useful LLM / Analytics / Datascience resources.
- Awesome-llm-tools - Curated list of useful LLM tools.
- Awesome-LLM-Compression - A curated list for Efficient LLM Compression.
- Awesome-Multimodal-Large-Language-Models - A curated list of Multimodal Large Language Models.
- Awesome-LLMOps - An awesome & curated list of the best LLMOps tools for developers.
- Awesome-MLops - An awesome list of references for MLOps - Machine Learning Operations.
- Awesome ChatGPT Prompts - A collection of prompt examples to be used with the ChatGPT model.
- awesome-chatgpt-prompts-zh - A Chinese collection of prompt examples to be used with the ChatGPT model.
- Awesome ChatGPT - Curated list of resources for ChatGPT and GPT-3 from OpenAI.
- Chain-of-Thoughts Papers - A trend starting from "Chain of Thought Prompting Elicits Reasoning in Large Language Models".
- Instruction-Tuning-Papers - A trend starting from Natural-Instruction (ACL 2022), FLAN (ICLR 2022), and T0 (ICLR 2022).
- LLM Reading List - A paper & resource list of large language models.
- Reasoning using Language Models - Collection of papers and resources on Reasoning using Language Models.
- Chain-of-Thought Hub - Measuring LLMs' Reasoning Performance
- Awesome GPT - A curated list of awesome projects and resources related to GPT, ChatGPT, OpenAI, LLM, and more.
- Awesome GPT-3 - a collection of demos and articles about the OpenAI GPT-3 API.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.