This is a curated list of ChatGPT-related papers, organized into the sections below. Any feedback is welcome.
- Survey paper
- Instruction tuning
- Reinforcement learning from human feedback
- Reinforcement learning with verifiable rewards
- Reinforcement learning without verifiable rewards
- Evaluation
- Large Language Model
- External tools
- Agent
- MoE/Routing
- Technical report of open/proprietary model
- Misc.

## Instruction tuning
- Finetuned Language Models Are Zero-Shot Learners
- Scaling Instruction-Finetuned Language Models
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
- Self-Instruct: Aligning Language Models with Self-Generated Instructions [github]
- Stanford Alpaca: An Instruction-following LLaMA Model [github]
- Dolly: Democratizing the magic of ChatGPT with open models [blog] [blog]
- Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality [github] [website]
- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions [github]
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
- LIMA: Less Is More for Alignment
- Enhancing Chat Language Models by Scaling High-quality Instructional Conversations [github]
- How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources [github]
- Faith and Fate: Limits of Transformers on Compositionality
- SAIL: Search-Augmented Instruction Learning
- The False Promise of Imitating Proprietary LLMs
- Instruction Mining: High-Quality Instruction Data Selection for Large Language Models
- SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF (EMNLP2023 Findings)
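
Most of the instruction-tuning papers above share the same basic recipe: supervised fine-tuning (SFT) on (instruction, response) pairs, with the loss computed only over the response tokens. Below is a minimal PyTorch sketch of that loss masking; the shapes and the toy inputs are illustrative, not taken from any specific paper.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy over response tokens only, Alpaca-style SFT.

    logits:     (seq_len, vocab_size) model outputs for one example
    input_ids:  (seq_len,) prompt tokens followed by response tokens
    prompt_len: number of prompt tokens to exclude from the loss
    """
    # Shift so position t predicts token t+1.
    shift_logits = logits[:-1]
    shift_labels = input_ids[1:].clone()
    # Mask prompt positions; -100 is ignored by cross_entropy.
    shift_labels[: prompt_len - 1] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

# Toy usage with random tensors standing in for a real model.
vocab, seq = 100, 12
logits = torch.randn(seq, vocab)
ids = torch.randint(0, vocab, (seq,))
print(sft_loss(logits, ids, prompt_len=5))
```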

## Reinforcement learning from human feedback
- Fine-Tuning Language Models from Human Preferences [github] [blog]
- Training language models to follow instructions with human feedback [github] [blog]
- WebGPT: Browser-assisted question-answering with human feedback [blog]
- Improving alignment of dialogue agents via targeted human judgements
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- OpenAssistant Conversations -- Democratizing Large Language Model Alignment [github]
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
- Preference Ranking Optimization for Human Alignment
- Training Language Models with Language Feedback (ACL2022 WS)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
- HybridFlow: A Flexible and Efficient RLHF Framework
- A General Theoretical Paradigm to Understand Learning from Human Preferences
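
DPO (above) replaces the RL step of RLHF with a classification-style loss over preference pairs. A minimal sketch of the DPO objective, assuming per-sequence log-probabilities have already been computed for the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor of summed log-probs log pi(y|x) for a
    batch of preference pairs; beta sets the implicit KL penalty.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probabilities.
t = torch.tensor
print(dpo_loss(t([-5.0]), t([-9.0]), t([-6.0]), t([-8.0])))
```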

## Reinforcement learning with verifiable rewards
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Group Sequence Policy Optimization
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
- Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data
- Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
- Magistral
- R-Zero: Self-Evolving Reasoning LLM from Zero Data
- Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
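
Several entries in this section (DeepSeekMath, DeepSeek-R1, GSPO, and their descendants) build on GRPO, which drops the value network and instead normalizes verifiable rewards within a group of samples for the same prompt. A sketch of the group-relative advantage computation under the common formulation; the group size and rewards below are illustrative:

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as in GRPO (DeepSeekMath).

    rewards: (group_size,) verifiable rewards (e.g. 1.0 if the final
             answer checks out, else 0.0) for G rollouts of one prompt.
    """
    mean, std = rewards.mean(), rewards.std()
    # eps guards against zero variance when all rollouts tie.
    return (rewards - mean) / (std + eps)

# 8 rollouts for one math problem: 3 correct, 5 wrong.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # correct rollouts get positive advantage
```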

## Reinforcement learning without verifiable rewards
- Reinforcing General Reasoning without Verifiers
- Learning to Reason without External Rewards
- Can Large Reasoning Models Self-Train?
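
The verifier-free papers above replace ground-truth checkers with signals derived from the model itself, for example agreement with a majority vote over its own samples. A minimal sketch of that idea; the exact reward designs (confidence, self-certainty, etc.) differ per paper:

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Self-consistency pseudo-reward: 1.0 for samples that agree with
    the most common answer, 0.0 otherwise. A stand-in for the
    paper-specific verifier-free signals, which differ in detail."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

print(majority_vote_rewards(["42", "42", "17", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```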

## Evaluation
- How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver?
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents
- Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation
- Is ChatGPT a Good Recommender? A Preliminary Study
- Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness
- Semantic Compression With Large Language Models
- Human-like Summarization Evaluation with ChatGPT
- Sentence Simplification via Large Language Models
- Capabilities of GPT-4 on Medical Challenge Problems
- Do Multilingual Language Models Think Better in English?
- ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark
- ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks
- Open-Source Large Language Models Outperform Crowd Workers and Approach ChatGPT in Text-Annotation Tasks
- Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks
- Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks
- Is GPT-3 a Good Data Annotator? (ACL2023)
- Measuring Massive Multitask Language Understanding (ICLR2021)
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
- Are We Done with MMLU? (NAACL2025)
- Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation (ACL2025)
- VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models
- CMMLU: Measuring massive multitask language understanding in Chinese
- HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models
- KMMLU: Measuring Massive Multitask Language Understanding in Korean (NAACL2025)
- From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation
- Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU (EMNLP2023)
- Typhoon: Thai Large Language Models
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic (ACL2024 Findings)
- Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?
- IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
- TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish (EMNLP2024 Findings)
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
- Holistic Evaluation of Language Models
- AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models (NAACL2025)
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark
- SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
- LiveBench: A Challenging, Contamination-Limited LLM Benchmark
- Measuring short-form factuality in large language models
- Humanity's Last Exam
- MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
- Evaluating Large Language Models Trained on Code
- HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization (LREC-COLING2024)
- Program Synthesis with Large Language Models
- MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (ICLR2024)
- Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
- OJBench: A Competition Level Code Benchmark For Large Language Models
- Measuring Mathematical Problem Solving With the MATH Dataset
- Training Verifiers to Solve Math Word Problems
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
- MathArena: Evaluating LLMs on Uncontaminated Math Competitions
- Instruction-Following Evaluation for Large Language Models
- Can Large Language Models Understand Real-World Complex Instructions?
- FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models (ACL2024)
- Generalizing Verifiable Instruction Following
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models (ACL2024)
- Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following
- MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation (ACL2025)
- RULER: What's the Real Context Size of Your Long-Context Language Models? (COLM2024)
- LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks (ACL2025)
- One ruler to measure them all: Benchmarking multilingual long-context language models
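
For the code benchmarks above (HumanEval and its descendants), results are usually reported as pass@k, estimated with the unbiased formula from "Evaluating Large Language Models Trained on Code": generate n samples per problem, count the c that pass the tests, and compute 1 - C(n-c, k)/C(n, k) in a numerically stable product form:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated per problem
    c: samples that passed the unit tests
    k: evaluation budget being scored
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), expanded as a stable running product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=10, k=1))   # 0.05
print(pass_at_k(n=200, c=10, k=10))  # well above 0.05
```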

## External tools
- Toolformer: Language Models Can Teach Themselves to Use Tools
- Large Language Models as Tool Makers
- CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
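
The tool-use papers above all reduce to the same control loop: the model emits a structured tool call, the runtime executes it, and the observation is appended to the context for the next step. A minimal, framework-free sketch; the tool registry and the JSON call format are illustrative, not any specific paper's API:

```python
import json

# Hypothetical tool registry; real systems (e.g. ToolLLM) expose
# thousands of APIs, each with a schema.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def run_tool_call(model_output):
    """Parse a JSON tool call emitted by the model and execute it."""
    call = json.loads(model_output)
    result = TOOLS[call["tool"]](call["input"])
    # This string would be appended to the prompt before the next step.
    return f"Observation: {result}"

print(run_tool_call('{"tool": "calculator", "input": "(3 + 4) * 2"}'))
```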

## Agent
- A Survey on Large Language Model based Autonomous Agents
- The Rise and Potential of Large Language Model Based Agents: A Survey
- Large Language Model Agent: A Survey on Methodology, Applications and Challenges
- A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
- A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
- Survey on Evaluation of LLM-based Agents
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents

## MoE/Routing
- Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models
- Mixtral of Experts
- Knowledge Fusion of Large Language Models (ICLR2024)
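
Mixtral (above) routes each token through 2 of 8 feed-forward experts, weighting their outputs by a softmax over the selected gate logits. A minimal sketch of that top-k routing step, with toy dimensions in place of real model sizes:

```python
import torch
import torch.nn.functional as F

def top2_moe(x, gate, experts, k=2):
    """Sparse MoE layer in the style of Mixtral: per token, pick the
    top-k experts and mix their outputs with renormalized gate weights.

    x:       (tokens, d_model)
    gate:    linear map from d_model to n_experts
    experts: list of n_experts feed-forward modules
    """
    logits = gate(x)                              # (T, E)
    weights, idx = torch.topk(logits, k, dim=-1)  # (T, k)
    weights = F.softmax(weights, dim=-1)          # softmax over top-k only
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out

# Toy usage: 4 tokens, d_model=8, 4 experts.
d, E = 8, 4
gate = torch.nn.Linear(d, E)
experts = [torch.nn.Linear(d, d) for _ in range(E)]
print(top2_moe(torch.randn(4, d), gate, experts).shape)  # torch.Size([4, 8])
```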

## Technical report of open/proprietary model
- LLaMA: Open and Efficient Foundation Language Models
- Llama 2: Open Foundation and Fine-Tuned Chat Models
- The Llama 3 Herd of Models
- Qwen Technical Report
- Qwen2.5 Technical Report
- Nemotron-4 15B Technical Report
- Nemotron-4 340B Technical Report
- PaLM 2 Technical Report
- Kimi k1.5: Scaling Reinforcement Learning with LLMs
- Kimi K2: Open Agentic Intelligence
- Hunyuan-A13B Technical Report
- ERNIE 4.5 Technical Report
- Estimating Worst-Case Frontier Risks of Open-Weight LLMs
- GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models