UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
DeepTeam is a framework to red team LLMs and LLM systems.
Decrypted Generative Model safety files for Apple Intelligence containing filters
[NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts.
Attack to induce hallucinations in LLMs
Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"
Restore safety in fine-tuned language models through task arithmetic
DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
NeurIPS'24 - LLM Safety Landscape
Code and dataset for the paper: "Can Editing LLMs Inject Harm?"
Source code of "Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers"
Official code for the paper "SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning"
[COLM 2025] LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning
Source code for "Opt-Out: Investigating Entity-Level Unlearning for Large Language Models via Optimal Transport", ACL 2025
Multi-agent debate framework for enhancing LLM safety through red-teaming prompts, feedback-driven learning, long-term memory, and diverse structured debate strategies.
SECA: Semantically Equivalent & Coherent Attacks for Eliciting LLM Hallucinations