UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
DeepTeam is a framework to red team LLMs and LLM systems.
Decrypted Generative Model safety files for Apple Intelligence containing filters
[NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts.
Attack to induce hallucinations in LLMs
Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"
Restore safety in fine-tuned language models through task arithmetic
DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
NeurIPS'24 - LLM Safety Landscape
Code and dataset for the paper: "Can Editing LLMs Inject Harm?"
Source code of "Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers"
Official code for the paper "SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning"
[COLM 2025] LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning
Source code for "Opt-Out: Investigating Entity-Level Unlearning for Large Language Models via Optimal Transport", ACL 2025
Multi-agent debate framework for enhancing LLM safety through red-teaming prompts, feedback-driven learning, long-term memory, and diverse structured debate strategies.
SECA: Semantically Equivalent & Coherent Attacks for Eliciting LLM Hallucinations