A curated list of awesome AI guardrails.
If you find this list helpful, give it a ⭐ on GitHub, share it, and contribute by submitting a pull request or issue!
Name | Description |
---|---|
security-and-privacy | Security and privacy guardrails ensure content remains safe, ethical, and devoid of offensive material |
response-and-relevance | Ensures model responses are accurate, focused, and aligned with user intent |
language-quality | Ensures high standards of readability, coherence, and clarity |
content-validation | Ensures factual correctness and logical coherence of content |
logic-validation | Ensures logical and functional correctness of generated code and data |
Sub Category | Description |
---|---|
inappropriate-content | Detects and filters inappropriate or explicit content |
offensive-language | Identifies and filters profane or offensive language |
prompt-injection | Prevents manipulation attempts through malicious prompts |
sensitive-content | Flags culturally, politically, or socially sensitive topics |
deepfake-detection | Detects and filters deepfake content |
pii | Identifies and filters personally identifiable information |
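
As a rough illustration of the `pii` sub-category, the sketch below masks email addresses with a regular expression. It is not how the models in this list work; the function name `mask_emails`, the `[EMAIL]` placeholder, and the pattern itself are assumptions chosen for the example, and a real PII filter would need to cover many more entity types.

```python
import re

# Hypothetical, deliberately simple pattern: local part, "@", then a dotted domain.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    print(mask_emails("Contact jane.doe@example.com today"))
```

A rule like this is cheap enough to run before a model-based detector, which can then handle names, addresses, and other context-dependent entities.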
Models in security-and-privacy
Sub Category | Description |
---|---|
relevance | Validates semantic relevance between input and output |
prompt-address | Confirms response correctly addresses user's prompt |
url-validation | Verifies validity of generated URLs |
factuality | Cross-references content with external knowledge sources |
refusal | Refuses to answer questions that are not appropriate or relevant |
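
The `url-validation` check can be approximated without any model. The following is a minimal sketch using only the Python standard library; the function name `is_well_formed_url` and the default list of allowed schemes are assumptions for illustration. It only checks that a URL is well formed, not that it resolves.

```python
from urllib.parse import urlparse


def is_well_formed_url(url: str, allowed_schemes=("http", "https")) -> bool:
    """Return True if `url` has an allowed scheme and a non-empty host."""
    try:
        parsed = urlparse(url.strip())
        host = parsed.hostname  # may raise ValueError for malformed hosts
    except ValueError:
        return False
    return parsed.scheme in allowed_schemes and bool(host)


if __name__ == "__main__":
    for candidate in ["https://example.com/docs", "not a url", "ftp://example.com/file"]:
        print(candidate, is_well_formed_url(candidate))
```

A stricter guardrail could follow this syntactic check with a network request to confirm that the generated link actually exists.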
Models in response-and-relevance
Name | Size | Task |
---|---|---|
protectai/distilroberta-base-rejection-v1 | 0.0821B | text-classification |
s-nlp/E5-EverGreen-Multilingual-Small | 0.118B | text-classification |
lytang/MiniCheck-RoBERTa-Large | 0.4B | text-classification |
lytang/MiniCheck-Flan-T5-Large | 0.8B | text-classification |
ibm-granite/granite-guardian-3.1-2b | 2B | text-classification |
bespokelabs/Bespoke-MiniCheck-7B | 7B | text-classification |
nvidia/prompt-task-and-complexity-classifier | 0.184B | text-classification |
PatronusAI/glider | 3.8B | text-classification |
flowaicom/Flow-Judge-v0.1 | 3.8B | text-classification |
Sub Category | Description |
---|---|
quality | Assesses structure, relevance, and coherence of output |
translation-accuracy | Ensures contextually correct and linguistically accurate translations |
duplicate-elimination | Detects and removes redundant content |
readability | Evaluates text complexity for target audience |
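
To make `duplicate-elimination` concrete, here is a minimal sketch that drops repeated sentences after normalizing case and punctuation. The function name `drop_duplicate_sentences` and the naive sentence splitter are assumptions for this example; it catches only near-verbatim repeats, not paraphrases.

```python
import re


def drop_duplicate_sentences(text: str) -> str:
    """Keep the first occurrence of each sentence, ignoring case and punctuation."""
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    seen = set()
    kept = []
    for sentence in sentences:
        # Normalized key: lowercase words separated by single spaces.
        key = re.sub(r"\W+", " ", sentence).strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence)
    return " ".join(kept)


if __name__ == "__main__":
    print(drop_duplicate_sentences("The sky is blue. The sky is blue! Grass is green."))
```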
Models in language-quality
Name | Size | Task |
---|---|---|
HuggingFaceFW/fineweb-edu-classifier | 0.109B | text-classification |
nvidia/quality-classifier-deberta | 0.184B | text-classification |
facebook/nllb-200-distilled-600M | 0.6B | text-to-text-generation |
nvidia/prompt-task-and-complexity-classifier | 0.184B | text-classification |
PatronusAI/glider | 3.8B | text-classification |
flowaicom/Flow-Judge-v0.1 | 3.8B | text-classification |
Sub Category | Description |
---|---|
competitor-blocking | Screens for mentions of rival brands or companies |
price-validation | Validates price-related data against verified sources |
source-verification | Verifies accuracy of external quotes and references |
gibberish-filter | Identifies and filters nonsensical or incoherent outputs |
Models in content-validation
Name | Size | Task |
---|---|---|
s-nlp/mdistilbert-base-formality-ranker | 0.142B | text-classification |
d4data/bias-detection-model | 0.3B | text-classification |
NousResearch/Minos-v1 | 0.4B | text-classification |
osmosis-ai/Osmosis-Structure-0.6B | 0.6B | token-classification |
gliner-community/gliner_small-v2.5 | 0.7B | token-classification |
Sub Category | Description |
---|---|
sql-validation | Validates SQL queries for syntax and security |
api-validation | Ensures API calls conform to OpenAPI standards |
json-validation | Validates JSON structure and schema |
logical-consistency | Checks for contradictory or illogical statements |
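
Several `logic-validation` checks can be done with plain rules rather than a model. The sketch below, using only the Python standard library, shows one possible approach to `json-validation` and to the syntax half of `sql-validation`. The helper names `validate_json` and `sql_syntax_ok`, the key-to-type mapping, and the use of SQLite as the reference dialect are all assumptions for illustration; the SQL check does not cover security concerns such as injection.

```python
import json
import sqlite3


def validate_json(text, required):
    """Parse `text` and check required keys and their types.

    `required` maps key names to expected Python types.
    Returns a list of error strings; an empty list means the input passed.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for key, expected_type in required.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            errors.append(f"wrong type for key: {key}")
    return errors


def sql_syntax_ok(query, schema):
    """Check that `query` compiles against `schema` in an in-memory SQLite
    database. EXPLAIN prepares the statement without running it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        conn.execute(f"EXPLAIN {query}")
        return True
    except (sqlite3.Error, sqlite3.Warning):
        # Syntax errors, unknown tables or columns, or multiple statements.
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    print(validate_json('{"name": "a"}', {"name": str, "price": float}))
    print(sql_syntax_ok("SELECT id FROM users", "CREATE TABLE users (id INTEGER);"))
```

For full schema validation, a dedicated JSON Schema validator would replace the hand-rolled key and type check.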
Name | Size | Category | Sub Category |
---|---|---|---|
s-nlp/mdistilbert-base-formality-ranker | 0.142B | content-validation | quality |
d4data/bias-detection-model | 0.3B | content-validation | bias |
NousResearch/Minos-v1 | 0.4B | content-validation | refusal |
HuggingFaceFW/fineweb-edu-classifier | 0.109B | language-quality | quality |
nvidia/quality-classifier-deberta | 0.184B | language-quality | quality |
protectai/distilroberta-base-rejection-v1 | 0.0821B | response-and-relevance | rejection |
s-nlp/E5-EverGreen-Multilingual-Small | 0.118B | response-and-relevance | factuality |
lytang/MiniCheck-RoBERTa-Large | 0.4B | response-and-relevance | factuality, logical-consistency, relevance |
lytang/MiniCheck-Flan-T5-Large | 0.8B | response-and-relevance | factuality, logical-consistency, relevance |
ibm-granite/granite-guardian-3.1-2b | 2B | response-and-relevance | factuality, logical-consistency, relevance |
bespokelabs/Bespoke-MiniCheck-7B | 7B | response-and-relevance | factuality, logical-consistency, relevance |
nvidia/prompt-task-and-complexity-classifier | 0.184B | response-and-relevance, language-quality | relevance, quality |
PatronusAI/glider | 3.8B | response-and-relevance, language-quality | factuality, logical-consistency, relevance, quality |
flowaicom/Flow-Judge-v0.1 | 3.8B | response-and-relevance, language-quality | factuality, logical-consistency, relevance, quality |
meta-llama/Llama-Prompt-Guard-2-22M | 0.022B | security-and-privacy | prompt-injection, jailbreaks |
eliasalbouzidi/distilbert-nsfw-text-classifier | 0.068B | security-and-privacy | inappropriate-content |
meta-llama/Llama-Prompt-Guard-2-86M | 0.086B | security-and-privacy | prompt-injection, jailbreaks |
ibm-granite/granite-guardian-hap-125m | 0.125B | security-and-privacy | toxicity, hallucination |
protectai/deberta-v3-small-prompt-injection-v2 | 0.142B | security-and-privacy | prompt-injection |
protectai/deberta-v3-base-prompt-injection-v2 | 0.182B | security-and-privacy | prompt-injection |
TostAI/nsfw-text-detection-large | 0.355B | security-and-privacy | inappropriate-content |
MoritzLaurer/ModernBERT-large-zeroshot-v2.0 | 0.4B | security-and-privacy | inappropriate-content, offensive-language, prompt-injection, sensitive-content |
madhurjindal/Jailbreak-Detector-2-XL | 0.5B | security-and-privacy | jailbreaks |
google/shieldgemma-2b | 2B | security-and-privacy | inappropriate-content, offensive-language, prompt-injection, sensitive-content |
Name | Size | Category | Sub Category |
---|---|---|---|
osmosis-ai/Osmosis-Structure-0.6B | 0.6B | content-validation, security-and-privacy | pii, competitor-blocking |
gliner-community/gliner_small-v2.5 | 0.7B | content-validation, security-and-privacy | pii, competitor-blocking |
ai4privacy/llama-ai4privacy-multilingual-categorical-anonymiser-openpii | 0.15B | security-and-privacy | pii |
Name | Size | Category | Sub Category |
---|---|---|---|
facebook/nllb-200-distilled-600M | 0.6B | language-quality | translation-accuracy |
meta-llama/Llama-3.2-1B-Instruct | 1B | security-and-privacy | inappropriate-content, offensive-language, prompt-injection, sensitive-content |
Name | Size | Category | Sub Category |
---|---|---|---|
Marqo/nsfw-image-detection-384 | 0.006B | security-and-privacy | inappropriate-content |
Freepik/nsfw_image_detector | 0.086B | security-and-privacy | inappropriate-content |
Organika/sdxl-detector | 0.086B | security-and-privacy | deepfake-detection |
prithivMLmods/Deep-Fake-Detector-v2-Model | 0.086B | security-and-privacy | deepfake-detection |
TostAI/nsfw-image-detection-large | 0.0871B | security-and-privacy | inappropriate-content |
Ateeqq/nsfw-image-detection | 0.092B | security-and-privacy | inappropriate-content |
Falconsai/nsfw_image_detection | 0.1B | security-and-privacy | inappropriate-content |
OpenSafetyLab/ImageGuard | na | security-and-privacy | inappropriate-content |
Name | Size | Category | Sub Category |
---|---|---|---|
meta-llama/Llama-Guard-4-12B | 12B | security-and-privacy | inappropriate-content, offensive-language, prompt-injection, sensitive-content |
Name | Category | Description |
---|---|---|
guardrails | all | Adding guardrails to large language models. |
NeMo-Guardrails | all | NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. |
uqlm | hallucination | UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection. |
llm-guard | all | The Security Toolkit for LLM Interactions. |
Name | Category | Description |
---|---|---|
Lakera | all | Lakera is a company that provides security products for AI applications. |
Guardrails AI Pro | all | Guardrails AI Pro is a commercial version of guardrails that provides additional features and support. |
Name | Category | Description |
---|---|---|
lytang/LLM-AggreFact | factuality | LLM-AggreFact is a benchmark that aggregates public datasets for evaluating fact-checking against grounding documents. |
Enterprise PII Masking | pii | Enterprise PII Masking is a collection of datasets for enterprise PII masking, focused on location, work, health, digital, and financial information. |
prithivMLmods/OpenDeepfake-Preview | deepfake-detection | OpenDeepfake-Preview is a dataset of 20K deepfake images. |
eliasalbouzidi/NSFW-Safe-Dataset | nsfw | NSFW-Safe-Dataset is a dataset for NSFW content detection. |
lmsys/toxic-chat | toxic-chat | Toxic-Chat is a dataset for toxic chat detection. |
Name | Category | Description |
---|---|---|
Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers | hallucination | Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers |
RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models | factuality | RAGTruth is a hallucination corpus for developing trustworthy retrieval-augmented language models. |
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents | factuality | Shows how to build small fact-checking models that have GPT-4-level performance at 400x lower cost. |
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions | hallucination | A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions |
Granite Guardian: A Guardrail Framework for Large Language Models | all | Granite Guardian is a guardrail framework for large language models. |
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models | prompt-injection | "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models |
Tiny-Toxic-Detector: A compact transformer-based model for toxic content detection | toxic-chat | Tiny-Toxic-Detector: A compact transformer-based model for toxic content detection |
T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation | toxic-chat | T2ISafety is a benchmark for assessing fairness, toxicity, and privacy in image generation. |