
SV‑TrustEval‑C 🚨🔒

License: MIT · IEEE S&P 2025 · Python · Dataset

[Figure: SV-TrustEval-C overview]

🔍 Overview

SV‑TrustEval‑C is the first reasoning‑based benchmark designed to rigorously evaluate Large Language Models (LLMs) on both structure (control/data flow) and semantic reasoning for vulnerability analysis in C source code. Unlike existing benchmarks that focus on pattern recognition, SV‑TrustEval‑C measures logical consistency, adaptability to code transformations, and real‑world security reasoning across six core tasks.

Our benchmark reveals that current LLMs predominantly rely on superficial pattern matching, exposing critical gaps in their ability to understand complex code relationships and ensure trustworthy vulnerability analysis.

⭐ Key Features

  • 🎯 Dual Reasoning Dimensions: Structure (ControlFlow/DataFlow) & Semantic (Counterfactual/Goal‑Driven/Predictive)
  • 📊 Comprehensive Metrics: Accuracy, Conceptual Distance Sensitivity, Reasoning Consistency
  • 🔄 Plug‑and‑Play Framework: Seamless integration with Hugging Face models
  • 🌐 Open Dataset & Scripts: Fully reproducible, with reliable label accuracy

⚙️ Installation

git clone https://github.com/Jackline97/SV-TrustEval-C.git
cd SV-TrustEval-C
pip install -r requirements.txt

🚀 Quick Start

Before you begin, ensure you have downloaded and preprocessed the dataset from Hugging Face by running:

python Eval_Script/data_preprocessing.py

This step prepares the necessary data for evaluation.

Single-Model Evaluation

python Eval_Script/Test_Script_HF.py \
  --model_name "llama-3.1-8b-instruct" \
  --benchmark_loc "./SV-TrustEval-C-Official-1.0" \
  --result_loc "./results" \
  --temperature 0.0 \
  --inference_mode "zero-shot"
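
Under the hood, a zero-shot query boils down to prompting a chat model with one benchmark question at a time. The minimal sketch below illustrates this with a standard Hugging Face transformers text-generation pipeline; the model ID and question text are illustrative, and the official Test_Script_HF.py additionally handles prompt construction, batching, and result serialization.

# Minimal zero-shot query sketch (illustrative only).
from transformers import pipeline

# Any Hugging Face chat model can be substituted here.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
)

# A hypothetical multiple-choice question in the spirit of the benchmark.
question = (
    "Given the following C function, which statement does the value of "
    "`buf` flow into? (A) ... (B) ... (C) ... (D) ..."
)
messages = [{"role": "user", "content": question}]

# do_sample=False mirrors the deterministic temperature 0.0 setting above.
output = generator(messages, max_new_tokens=64, do_sample=False)
print(output[0]["generated_text"][-1]["content"])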

Batch Evaluation

python Eval_Script/Run_Test_script_HF.py

Performance Analysis

python Eval_Script/Run_Eval_script.py \
  --root_folder "./results/LLM_result_zero-shot_0.0" \
  --save_path "./results/eval_score.json"
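
After scoring, the aggregated metrics land in a single JSON file. A quick way to inspect it is sketched below; the exact JSON layout is an assumption, so adjust the keys to the actual output of Run_Eval_script.py.

import json

# Load the aggregated scores written by Run_Eval_script.py.
with open("./results/eval_score.json") as f:
    scores = json.load(f)

# Print whatever per-task metrics the file contains.
for task, metrics in scores.items():
    print(task, metrics)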

📋 Benchmark Tasks

| Dimension | Task | Description | Statistics |
| --- | --- | --- | --- |
| ⚙️ Structure | Control Flow | Analyze program control-flow impacts. | 1,345 questions |
| ⚙️ Structure | Data Flow | Trace data dependencies and influence. | 2,430 questions |
| 🧠 Semantic | Counterfactual | Predict vulnerability under code perturbations. | 3,748 questions |
| 🧠 Semantic | Goal-Driven | Safely modify code to meet functional goals. | 1,159 questions |
| 🧠 Semantic | Predictive | Classify variants by vulnerability impact. | 719 questions |
| 🛡️ Safety | Base Code Files | Compare safe vs. unsafe versions of code samples. | 377 safe, 377 unsafe |
| ⚠️ CWEs | Unique CWEs | Categorize vulnerabilities by distinct Common Weakness Enumerations (CWEs). | 82 unique CWEs |

📈 Evaluation Metrics

  • Accuracy: Task-level correctness
  • Conceptual Distance Sensitivity: Ability to handle increasing structural complexity
  • Reasoning Consistency: Logical coherence across related queries (a toy computation is sketched below)
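
To make the metrics concrete, the toy sketch below computes accuracy and one simple reading of reasoning consistency over a handful of fake records; the field names and grouping scheme are assumptions, not the official implementation.

from collections import defaultdict

# One record per answered question; `group_id` ties together related queries
# about the same code sample (field names are illustrative).
records = [
    {"group_id": "g1", "correct": True},
    {"group_id": "g1", "correct": True},
    {"group_id": "g2", "correct": True},
    {"group_id": "g2", "correct": False},
]

# Accuracy: fraction of individual questions answered correctly.
accuracy = sum(r["correct"] for r in records) / len(records)

# Reasoning consistency (simplified): fraction of groups in which every
# related question is answered correctly, i.e. the answers stay coherent
# across the whole group of related queries.
groups = defaultdict(list)
for r in records:
    groups[r["group_id"]].append(r["correct"])
consistency = sum(all(v) for v in groups.values()) / len(groups)

print(f"accuracy={accuracy:.2f}, consistency={consistency:.2f}")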

💾 Dataset

Official Dataset (v1.0)

Download the official benchmark directly from Hugging Face: https://huggingface.co/datasets/LLMs4CodeSecurity/SV-TrustEval-C-1.0

Alternatively, use the following command to automatically download and preprocess the dataset:

python Eval_Script/data_preprocessing.py
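
If you prefer to fetch the raw dataset yourself, the sketch below uses huggingface_hub; the repository ID comes from the paper's dataset link, and the local directory simply mirrors the --benchmark_loc used in the Quick Start example. The preprocessing script above remains the recommended path, since it also prepares the data layout expected by the evaluation scripts.

from huggingface_hub import snapshot_download

# Download the dataset repository locally (destination path is illustrative).
local_dir = snapshot_download(
    repo_id="LLMs4CodeSecurity/SV-TrustEval-C-1.0",
    repo_type="dataset",
    local_dir="./SV-TrustEval-C-Official-1.0",
)
print("Dataset downloaded to", local_dir)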

PrimeVul Benchmark

For the PrimeVul version, please download the file:

  • SV-TrustEval_primevul.zip

Note:

  • No preprocessing is required for the PrimeVul benchmark.
  • The PrimeVul code snippets are not validated with a compilation check, as no dedicated build environment is available for them.

📊 Results Structure

results/
└── LLM_result_[mode]_[temp]/
    └── [model_name]/
        ├── ControlFlow/
        ├── DataFlow/
        ├── Counterfactual/
        ├── GoalDriven/
        └── Predictive/
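
A quick way to sanity-check that a run produced output for every task is to walk this layout; the model folder name below is illustrative and should match the --model_name you evaluated.

from pathlib import Path

# Count result files per task for one evaluated model.
root = Path("./results/LLM_result_zero-shot_0.0/llama-3.1-8b-instruct")
for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    n_files = len(list(task_dir.glob("*")))
    print(f"{task_dir.name}: {n_files} result file(s)")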

Evaluation Results

🤖 Supported Models

  • Meta Llama-3.1-8B-Instruct
  • Gemma-7B-IT
  • Mistral-7B-Instruct
  • CodeQwen1.5-7B
  • CodeGemma-7B
  • CodeLlama-13B/7B-Instruct
  • And more via Hugging Face

📚 Citation

@INPROCEEDINGS{li2025svtrusteval,
  author    = {Li, Yansong and Branco, Paula and Hoole, Alexander M. and Marwah, Manish and Koduvely, Hari Manassery and Jourdan, Guy-Vincent and Jou, Stephan},
  booktitle = {2025 IEEE Symposium on Security and Privacy (SP)},
  title     = {{SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis}},
  year      = {2025},
  ISSN      = {2375-1207},
  pages     = {3014-3032},
  abstract  = {As Large Language Models (LLMs) evolve in understanding and generating code, accurately evaluating their reliability in analyzing source code vulnerabilities becomes increasingly vital. While studies have examined LLM capabilities in tasks like vulnerability detection and repair, they often overlook the importance of both structure and semantic reasoning crucial for trustworthy vulnerability analysis. To address this gap, we introduce \textsc{SV-TrustEval-C}, a benchmark designed to evaluate LLMs' abilities for vulnerability analysis of code written in the C programming language through two key dimensions: structure reasoning—assessing how models identify relationships between code elements under varying data and control flow complexities; and semantic reasoning—examining their logical consistency in scenarios where code is structurally and semantically perturbed. Our results show that current LLMs are far from satisfactory in understanding complex code relationships and that their vulnerability analyses rely more on pattern matching than on robust logical reasoning. These findings underscore the effectiveness of the \textsc{SV-TrustEval-C} benchmark and highlight critical areas for enhancing the reasoning capabilities and trustworthiness of LLMs in real-world vulnerability analysis tasks. Our initial benchmark dataset is available at \url{https://huggingface.co/datasets/LLMs4CodeSecurity/SV-TrustEval-C-1.0}},
  keywords  = {source code vulnerability; large language model},
  doi       = {10.1109/SP61157.2025.00191},
  url       = {https://doi.ieeecomputersociety.org/10.1109/SP61157.2025.00191},
  publisher = {IEEE Computer Society},
  address   = {Los Alamitos, CA, USA},
  month     = {May}
}

📄 License

Released under the MIT License. See LICENSE for details.
