DSDBench

DSDBench: Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors

• 📖 Introduction • 🎉 News • ✨ DSDBench • 🚀 Methodology

• ⚡️ Getting Started • ⚙️ Configuration Details • 📊 Experiment Results • 🔎 Citation • 📃 Paper

📖 Introduction

Debugging data science code presents significant challenges, especially when multiple logical errors interact in intricate ways. Existing benchmarks often focus on simple, isolated error scenarios, leaving the debugging of multi-hop, multi-bug errors largely unexplored. DSDBench fills this critical gap by offering a comprehensive dataset and evaluation framework designed to assess and improve large language models (LLMs) in debugging complex, real-world data science code problems.

🎉 News

  • March 21, 2025: DSDBench dataset and evaluation framework officially released! 🎊

✨ DSDBench

DSDBench is the first systematic benchmark explicitly created for data science code debugging, featuring:

  • Realistic Errors: Logical and runtime errors that mirror real-world data science workflows.
  • Multi-Hop Debugging: Scenarios where error identification requires tracing back through multiple code execution steps.
  • Multi-Bug Scenarios: Cases involving concurrent errors within a single code snippet.
  • Comprehensive Annotations: Includes 1,117 meticulously annotated examples, clearly labeling cause-effect error lines and runtime error messages.
(Figure: DSDBench framework overview)
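
To make the multi-hop, multi-bug setting concrete, here is a small constructed example (illustrative only, not taken from the dataset). The cause line leaves a column as strings, the intermediate step still executes, and the runtime error only surfaces at the effect line two hops later:

import pandas as pd

df = pd.DataFrame({"price": ["10", "20", "30"], "cost": [1, 2, 3]})

# Cause line: the conversion result is never assigned, so prices stay strings
pd.to_numeric(df["price"])

# Intermediate step still runs (pandas repeats the strings), hiding the bug
df["price_x2"] = df["price"] * 2

# Effect line: the TypeError is only raised here, far from the real cause
df["margin"] = df["price_x2"] - df["cost"]

A multi-bug item combines several such cause-effect pairs in a single program, and the debugger has to report every pair.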

🚀 Methodology

Our contributions include:

  • Automated Error Injection: Using LLMs to systematically inject realistic runtime errors into data science code.
  • Dynamic Error Annotation: Utilizing runtime tracing (with tools like snoop) to accurately capture cause-effect relationships in errors.
  • Rigorous Evaluation Protocols: Employing a four-dimensional evaluation approach covering cause lines, effect lines, error types, and error messages.
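
As a point of reference, snoop traces a function line by line and prints each executed statement together with updated variable values, which is what makes the cause line recoverable from a runtime failure. A minimal usage sketch (not the benchmark's actual annotation code):

import pandas as pd
import snoop  # pip install snoop

@snoop  # prints every executed line and changed locals to stderr
def buggy_pipeline():
    df = pd.DataFrame({"x": [1, 2, 3]})
    df = df.rename(columns={"x": "y"})  # cause line: column "x" disappears
    return df["x"].sum()                # effect line: KeyError raised here

try:
    buggy_pipeline()
except KeyError:
    pass  # the printed trace links the rename (cause) to the KeyError (effect)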

⚑️ Getting Started

To start using DSDBench, follow these installation and execution steps:

🛠️ Installation

You can install DSDBench and its dependencies using one of the following methods:

  1. Using pip with the requirements file:

    pip install -r requirements.txt
  2. Installing as a package (development mode):

    pip install -e .

🔑 API Configuration

To use DSDBench with language models that require API access (like GPT-4o), you need to configure your API credentials:

  1. Open the configuration file at agents/config/openai.py
  2. Add your API key and base URL:
    API_KEY = 'your-api-key-here'
    BASE_URL = 'https://api.openai.com/v1'  # Default OpenAI URL, change if using a different provider
    temperature = 0  # Adjust model temperature as needed

Note: If you're using a different model provider (like Azure OpenAI), set the appropriate base URL according to your provider's documentation.
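
The snippet below is only a sketch of how these settings are typically consumed with the official openai Python client; the import path mirrors the file location above, but the agents' real wiring may differ.

# Sketch: assumed use of API_KEY, BASE_URL and temperature via the openai client.
from openai import OpenAI
from agents.config.openai import API_KEY, BASE_URL, temperature

client = OpenAI(api_key=API_KEY, base_url=BASE_URL)
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=temperature,
    messages=[{"role": "user", "content": "Find the cause and effect error lines in this script: ..."}],
)
print(response.choices[0].message.content)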

📂 Project Structure

The DSDBench repository has the following structure:

  • DSDBench/
    • 📁 agents/
      • (Agent model implementation directory)
    • 📁 config/
      • (Configuration files directory)
      • dabench_quantitative_experiment_config.py
      • single_bug_eval_agent_config.py
      • multi_bug_eval_agent_config.py
      • error_snoop_agent_config.py
      • library_error_inject_agent_config.py
      • weak_llm_direct_analysis_config.py
      • data_annotate_agent_config.py
    • 📁 workspace/
      • (Workspace directory)
      • 📁 benchmark_evaluation/
        • (Benchmark evaluation directory)
        • bench_final_annotation_single_error.jsonl
        • bench_final_annotation_multi_errors.jsonl
        • compute_single_eval_results.py
        • compute_multi_eval_results.py
      • filter_non_executable_data.py
      • find_multi_hop_data.py
      • merge_final_annotation.py
      • merge_multiple_errors.py
    • workflow_generic.py
      • (Main workflow execution script with command line support)
    • run_single_bug_eval.py
      • (Helper script for single-bug evaluation)
    • run_multi_bug_eval.py
      • (Helper script for multi-bug evaluation)

▢️ Running Evaluations

DSDBench provides helper scripts to simplify the evaluation process:

For single-bug scenarios:

python run_single_bug_eval.py

This command automatically runs the workflow using the single-bug configuration and computes the evaluation results.

For multi-bug scenarios:

python run_multi_bug_eval.py

This command executes the multi-bug workflow and calculates the multi-error evaluation metrics.
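
run_single_bug_eval.py and run_multi_bug_eval.py live in the repository root; conceptually, each one chains the generic workflow run with the matching results script from the Manual Execution section below. A rough sketch of the assumed behaviour of run_single_bug_eval.py:

# Assumed behaviour of run_single_bug_eval.py: run the workflow, then score it.
import subprocess

subprocess.run(
    ["python", "workflow_generic.py",
     "--config", "config/single_bug_eval_agent_config.py"],
    check=True,
)
subprocess.run(
    ["python", "compute_single_eval_results.py"],
    cwd="workspace/benchmark_evaluation",
    check=True,
)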

🕹️ Manual Execution

For more control, you can run individual workflow components manually:

For single-bug evaluation:

python workflow_generic.py --config config/single_bug_eval_agent_config.py
cd workspace/benchmark_evaluation
python compute_single_eval_results.py

For multi-bug evaluation:

python workflow_generic.py --config config/multi_bug_eval_agent_config.py
cd workspace/benchmark_evaluation
python compute_multi_eval_results.py

📝 Dataset Creation

To generate datasets from scratch, execute the pipeline steps in the following order:

# First, run the initial data generation workflows
python workflow_generic.py --config config/data_annotate_agent_config.py
python workflow_generic.py --config config/library_error_inject_agent_config.py
python workflow_generic.py --config config/error_snoop_agent_config.py
python workflow_generic.py --config config/weak_llm_direct_analysis_config.py

# Then process the data with our improved utilities
cd workspace

# Filter for executable errors
python filter_non_executable_data.py --input path/to/monitored_errors.jsonl --output path/to/filtered_errors.jsonl

# Find multi-hop errors
python find_multi_hop_data.py --input path/to/filtered_errors.jsonl --output path/to/annotated_errors.jsonl

# Merge annotations from multiple sources
python merge_final_annotation.py --input path/to/file1.jsonl path/to/file2.jsonl --output path/to/bench_final_annotation_single_error.jsonl

# Generate multi-bug scenarios
python merge_multiple_errors.py --input path/to/bench_final_annotation_single_error.jsonl --output path/to/bench_final_annotation_multi_errors.jsonl --samples_per_entry 5

Each utility script supports command-line arguments for flexible input/output path configuration:

  • filter_non_executable_data.py: Filters data to keep only error versions with valid traceback information
  • find_multi_hop_data.py: Identifies cause and effect error lines in traceback output
  • merge_final_annotation.py: Merges multiple JSONL annotation files into a single dataset
  • merge_multiple_errors.py: Generates multi-bug scenarios by combining single-bug errors
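
All four utilities follow the same read-JSONL, transform, write-JSONL pattern. As an illustration, a simplified, hypothetical version of the filtering step might look like this (the traceback field name is an assumption, not the script's actual schema):

# Hypothetical, simplified sketch of filter_non_executable_data.py.
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--output", required=True)
args = parser.parse_args()

with open(args.input, encoding="utf-8") as fin, \
     open(args.output, "w", encoding="utf-8") as fout:
    for line in fin:
        record = json.loads(line)
        # Keep only error versions that produced a usable traceback (assumed field name)
        if record.get("traceback"):
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")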

⚙️ Configuration Details

The configuration files in the config/ directory manage different aspects of the benchmark. Here's a brief overview:

  • single_bug_eval_agent_config.py: Configuration for single-bug evaluation scenarios.
  • multi_bug_eval_agent_config.py: Configuration for multi-bug evaluation scenarios.
  • data_annotate_agent_config.py: Configuration for the data annotation process.
  • library_error_inject_agent_config.py: Configuration for error injection in libraries.
  • error_snoop_agent_config.py: Configuration for error monitoring.
  • weak_llm_direct_analysis_config.py: Configuration for weak LLM error analysis.

To use a specific configuration file when running the workflow, use the --config argument:

python workflow_generic.py --config config/your_chosen_config.py

⚙️ Configuration Structure

Each configuration file adheres to a standard structure defined as follows:

AGENT_CONFIG = {
    'workspace': './workspace/path',  # Base workspace directory
    'agents': [
        {
            'name': 'agent_name',     # Name of the agent
            'class': AgentClass,      # The agent class to instantiate
            'prompts': {              # Prompts used by the agent
                'system': SYSTEM_PROMPT,
                'user': USER_PROMPT,
                'eval': EVAL_PROMPT,
                # Other prompts as needed
            },
            'kwargs': {               # Additional agent parameters
                'query': 'Default query',
                # Other parameters as needed
            }
        },
        # Additional agents as needed
    ]
}

WORKFLOW = [
    {
        'agent': 'agent_name',        # Name of the agent to run
        'method': 'method_name',      # Agent method to execute
        'args': {                     # Arguments for the method
            'model_type': 'gpt-4o',   # LLM model to use
            'eval_folder': 'workspace/results'  # Output location
        },
        'input': {'data': 'path/to/input.jsonl'},  # Input data source
        'data_ids': [1, 2, 3],        # Specific data IDs to process
        'data_range': [1, 50],        # Mutually exclusive with 'data_ids'; process this range of data IDs
        'output': 'result_name',      # Name for the output
        'output_type': 'analysis'     # Type of output
    },
    # Additional workflow steps as needed
]
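
Given this structure, workflow_generic.py can be read as a driver that builds every agent from AGENT_CONFIG and then executes the WORKFLOW steps in order. The sketch below shows only the assumed control flow; the agent constructor signature is illustrative, not the repository's actual code.

# Assumed control flow of the workflow driver (illustration only).
def run_workflow(agent_config, workflow):
    # Instantiate each agent from its class, prompts, and extra kwargs
    agents = {
        spec["name"]: spec["class"](prompts=spec["prompts"], **spec["kwargs"])
        for spec in agent_config["agents"]
    }

    results = {}
    for step in workflow:
        method = getattr(agents[step["agent"]], step["method"])
        # Pass through the step's arguments and input sources
        results[step["output"]] = method(**step.get("args", {}),
                                         **step.get("input", {}))
    return results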

⚙️ Customizing Agent Parameters

Agents can be customized by modifying the kwargs dictionary within their configuration. The most commonly adjusted setting, the model used by each workflow step, is described below.

⚙️ Model Selection

The model_type parameter in workflow steps specifies the LLM to be used for evaluation:

  • gpt-4o: OpenAI GPT-4o model.
  • Qwen/Qwen2.5-72B-Instruct: Qwen 2.5 model.
  • deepseek/deepseek-v3: DeepSeek v3 model.
  • Any other model identifier permitted by your API provider and key.

📊 Experiment Results

Evaluations of state-of-the-art LLMs reveal significant challenges in multi-bug debugging scenarios. Key results are summarized below:

| Model       | Cause Line Acc. | Effect Line Acc. | Error Type Acc. | Error Message Acc. |
|-------------|-----------------|------------------|-----------------|--------------------|
| GPT-4o      | 39.0%           | 34.3%            | 30.6%           | 31.4%              |
| Claude 3.5  | 43.7%           | 35.2%            | 36.3%           | 34.0%              |
| Deepseek-V3 | 48.3%           | 34.5%            | 35.9%           | 34.7%              |

Detailed analysis and ablation studies further emphasize the benchmark's complexity and its value in diagnosing model limitations.

Here is a case study of Large Reasoning Models on DSDBench:

(Figure: Case study of Large Reasoning Models on DSDBench)

🔎 Citation

If DSDBench is helpful in your research, please cite our work using the following BibTeX entry:

@misc{yang2025stoperrorbenchmarkingllms,
      title={Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors}, 
      author={Zhiyu Yang and Shuo Wang and Yukun Yan and Yang Deng},
      year={2025},
      eprint={2503.22388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.22388}, 
}
