
BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

Figure: Overview of the text-to-SQL workflow used to evaluate LLMs on BiomedSQL.

Requirements

We provide conda environment.yml and pip requirements.txt files for both macOS and Linux.

To install requirements, we recommend creating a new environment with conda:

conda env create -f mac_environment.yml

Or install via pip:

pip install -r mac_requirements.txt

Environment Setup

BiomedSQL requires extensive use of both open- and closed-source LLMs. The following services are needed to run the full set of experiments:

  • AzureOpenAI (with endpoints for gpt-4o, gpt-4o-mini, and gpt-o3-mini)
  • AzureAI (with an endpoint for Meta-Llama-405B)
  • Gemini (for access to gemini-2.0-flash and gemini-2.0-flash-lite)
  • OpenAI (for access to the general completions() API for use in the Schema Indexing interaction paradigm)
  • Anthropic (for access to claude-3-7-sonnet)
  • HuggingFace (for access to gated Meta-Llama repositories)

See config/sample.env for a complete list of the specific information needed from each provider. Once it is filled in, move the file to config/.env so it is picked up automatically by the current experiment setup.
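
For reference, below is a minimal sketch of loading config/.env in Python with python-dotenv; the variable names shown are placeholders, so check config/sample.env for the actual keys required.

# Minimal sketch (assumes python-dotenv is installed); variable names are placeholders
import os
from dotenv import load_dotenv

load_dotenv("config/.env")                             # read key/value pairs into the process environment
azure_openai_key = os.getenv("AZURE_OPENAI_API_KEY")   # placeholder name; see config/sample.env
gemini_key = os.getenv("GEMINI_API_KEY")               # placeholder name; see config/sample.env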

Benchmark Dataset

Our benchmark dataset and associated database tabular data can be found on HuggingFace.
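
As an illustration only, the benchmark can typically be pulled with the Hugging Face datasets library; the dataset identifier below is an assumption, so verify the actual ID on the Hugging Face page.

# Illustrative sketch; the dataset ID is an assumption, not the confirmed identifier
from datasets import load_dataset

benchmark = load_dataset("NIH-CARD/BiomedSQL")  # assumed ID; verify on Hugging Face
print(benchmark)                                # inspect the available splits and columns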

BigQuery Database Creation

We will soon provide code to create a fresh BigQuery database from the parquet files available alongside the BiomedSQL benchmark on HuggingFace.

Reviewers will be provided with a pre-configured config/service_account.json file for access to the current database.
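
Until that code is released, the sketch below shows one way the parquet files could be loaded into BigQuery with the google-cloud-bigquery client; the project, dataset, table, and file names are hypothetical placeholders.

# Hypothetical sketch; project, dataset, table, and file names are placeholders
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("config/service_account.json")
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

with open("table.parquet", "rb") as f:   # one of the benchmark parquet files
    job = client.load_table_from_file(f, "my-project.biomedsql.my_table", job_config=job_config)
job.result()                             # wait for the load job to finish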

LLM Experiments

To run the isolated SQL generation experiments for BiomedSQL, run:

python run_llm_experiments.py

We currently use the following open-source models, with the compute requirements needed to run our experiment pipeline as-is:

  • meta-llama/Llama-3.1-70B-Instruct (three NVIDIA 80GB A100 GPUs)
  • Qwen/Qwen2.5-Coder-32B-Instruct (two NVIDIA 80GB A100 GPUs)
  • Qwen/Qwen2.5-Coder-14B-Instruct (two NVIDIA 80GB A100 GPUs)

We understand that GPU access differs from user to user. To run our experiments without GPUs, comment out any models specified with provider: huggingface under the experiment_models section of config/llm_config.yaml (see the sketch below).
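
For orientation, the relevant portion of config/llm_config.yaml looks roughly like the excerpt below; the exact field layout and the non-huggingface entry are illustrative assumptions, with only the experiment_models section and provider: huggingface field confirmed above.

# Illustrative excerpt of config/llm_config.yaml; exact field layout is an assumption
experiment_models:
  # - name: meta-llama/Llama-3.1-70B-Instruct   # comment out entries like this to skip GPU-hosted models
  #   provider: huggingface
  - name: gpt-4o                                # illustrative API-hosted entry
    provider: azure_openai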

Interaction Paradigm Experiments

To run the interaction paradigm experiments for BiomedSQL, run:

python run_interaction_experiments.py

Generate Results

To generate results figures and tables after the experiments have finished, run:

python results.py

Tables are written to results/ and plots to results/plots/.

Results

On BiomedSQL, GPT-o3-mini is consistently the top-performing model across the variety of experiments performed. However, even when paired with our custom-built text-to-SQL system (BMSQL), it still falls short of domain-expert-level performance.

Model name           | Execution Accuracy | Response Quality Rate
GPT-o3-mini-baseline | 53.5%              | 73.3%
GPT-o3-mini-combo    | 59.0%              | 77.8%
BMSQL-GPT-o3-mini    | 62.6%              | 84.6%

License and Contributing

This repository is under the PolyForm Noncommercial License (Version 1.0.0). To contribute, simply clone the repository and open a pull request!

Relevant Citation

@article{koretsky2025biomedsql,
      title  = {BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases}, 
      author = {Mathew J. Koretsky and Maya Willey and Adi Asija and Owen Bianchi and Chelsea X. Alvarado and Tanay Nayak and Nicole Kuznetsov and Sungwon Kim and Mike A. Nalls and Daniel Khashabi and Faraz Faghri},
      year   = {2025},
      eprint = {2505.20321},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CL},
      url    = {https://arxiv.org/abs/2505.20321},
      code   = {https://github.com/NIH-CARD/biomedsql},
}

Happy benchmarking!
