Overview of text-to-SQL workflow used to evaluate LLMs on BiomedSQL.
We provide conda `environment.yml` and `requirements.txt` files for both macOS and Linux. To install the requirements, we recommend creating a new environment with conda:
conda env create -f mac_environment.yml
Or install via pip:
pip install -r mac_requirements.txt
BiomedSQL requires extensive use of both open- and closed-source LLMs. The following services are needed to run the full set of experiments:
- AzureOpenAI (with endpoints for gpt-4o, gpt-4o-mini, and gpt-o3-mini)
- AzureAI (with an endpoint for Meta-Llama-405B)
- Gemini (for access to gemini-2.0-flash and gemini-2.0-flash-lite)
- OpenAI (for access to the general completions() API for use in the Schema Indexing interaction paradigm)
- Anthropic (for access to claude-3-7-sonnet)
- HuggingFace (for access to gated Meta-Llama repositories)
See config/sample.env for a complete list of the specific information needed from each provider. Once complete, please move this file to config/.env for seamless use in the current experiment setup.
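A .env file is a plain list of KEY=VALUE lines. As a minimal stdlib-only sketch of how such a file can be loaded into the process environment (the actual pipeline may use a library such as python-dotenv instead, and the key names shown in the usage comment are hypothetical — consult config/sample.env for the real ones):

```python
import os

def load_env(path):
    """Load KEY=VALUE pairs from a .env-style file into os.environ.

    Skips blank lines and comments, and strips surrounding quotes from values.
    """
    loaded = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip().strip('"')
    os.environ.update(loaded)
    return loaded

# Hypothetical usage; key names are illustrative, see config/sample.env:
# load_env("config/.env")
# api_key = os.environ["AZURE_OPENAI_API_KEY"]
```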
Our benchmark dataset and associated database tabular data can be found on HuggingFace.
We will soon provide code to create a fresh BigQuery database from the parquet files available alongside the BiomedSQL benchmark on HuggingFace.
Reviewers will be provided with a pre-configured config/service_account.json file for access to the current database.
To run the isolated SQL generation experiments for BiomedSQL, run:
python run_llm_experiments.py
We currently use the following open-source models, with the compute requirements needed to run our experiment pipeline as-is:
- meta-llama/Llama-3.1-70B-Instruct (three NVIDIA 80GB A100 GPUs)
- Qwen/Qwen2.5-Coder-32B-Instruct (two NVIDIA 80GB A100 GPUs)
- Qwen/Qwen2.5-Coder-14B-Instruct (two NVIDIA 80GB A100 GPUs)
We understand that GPU access may differ from user to user. To run our experiments without the need for GPUs, please comment out any models specified with provider: huggingface under the experiment_models section of config/llm_config.yaml.
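As an illustrative sketch (the experiment_models section, provider: huggingface labels, and model names come from the setup described above; all other keys and the closed-source provider label are hypothetical), the edited config/llm_config.yaml might look like:

```yaml
experiment_models:
  - name: gpt-4o
    provider: azure_openai   # hypothetical label for a non-GPU, API-based model
  # Commented out to skip the GPU-bound open-source models:
  # - name: meta-llama/Llama-3.1-70B-Instruct
  #   provider: huggingface
  # - name: Qwen/Qwen2.5-Coder-32B-Instruct
  #   provider: huggingface
  # - name: Qwen/Qwen2.5-Coder-14B-Instruct
  #   provider: huggingface
```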
To run the interaction paradigm experiments for BiomedSQL, run:
python run_interaction_experiments.py
To generate results figures and tables after the experiments have finished, run:
python results.py
Tables will show up in results and plots will show up in results/plots.
On BiomedSQL, GPT-o3-mini is consistently the top-performing model across the variety of experiments performed. However, even when paired with our custom-built text-to-SQL system (BMSQL), it still falls short of domain-expert-level performance.
| Model name | Execution Accuracy | Response Quality Rate |
|---|---|---|
| GPT-o3-mini-baseline | 53.5% | 73.3% |
| GPT-o3-mini-combo | 59.0% | 77.8% |
| BMSQL-GPT-o3-mini | 62.6% | 84.6% |
This repository is under the PolyForm Noncommercial License (Version 1.0.0). To contribute, simply clone the repository and open a pull request!
@article{koretsky2025biomedsql,
title = {BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases},
author = {Mathew J. Koretsky and Maya Willey and Adi Asija and Owen Bianchi and Chelsea X. Alvarado and Tanay Nayak and Nicole Kuznetsov and Sungwon Kim and Mike A. Nalls and Daniel Khashabi and Faraz Faghri},
year = {2025},
eprint = {2505.20321},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2505.20321},
code = {https://github.com/NIH-CARD/biomedsql},
}
Happy benchmarking!