mRNABench

This repository contains the code for mRNABench, which benchmarks the embedding quality of genomic foundation models on mRNA-specific tasks. mRNABench provides a catalogue of datasets and data-splitting logic that can be used to evaluate the embeddings of the catalogued models.

Paper: BioRxiv Link
Notebook Example: Colab Notebook
Dataset Repository: HuggingFace Collection

Setup

Several configurations of the mRNABench are available.

Datasets Only

If you are interested in the benchmark datasets only, you can run:

pip install mrna-bench

Base Models

Important

Requirements: PyTorch 2.2.2 and CUDA 12.1+ are required for base models installation.

The inference-capable version of mRNABench, which can generate embeddings using most models (except Evo2 and Helix-mRNA), can be installed as shown below.

conda create --name mrna_bench python=3.10
conda activate mrna_bench

pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install "mrna-bench[base_models]"

Inference with other models requires installing that model's dependencies first. These are usually listed on the model's GitHub page (see below).

Post-install

Important

After installation, please run the following in Python to set where data associated with the benchmarks will be stored.

import mrna_bench as mb

path_to_dir_to_store_data = "DESIRED_PATH"
mb.update_data_path(path_to_dir_to_store_data)

path_to_dir_to_store_weights = "DESIRED_PATH_FOR_MODEL_WEIGHTS"
mb.update_model_weights_path(path_to_dir_to_store_weights)

Evo2

Evo2 has a more involved installation process and can only be run on H100 GPUs. See Evo2 Setup.

Dev Mode

Dev mode allows generation of datasets from scratch and includes access to the RNA-Fazal localization dataset. See Dev Mode Setup.

Usage

Datasets can be retrieved using:

import mrna_bench as mb

dataset = mb.load_dataset("go-mf")
data_df = dataset.data_df

The mRNABench can also be used to test out common genomic foundation models:

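import torch

import mrna_bench as mb
from mrna_bench.embedder import DatasetEmbedder
from mrna_bench.linear_probe import LinearProbeBuilder

# Requires a CUDA-capable GPU.
device = torch.device("cuda")

# Load a benchmark dataset and a catalogued model version.
dataset = mb.load_dataset("go-mf")
model = mb.load_model("Orthrus", "orthrus-large-6-track", device)

# Embed the dataset, then convert the result to a NumPy array on the CPU.
embedder = DatasetEmbedder(model, dataset)
embeddings = embedder.embed_dataset()
embeddings = embeddings.detach().cpu().numpy()

# Configure a linear probe: homology-based split, multilabel evaluation,
# and the dataframe column to predict.
prober = (LinearProbeBuilder(dataset)
    .fetch_embedding_by_embedding_instance("orthrus-large-6", embeddings)
    .build_splitter("homology", species="human", eval_all_splits=False)
    .build_evaluator("multilabel")
    .set_target("target")
    .build()
)

# Run the probe with the given random seed and print the resulting metrics.
metrics = prober.run_linear_probe(2541)
print(metrics)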

Also see the scripts/ folder for example scripts that use Slurm to embed dataset chunks in parallel to reduce runtime, as well as an example of multi-seed linear probing.
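Multi-seed probing amounts to repeating the probe with several seeds and summarizing the spread. The following is a minimal sketch, not the script shipped in scripts/. The helper `aggregate_seed_metrics` is hypothetical, and it assumes that the probe call returns a flat dictionary mapping metric names to floats, which this README does not confirm.

```python
from statistics import mean, stdev
from typing import Callable, Dict, Iterable


def aggregate_seed_metrics(
    run_probe: Callable[[int], Dict[str, float]],
    seeds: Iterable[int],
) -> Dict[str, Dict[str, float]]:
    """Run a probe once per seed and report mean and standard deviation.

    Hypothetical helper: `run_probe` is assumed to take a seed and return
    a flat {metric_name: value} dictionary.
    """
    per_seed = [run_probe(seed) for seed in seeds]
    if not per_seed:
        raise ValueError("at least one seed is required")

    summary = {}
    for name in per_seed[0]:
        values = [metrics[name] for metrics in per_seed]
        summary[name] = {
            "mean": mean(values),
            # Sample standard deviation is undefined for a single run.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```

With the `prober` built above, this could be called as `aggregate_seed_metrics(prober.run_linear_probe, [0, 1, 2])`, provided the returned metrics match the assumed shape.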

Model Catalog

The models supported by the base_models installation are catalogued below.

RNA Foundation Models

Model Name	Model Versions	Description	Citation
Orthrus	`orthrus-large-6-track` `orthrus-base-4-track`	Mamba-based RNA foundation model pre-trained using contrastive learning on 45M RNA transcripts to capture functional and evolutionary relationships. 6-track version incorporates CDS and splice site information.	[Code] [Paper]
RNA-FM	`rna-fm` `mrna-fm`	Transformer-based RNA foundation model pre-trained using MLM. RNA-FM trained on 23M ncRNA sequences, mRNA-FM trained on mRNA CDS regions using codon tokenizer.	[Github]
SpliceBERT	`SpliceBERT.1024nt` `SpliceBERT-human.510nt` `SpliceBERT.510nt`	Transformer-based RNA foundation model trained on 2M vertebrate mRNA sequences using MLM. Specialized for splice site prediction with human-only and context-length variants.	[Github]
RiNALMo	`rinalmo`	Transformer-based RNA foundation model trained on 36M ncRNA sequences using MLM with modern architectural improvements including RoPE, SwiGLU activations, and Flash Attention.	[Github]
UTR-LM	`utrlm-te_el` `utrlm-mrl` `utrlm-*-utronly`	Transformer-based RNA foundation model specialized for 5'UTR sequences. Pre-trained on random and endogenous UTR sequences from various species. UTR-only variants automatically extract 5'UTRs.	[Github]
3UTRBERT	`utrbert-3mer` `utrbert-4mer` `utrbert-5mer` `utrbert-6mer` `utrbert-*-utronly`	Transformer-based RNA foundation model specialized for 3'UTR regions. Uses k-mer tokenization (3-6mers) and trained on 100k 3'UTR sequences. UTR-only variants automatically extract 3'UTRs.	[Github]
RNA-MSM	`rnamsm`	Structure-aware RNA foundation model trained using multiple sequence alignments from custom structure-based homology mapping across ~4000 RNA families.	[Github]
RNAErnie	`rnaernie`	Transformer-based RNA foundation model trained using MLM with motif-level masking strategy on 23M ncRNA sequences. Uses contiguous token masking to learn RNA motifs.	[Github]
ERNIE-RNA	`ernierna` `ernierna-ss`	Transformer-based RNA foundation model with structural attention bias. Trained on 20M ncRNA sequences with custom attention incorporating RNA base pairing rules. SS version fine-tuned on structural tasks.	[Github]
RNABERT	`rnabert`	Transformer-based RNA foundation model with dual training objectives combining MLM and structural alignment learning. Trained on 80k ncRNA sequences.	[Github]
CodonBERT	`codonbert`	Transformer-based RNA foundation model trained on 10M+ mRNA sequences from mammals, bacteria, and viruses. Specialized for coding regions and mRNA properties.	[Github]
Helix-mRNA	`helix-mrna`	Hybrid Mamba2/Transformer model trained on 26M diverse eukaryotic and viral mRNAs. Features CDS-aware tokenization with special tokens at codon boundaries.	[Github]

DNA Foundation Models

Model Name	Model Versions	Description	Citation
DNABERT2	`dnabert2`	Modern Transformer-based DNA foundation model with BPE tokenization and rotary positional encoding. Pre-trained using MLM on multi-species genomic datasets.	[Github]
DNABERT-S	`dnabert-s`	Species-aware DNA foundation model trained with contrastive learning to encourage species grouping while discouraging cross-species associations. Covers microbial genomes including viruses, fungi, and bacteria.	[Github]
Nucleotide Transformer	`2.5b-multi-species` `2.5b-1000g` `500m-human-ref` `500m-1000g` `v2-50m-multi-species` `v2-100m-multi-species` `v2-250m-multi-species` `v2-500m-multi-species`	Transformer-based DNA foundation model family with 6-mer tokenization. Available in multiple sizes (50M-2.5B parameters) trained on various genomic datasets from human reference to multi-species collections.	[Github]
HyenaDNA	`hyenadna-large-1m-seqlen-hf` `hyenadna-medium-450k-seqlen-hf` `hyenadna-medium-160k-seqlen-hf` `hyenadna-small-32k-seqlen-hf` `hyenadna-tiny-16k-seqlen-d128-hf`	Hyena-based DNA foundation model with near-linear scaling and ultra-long context capability. Pre-trained using next token prediction on human reference genome with various model sizes and sequence lengths.	[Github]
Evo1	`evo-1.5-8k-base` `evo-1-8k-base` `evo-1-131k-base`	StripedHyena-based DNA foundation model trained autoregressively on OpenGenome dataset at single nucleotide, byte-level resolution. Offers near-linear scaling with ultra-long context variants up to 131k nucleotides.	[Github]
Evo2	`evo2_40b` `evo2_40b_base` `evo2_7b` `evo2_7b_base` `evo2_1b_base`	Next-generation StripedHyena2-based DNA foundation model trained on OpenGenome2 dataset. Provides multi-layer embeddings with ultra-long context capability up to 1M nucleotides for large model variants.	[Github]

Baseline Models

Model Name	Model Versions	Description	Citation
NaiveBaseline	`naive-4-track` `naive-6-track`	Non-neural baseline using traditional sequence features including k-mer counts (3-7mers), GC content, and sequence statistics. 6-track version adds CDS length and exon count from structural annotations.	N/A
NaiveMamba	`naive-mamba`	Randomly initialized Mamba model serving as an untrained baseline. Uses 6-track input (sequence + CDS + splice information) with fixed random seed for reproducible comparisons.	N/A
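
To make the NaiveBaseline description more concrete, the sketch below computes k-mer counts and GC content for a single sequence. It only illustrates these feature types and is not the repository's implementation. The function names, the use of overlapping k-mers, and the A/C/G/T alphabet are assumptions.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence


def kmer_counts(sequence: str, k: int) -> Dict[str, int]:
    """Count overlapping k-mers, including k-mers that never occur."""
    seq = sequence.upper().replace("U", "T")
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all 4**k k-mers so every sequence yields the same feature order.
    all_kmers = ("".join(kmer) for kmer in product("ACGT", repeat=k))
    return {kmer: observed.get(kmer, 0) for kmer in all_kmers}


def gc_content(sequence: str) -> float:
    """Fraction of G and C nucleotides; 0.0 for an empty sequence."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def naive_features(sequence: str, ks: Sequence[int] = (3,)) -> List[float]:
    """Concatenate GC content with k-mer counts for each k in `ks`."""
    features: List[float] = [gc_content(sequence)]
    for k in ks:
        features.extend(kmer_counts(sequence, k).values())
    return features
```

Passing `ks=(3, 4, 5, 6, 7)` would cover the 3-7mer range named in the table, at the cost of a much longer feature vector.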

Note

Many of the model wrappers (3UTRBERT, RiNALMo, UTR-LM, RNA-MSM, RNAErnie) use reimplementations from the multimolecule package. See their website for more details.

Adding a new model

All models should inherit from the template EmbeddingModel. Each model file should lazily load its dependencies within its __init__ method so that each model can be used individually without installing all the other models. Models must implement get_model_short_name(model_version), which fetches the internal name for the model. This name must be unique for every model version and must not contain underscores. Models should implement either embed_sequence or embed_sequence_sixtrack (see code for method signature). New models should be added to MODEL_CATALOG.
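
The pattern described above can be sketched as follows. This is an illustration under stated assumptions, not code from the repository: `EmbeddingModelStandIn` is a hypothetical placeholder for the real EmbeddingModel template, `ToyFrequencyModel` is invented for the example, and the real `embed_sequence` signature should be taken from the code. Registration in MODEL_CATALOG is omitted because its structure is not described here.

```python
import importlib
from typing import List


class EmbeddingModelStandIn:
    """Hypothetical stand-in for the EmbeddingModel template (not the real class)."""

    def __init__(self, model_version: str, device: str):
        self.model_version = model_version
        self.device = device


class ToyFrequencyModel(EmbeddingModelStandIn):
    """Illustrative model wrapper; not part of mRNABench."""

    @staticmethod
    def get_model_short_name(model_version: str) -> str:
        # Short names must be unique per version and free of underscores.
        return model_version.replace("_", "-")

    def __init__(self, model_version: str, device: str = "cpu"):
        super().__init__(model_version, device)
        # Lazy import: resolved only when this model is instantiated, so other
        # models remain usable without this dependency installed.
        # "collections" stands in for a heavy, model-specific package.
        self._backend = importlib.import_module("collections")

    def embed_sequence(self, sequence: str) -> List[float]:
        # Toy embedding: relative frequency of each nucleotide (A, C, G, T).
        counts = self._backend.Counter(sequence.upper().replace("U", "T"))
        total = max(len(sequence), 1)
        return [counts[base] / total for base in "ACGT"]
```

Keeping the import inside `__init__` rather than at module level is what lets the module itself be imported when the model's dependencies are absent.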

Dataset Catalog

The current datasets catalogued are:

Gene Function Annotation

Dataset Name	Catalogue Identifier	Description	Tasks	Citation
GO Molecular Function	`go-mf`	Classification of the molecular function of a transcript's product as defined by the GO Resource.	`multilabel`	website
GO Biological Process	`go-bp`	Classification of the biological process a transcript's product participates in as defined by the GO Resource.	`multilabel`	website
GO Cellular Component	`go-cc`	Classification of the cellular component where a transcript's product is localized as defined by the GO Resource.	`multilabel`	website

Translation Regulation

Dataset Name	Catalogue Identifier	Description	Tasks	Citation
Mean Ribosome Load (Sugimoto)	`mrl-sugimoto`	Mean ribosome load (MRL) per transcript isoform as measured in human cells using isoform-resolved ribosome profiling.	`regression`	paper
Mean Ribosome Load (Sample)	`mrl-sample-egfp` `mrl-sample-mcherry` `mrl-sample-designed` `mrl-sample-varying`	Mean ribosome load (MRL) measured in an MPRA of randomized and designed 5'UTR regions attached to eGFP or mCherry reporters. Includes various RNA modifications and UTR lengths.	`regression`	paper
Mean Ribosome Load & Half-life	`mrl-hl-lbkwk`	Joint prediction of ribosome load and RNA half-life from synthetic mRNA sequences in the Leppek et al. dataset.	`regression`	paper

RNA Stability

Dataset Name	Catalogue Identifier	Description	Tasks	Citation
RNA Half-life (Human)	`rnahl-human`	RNA half-life of human transcripts measured using time-course RNA-seq after transcription inhibition.	`regression`	paper
RNA Half-life (Mouse)	`rnahl-mouse`	RNA half-life of mouse transcripts measured using time-course RNA-seq after transcription inhibition.	`regression`	paper

Protein-RNA Interactions

Dataset Name	Catalogue Identifier	Description	Tasks	Citation
eCLIP RBP Binding (K562)	`eclip-binding-k562`	RNA-binding protein (RBP) binding sites on mRNA sequences identified using eCLIP-seq in K562 cells. Covers ~80 different RBPs.	`multilabel`	paper
eCLIP RBP Binding (HepG2)	`eclip-binding-hepg2`	RNA-binding protein (RBP) binding sites on mRNA sequences identified using eCLIP-seq in HepG2 cells. Covers ~70 different RBPs.	`multilabel`	paper

Subcellular Localization

Dataset Name	Catalogue Identifier	Description	Tasks	Citation
Protein Subcellular Localization	`prot-loc`	Subcellular localization of transcript protein products based on experimental evidence from the Human Protein Atlas.	`multilabel`	website
RNA Subcellular Localization (Fazal)	`rna-loc-fazal`	Subcellular localization of mRNA molecules measured using APEX-seq (proximity labeling + RNA-seq) in human cells.	`multilabel`	paper
RNA Subcellular Localization (Ietswaart)	`rna-loc-ietswaart`	Subcellular localization of mRNA molecules in human cells using compartment-specific RNA-seq approaches.	`multilabel`	paper

Variant Effect Prediction

Dataset Name	Catalogue Identifier	Description	Tasks	Citation
VEP TraitGym (Mendelian)	`vep-traitgym-mendelian`	Pathogenicity prediction for genetic variants in 3'UTR and 5'UTR regions associated with Mendelian diseases.	`classification`	paper
VEP TraitGym (Complex)	`vep-traitgym-complex`	Pathogenicity prediction for genetic variants in 3'UTR and 5'UTR regions associated with complex traits.	`classification`	paper

Adding a new dataset

New datasets should inherit from BenchmarkDataset. Dataset names cannot contain underscores. Each new dataset should download raw data and process it into a dataframe by overriding process_raw_data. This dataframe should store one transcript per row, with the sequence string-encoded in the sequence column. If homology splitting is required, a gene column containing gene names is also required. Six-track embedding additionally requires the columns cds and splice. The target column can have any name, as it is specified at probing time. New datasets should be added to DATASET_CATALOG.
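
The column and naming rules above can be checked before registering a dataset. The sketch below is a hypothetical pre-flight check, not part of mRNABench. It encodes only the rules stated in this section, and the function names are invented for illustration.

```python
from typing import Iterable, List


def missing_columns(
    columns: Iterable[str],
    homology_split: bool = False,
    six_track: bool = False,
) -> List[str]:
    """Return required columns that are absent from a processed dataframe."""
    required = ["sequence"]
    if homology_split:
        # Homology splitting needs gene names.
        required.append("gene")
    if six_track:
        # Six-track embedding needs CDS and splice annotations.
        required.extend(["cds", "splice"])
    present = set(columns)
    return [name for name in required if name not in present]


def is_valid_dataset_name(name: str) -> bool:
    """Dataset names cannot contain underscores."""
    return bool(name) and "_" not in name
```

For a processed dataframe `df`, a call such as `missing_columns(df.columns, homology_split=True)` returns an empty list when the layout satisfies the stated requirements.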

Citation

If you use mRNABench in your research, please cite:

@article{shi_dalal_fradkin_2025_mrnabench,
    author = {Shi, Ruian and Dalal, Taykhoom and Fradkin, Philip and Koyyalagunta, Divya and Chhabria, Simran and Jung, Andrew and Tam, Cyrus and Ceyhan, Defne and Lin, Jessica and Laverty, Kaitlin U. and Baali, Ilyes and Wang, Bo and Morris, Quaid},
    title = {mRNABench: A curated benchmark for mature mRNA property and function prediction},
    elocation-id = {2025.07.05.662870},
    year = {2025},
    doi = {10.1101/2025.07.05.662870},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/07/08/2025.07.05.662870},
    eprint = {https://www.biorxiv.org/content/early/2025/07/08/2025.07.05.662870.full.pdf},
    journal = {bioRxiv}
}

The original sources for each dataset and model should be cited if used, and can be found above. A substantial number of model implementations use the multimolecule package: https://huggingface.co/multimolecule; citation information can be found on their HuggingFace page.

Evo2 Setup

Inference using Evo2 requires installing the following in its own environment. Note: There may be an issue where the evo2_40b models, when downloaded, have their merged checkpoints stored one directory above the HuggingFace hub cache. You may need to manually move the checkpoint into its corresponding snapshot directory: /hub/models--arcinstitute-evo2_40b*/snapshots/snapshot_name/

Hardware Requirements: Evo2 can only be run on H100 GPUs.

conda create --name evo_bench -c conda-forge python=3.11 gxx=12.2.0 -y
conda activate evo_bench

pip install torch==2.6.0+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install vtx==1.0.4
pip install evo2==0.2.0
pip install flash-attn==2.7.4.post1

cd path/to/mRNA/bench
pip install -e .
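
The manual checkpoint move mentioned in the note at the top of this section can also be scripted. The sketch below is a generic, hypothetical helper. It assumes nothing about the cache layout or the checkpoint file name, both of which must be supplied by the caller after inspecting the local HuggingFace cache.

```python
import shutil
from pathlib import Path


def move_checkpoint_into_snapshot(checkpoint: Path, snapshot_dir: Path) -> Path:
    """Move a checkpoint file into a snapshot directory and return its new path.

    Both paths are caller-supplied; no cache layout is assumed.
    """
    checkpoint = Path(checkpoint)
    snapshot_dir = Path(snapshot_dir)
    if not checkpoint.is_file():
        raise FileNotFoundError(f"checkpoint not found: {checkpoint}")
    if not snapshot_dir.is_dir():
        raise NotADirectoryError(f"snapshot directory not found: {snapshot_dir}")

    destination = snapshot_dir / checkpoint.name
    if destination.exists():
        # Never overwrite a file that is already in the snapshot directory.
        raise FileExistsError(f"refusing to overwrite: {destination}")
    shutil.move(str(checkpoint), str(destination))
    return destination
```

The function refuses to overwrite an existing file, so it is safe to re-run after a partial or repeated download.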

Dev Mode Setup

Dev mode requires additional dependencies for generating datasets from scratch and accessing the RNA-Fazal localization dataset.

conda create --name mrna_bench_dev python=3.10
conda activate mrna_bench_dev

# Install genome-kit first
conda install -c conda-forge genome_kit=7.1.0
conda install -c conda-forge gcc_linux-64 gxx_linux-64
# Note: You might need to add gcc compilers to LD_LIBRARY_PATH if you encounter linking issues

pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install "mrna-bench[base_models]"
