mRNABench

Python 3.10+ | License: AGPL v3

This repository contains the code for mRNABench, which benchmarks the embedding quality of genomic foundation models on mRNA-specific tasks. mRNABench provides a catalogue of datasets and dataset-splitting logic that can be used to evaluate the embeddings of the catalogued models.

Paper: BioRxiv Link
Notebook Example: Colab Notebook
Dataset Repository: HuggingFace Collection

Setup

mRNABench can be installed in several configurations.

Datasets Only

If you are interested in the benchmark datasets only, you can run:

pip install mrna-bench

Base Models

Important

Requirements: PyTorch 2.2.2 and CUDA 12.1+ are required for base models installation.

The inference-capable version of mRNABench, which can generate embeddings using most models (all except Evo2 and Helix-mRNA), can be installed as shown below.

conda create --name mrna_bench python=3.10
conda activate mrna_bench

pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install "mrna-bench[base_models]"

Inference with other models will require the installation of the model's dependencies first, which are usually listed on the model's GitHub page (see below).

Post-install

Important

After installation, please run the following in Python to set where data associated with the benchmarks will be stored.

import mrna_bench as mb

path_to_dir_to_store_data = "DESIRED_PATH"
mb.update_data_path(path_to_dir_to_store_data)

path_to_dir_to_store_weights = "DESIRED_PATH_FOR_MODEL_WEIGHTS"
mb.update_model_weights_path(path_to_dir_to_store_weights)
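
Both setters take a directory path. The following is a minimal sketch, assuming the directories should exist before they are registered (the text above does not say whether mRNABench creates them). `prepare_storage_dirs`, `register_storage`, and the sub-directory names `data` and `model_weights` are hypothetical choices for illustration, not part of mRNABench.

```python
import tempfile
from pathlib import Path


def prepare_storage_dirs(base_dir):
    """Create data/ and model_weights/ under base_dir and return both paths.

    Hypothetical helper; the sub-directory names are arbitrary.
    """
    base = Path(base_dir).expanduser().resolve()
    data_dir = base / "data"
    weights_dir = base / "model_weights"
    for directory in (data_dir, weights_dir):
        directory.mkdir(parents=True, exist_ok=True)
    return str(data_dir), str(weights_dir)


def register_storage(base_dir):
    """Create the directories, then register them with mRNABench."""
    import mrna_bench as mb  # imported here so the helper above works without it

    data_dir, weights_dir = prepare_storage_dirs(base_dir)
    mb.update_data_path(data_dir)
    mb.update_model_weights_path(weights_dir)


# Demonstration only: a temporary directory stands in for a real location.
data_dir, weights_dir = prepare_storage_dirs(tempfile.mkdtemp())
print(data_dir)
print(weights_dir)
```

With mRNABench installed, calling `register_storage` with a permanent location would apply both settings in one step.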

Evo2

Evo2 has a more involved installation and can only be run on H100 GPUs. See Evo2 Setup.

Dev Mode

Dev mode allows generation of datasets from scratch and includes access to the RNA-Fazal localization dataset. See Dev Mode Setup.

Usage

Datasets can be retrieved using:

import mrna_bench as mb

dataset = mb.load_dataset("go-mf")
data_df = dataset.data_df

The mRNABench can also be used to test out common genomic foundation models:

import torch

import mrna_bench as mb
from mrna_bench.embedder import DatasetEmbedder
from mrna_bench.linear_probe import LinearProbeBuilder

device = torch.device("cuda")

dataset = mb.load_dataset("go-mf")
model = mb.load_model("Orthrus", "orthrus-large-6-track", device)

embedder = DatasetEmbedder(model, dataset)
embeddings = embedder.embed_dataset()
embeddings = embeddings.detach().cpu().numpy()

prober = (LinearProbeBuilder(dataset)
    .fetch_embedding_by_embedding_instance("orthrus-large-6", embeddings)
    .build_splitter("homology", species="human", eval_all_splits=False)
    .build_evaluator("multilabel")
    .set_target("target")
    .build()
)

metrics = prober.run_linear_probe(2541)
print(metrics)

Also see the scripts/ folder for example scripts that use Slurm to embed dataset chunks in parallel to reduce runtime, as well as an example of multi-seed linear probing.
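
Multi-seed probing amounts to repeating the probe with different seeds and summarizing the spread. The following is a minimal sketch, assuming that `run_linear_probe(seed)` returns a flat dict mapping metric names to numbers (the return type is not documented here). `aggregate_over_seeds`, `fake_probe`, and the metric name `val_metric` are hypothetical and exist only for illustration; the numbers are made up.

```python
from statistics import mean, stdev


def aggregate_over_seeds(run_probe, seeds):
    """Run run_probe(seed) for each seed and return {metric: (mean, std)}."""
    per_seed = [run_probe(seed) for seed in seeds]
    if not per_seed:
        raise ValueError("at least one seed is required")
    summary = {}
    for name in per_seed[0]:
        values = [metrics[name] for metrics in per_seed]
        # The sample standard deviation needs at least two values.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[name] = (mean(values), spread)
    return summary


def fake_probe(seed):
    """Stand-in for prober.run_linear_probe; returns made-up numbers."""
    return {"val_metric": 0.5 + 0.01 * (seed % 3)}


summary = aggregate_over_seeds(fake_probe, seeds=[0, 1, 2])
for name, (avg, spread) in summary.items():
    print(f"{name}: {avg:.3f} +/- {spread:.3f}")
```

With a real prober, `prober.run_linear_probe` would be passed in place of `fake_probe`, provided its return value matches the assumption above.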

Model Catalog

The models supported by the base_models installation are catalogued below.

RNA Foundation Models

| Model Name | Model Versions | Description | Citation |
| --- | --- | --- | --- |
| Orthrus | orthrus-large-6-track, orthrus-base-4-track | Mamba-based RNA foundation model pre-trained using contrastive learning on 45M RNA transcripts to capture functional and evolutionary relationships. 6-track version incorporates CDS and splice site information. | [Code] [Paper] |
| RNA-FM | rna-fm, mrna-fm | Transformer-based RNA foundation model pre-trained using MLM. RNA-FM trained on 23M ncRNA sequences, mRNA-FM trained on mRNA CDS regions using codon tokenizer. | [Github] |
| SpliceBERT | SpliceBERT.1024nt, SpliceBERT-human.510nt, SpliceBERT.510nt | Transformer-based RNA foundation model trained on 2M vertebrate mRNA sequences using MLM. Specialized for splice site prediction with human-only and context-length variants. | [Github] |
| RiNALMo | rinalmo | Transformer-based RNA foundation model trained on 36M ncRNA sequences using MLM with modern architectural improvements including RoPE, SwiGLU activations, and Flash Attention. | [Github] |
| UTR-LM | utrlm-te_el, utrlm-mrl, utrlm-*-utronly | Transformer-based RNA foundation model specialized for 5'UTR sequences. Pre-trained on random and endogenous UTR sequences from various species. UTR-only variants automatically extract 5'UTRs. | [Github] |
| 3UTRBERT | utrbert-3mer, utrbert-4mer, utrbert-5mer, utrbert-6mer, utrbert-*-utronly | Transformer-based RNA foundation model specialized for 3'UTR regions. Uses k-mer tokenization (3-6mers) and trained on 100k 3'UTR sequences. UTR-only variants automatically extract 3'UTRs. | [Github] |
| RNA-MSM | rnamsm | Structure-aware RNA foundation model trained using multiple sequence alignments from custom structure-based homology mapping across ~4000 RNA families. | [Github] |
| RNAErnie | rnaernie | Transformer-based RNA foundation model trained using MLM with motif-level masking strategy on 23M ncRNA sequences. Uses contiguous token masking to learn RNA motifs. | [Github] |
| ERNIE-RNA | ernierna, ernierna-ss | Transformer-based RNA foundation model with structural attention bias. Trained on 20M ncRNA sequences with custom attention incorporating RNA base pairing rules. SS version fine-tuned on structural tasks. | [Github] |
| RNABERT | rnabert | Transformer-based RNA foundation model with dual training objectives combining MLM and structural alignment learning. Trained on 80k ncRNA sequences. | [Github] |
| CodonBERT | codonbert | Transformer-based RNA foundation model trained on 10M+ mRNA sequences from mammals, bacteria, and viruses. Specialized for coding regions and mRNA properties. | [Github] |
| Helix-mRNA | helix-mrna | Hybrid Mamba2/Transformer model trained on 26M diverse eukaryotic and viral mRNAs. Features CDS-aware tokenization with special tokens at codon boundaries. | [Github] |

DNA Foundation Models

| Model Name | Model Versions | Description | Citation |
| --- | --- | --- | --- |
| DNABERT2 | dnabert2 | Modern Transformer-based DNA foundation model with BPE tokenization and rotary positional encoding. Pre-trained using MLM on multi-species genomic datasets. | [Github] |
| DNABERT-S | dnabert-s | Species-aware DNA foundation model trained with contrastive learning to encourage species grouping while discouraging cross-species associations. Covers microbial genomes including viruses, fungi, and bacteria. | [Github] |
| Nucleotide Transformer | 2.5b-multi-species, 2.5b-1000g, 500m-human-ref, 500m-1000g, v2-50m-multi-species, v2-100m-multi-species, v2-250m-multi-species, v2-500m-multi-species | Transformer-based DNA foundation model family with 6-mer tokenization. Available in multiple sizes (50M-2.5B parameters) trained on various genomic datasets from human reference to multi-species collections. | [Github] |
| HyenaDNA | hyenadna-large-1m-seqlen-hf, hyenadna-medium-450k-seqlen-hf, hyenadna-medium-160k-seqlen-hf, hyenadna-small-32k-seqlen-hf, hyenadna-tiny-16k-seqlen-d128-hf | Hyena-based DNA foundation model with near-linear scaling and ultra-long context capability. Pre-trained using next token prediction on human reference genome with various model sizes and sequence lengths. | [Github] |
| Evo1 | evo-1.5-8k-base, evo-1-8k-base, evo-1-131k-base | StripedHyena-based DNA foundation model trained autoregressively on OpenGenome dataset at single nucleotide, byte-level resolution. Offers near-linear scaling with ultra-long context variants up to 131k nucleotides. | [Github] |
| Evo2 | evo2_40b, evo2_40b_base, evo2_7b, evo2_7b_base, evo2_1b_base | Next-generation StripedHyena2-based DNA foundation model trained on OpenGenome2 dataset. Provides multi-layer embeddings with ultra-long context capability up to 1M nucleotides for large model variants. | [Github] |

Baseline Models

| Model Name | Model Versions | Description | Citation |
| --- | --- | --- | --- |
| NaiveBaseline | naive-4-track, naive-6-track | Non-neural baseline using traditional sequence features including k-mer counts (3-7mers), GC content, and sequence statistics. 6-track version adds CDS length and exon count from structural annotations. | N/A |
| NaiveMamba | naive-mamba | Randomly initialized Mamba model serving as an untrained baseline. Uses 6-track input (sequence + CDS + splice information) with fixed random seed for reproducible comparisons. | N/A |

Note

Many of the model wrappers (3UTRBERT, RiNALMo, UTR-LM, RNA-MSM, RNAErnie) use reimplementations from the multimolecule package. See their website for more details.

Adding a new model

All models should inherit from the template EmbeddingModel. Each model file should lazily load its dependencies within its __init__ method so that each model can be used individually without installing the dependencies of every other model. Models must implement get_model_short_name(model_version), which returns the internal name for the model; this name must be unique for every model version and must not contain underscores. Models should implement either embed_sequence or embed_sequence_sixtrack (see code for method signature). New models should be added to MODEL_CATALOG.
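
The conventions above can be sketched with a toy model. This is an illustration under stated assumptions, not the real interface: `EmbeddingModelStandIn` is a local stand-in for the EmbeddingModel template, whose constructor and method signatures are defined in the mRNABench code and are only guessed here. `ToyCompositionModel` and the `toycomp-` prefix are hypothetical.

```python
import importlib


class EmbeddingModelStandIn:
    """Local stand-in for the EmbeddingModel template (signature assumed)."""

    def __init__(self, model_version, device):
        self.model_version = model_version
        self.device = device


class ToyCompositionModel(EmbeddingModelStandIn):
    """Toy model that embeds a sequence as its A/C/G/T frequencies."""

    @staticmethod
    def get_model_short_name(model_version):
        # Short names must be unique per version and free of underscores.
        return "toycomp-" + model_version.replace("_", "-")

    def __init__(self, model_version, device="cpu"):
        super().__init__(model_version, device)
        # Lazy import: the dependency is loaded only when this model is
        # instantiated. "collections" stands in for a heavy package.
        self._collections = importlib.import_module("collections")

    def embed_sequence(self, sequence):
        # Assumed signature; the real one is defined in the mRNABench code.
        counts = self._collections.Counter(sequence.upper())
        total = max(len(sequence), 1)
        return [counts[base] / total for base in "ACGT"]


model = ToyCompositionModel("v1_small")
print(ToyCompositionModel.get_model_short_name("v1_small"))
print(model.embed_sequence("ACGTAC"))
```

Registration in MODEL_CATALOG is left out because its structure is not described here.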

Dataset Catalog

The current datasets catalogued are:

Gene Function Annotation

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
| --- | --- | --- | --- | --- |
| GO Molecular Function | go-mf | Classification of the molecular function of a transcript's product as defined by the GO Resource. | multilabel | website |
| GO Biological Process | go-bp | Classification of the biological process a transcript's product participates in as defined by the GO Resource. | multilabel | website |
| GO Cellular Component | go-cc | Classification of the cellular component where a transcript's product is localized as defined by the GO Resource. | multilabel | website |

Translation Regulation

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
| --- | --- | --- | --- | --- |
| Mean Ribosome Load (Sugimoto) | mrl-sugimoto | Mean ribosome load (MRL) per transcript isoform as measured in human cells using isoform-resolved ribosome profiling. | regression | paper |
| Mean Ribosome Load (Sample) | mrl-sample-egfp, mrl-sample-mcherry, mrl-sample-designed, mrl-sample-varying | Mean ribosome load (MRL) measured in an MPRA of randomized and designed 5'UTR regions attached to eGFP or mCherry reporters. Includes various RNA modifications and UTR lengths. | regression | paper |
| Mean Ribosome Load & Half-life | mrl-hl-lbkwk | Joint prediction of ribosome load and RNA half-life from synthetic mRNA sequences in the Leppek et al. dataset. | regression | paper |

RNA Stability

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
| --- | --- | --- | --- | --- |
| RNA Half-life (Human) | rnahl-human | RNA half-life of human transcripts measured using time-course RNA-seq after transcription inhibition. | regression | paper |
| RNA Half-life (Mouse) | rnahl-mouse | RNA half-life of mouse transcripts measured using time-course RNA-seq after transcription inhibition. | regression | paper |

Protein-RNA Interactions

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
| --- | --- | --- | --- | --- |
| eCLIP RBP Binding (K562) | eclip-binding-k562 | RNA-binding protein (RBP) binding sites on mRNA sequences identified using eCLIP-seq in K562 cells. Covers ~80 different RBPs. | multilabel | paper |
| eCLIP RBP Binding (HepG2) | eclip-binding-hepg2 | RNA-binding protein (RBP) binding sites on mRNA sequences identified using eCLIP-seq in HepG2 cells. Covers ~70 different RBPs. | multilabel | paper |

Subcellular Localization

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
| --- | --- | --- | --- | --- |
| Protein Subcellular Localization | prot-loc | Subcellular localization of transcript protein products based on experimental evidence from the Human Protein Atlas. | multilabel | website |
| RNA Subcellular Localization (Fazal) | rna-loc-fazal | Subcellular localization of mRNA molecules measured using APEX-seq (proximity labeling + RNA-seq) in human cells. | multilabel | paper |
| RNA Subcellular Localization (Ietswaart) | rna-loc-ietswaart | Subcellular localization of mRNA molecules in human cells using compartment-specific RNA-seq approaches. | multilabel | paper |

Variant Effect Prediction

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
| --- | --- | --- | --- | --- |
| VEP TraitGym (Mendelian) | vep-traitgym-mendelian | Pathogenicity prediction for genetic variants in 3'UTR and 5'UTR regions associated with Mendelian diseases. | classification | paper |
| VEP TraitGym (Complex) | vep-traitgym-complex | Pathogenicity prediction for genetic variants in 3'UTR and 5'UTR regions associated with complex traits. | classification | paper |

Adding a new dataset

New datasets should inherit from BenchmarkDataset. Dataset names cannot contain underscores. Each new dataset should download raw data and process it into a dataframe by overriding process_raw_data. This dataframe should store one transcript per row, with the sequence string-encoded in the sequence column. If homology splitting is required, a gene column containing gene names is required. Six-track embedding also requires cds and splice columns. The target column can have any name, as it is specified at probing time. New datasets should be added to DATASET_CATALOG.
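
The layout rules above can be checked before wiring a dataset into the catalog. The following is a minimal sketch that works on a plain dict of columns (the same shape that could be passed to `pandas.DataFrame`) so that it runs without any dependencies. `check_dataset_layout`, the dataset names, and the example values are hypothetical and not part of mRNABench.

```python
REQUIRED_COLUMNS = {"sequence"}
HOMOLOGY_COLUMNS = {"gene"}
SIX_TRACK_COLUMNS = {"cds", "splice"}


def check_dataset_layout(name, columns, homology=False, six_track=False):
    """Return a list of layout problems (empty if none were found).

    `columns` maps column names to lists holding one entry per transcript.
    """
    problems = []
    if "_" in name:
        problems.append(f"dataset name {name!r} contains an underscore")
    needed = set(REQUIRED_COLUMNS)
    if homology:
        needed |= HOMOLOGY_COLUMNS
    if six_track:
        needed |= SIX_TRACK_COLUMNS
    for column in sorted(needed - set(columns)):
        problems.append(f"missing column {column!r}")
    if len({len(values) for values in columns.values()}) > 1:
        problems.append("columns have different numbers of rows")
    if not all(isinstance(seq, str) for seq in columns.get("sequence", [])):
        problems.append("sequence column must hold strings")
    return problems


# Made-up example data: two transcripts with a numeric target column.
columns = {
    "sequence": ["ATGGCC", "ATGTAA"],
    "gene": ["GENE1", "GENE2"],
    "target": [0.7, 1.3],
}
print(check_dataset_layout("toy-halflife", columns, homology=True))
print(check_dataset_layout("toy_halflife", columns, six_track=True))
```

The target column is deliberately not validated, since its name is chosen at probing time.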

Citation

If you use mRNABench in your research, please cite:

@article{shi_dalal_fradkin_2025_mrnabench,
    author = {Shi, Ruian and Dalal, Taykhoom and Fradkin, Philip and Koyyalagunta, Divya and Chhabria, Simran and Jung, Andrew and Tam, Cyrus and Ceyhan, Defne and Lin, Jessica and Laverty, Kaitlin U. and Baali, Ilyes and Wang, Bo and Morris, Quaid},
    title = {mRNABench: A curated benchmark for mature mRNA property and function prediction},
    elocation-id = {2025.07.05.662870},
    year = {2025},
    doi = {10.1101/2025.07.05.662870},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/07/08/2025.07.05.662870},
    eprint = {https://www.biorxiv.org/content/early/2025/07/08/2025.07.05.662870.full.pdf},
    journal = {bioRxiv}
}

The original sources for each dataset and model should be cited if used, and can be found above. A substantial number of model implementations use the multimolecule package (https://huggingface.co/multimolecule); citation information can be found on their HuggingFace page.

Evo2 Setup

Inference using Evo2 requires installing the following in its own environment. Note: the evo2_40b models, when downloaded, may have their merged checkpoints stored one directory above the HuggingFace hub cache. If so, manually move the checkpoint into its corresponding snapshot directory: /hub/models--arcinstitute-evo2_40b*/snapshots/snapshot_name/

Hardware Requirements: Evo2 can only be run on H100 GPUs.

conda create --name evo_bench -c conda-forge python=3.11 gxx=12.2.0 -y
conda activate evo_bench

pip install torch==2.6.0+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install vtx==1.0.4
pip install evo2==0.2.0
pip install flash-attn==2.7.4.post1

cd path/to/mRNA/bench
pip install -e .

Dev Mode Setup

Dev mode requires additional dependencies for generating datasets from scratch and accessing the RNA-Fazal localization dataset.

conda create --name mrna_bench_dev python=3.10
conda activate mrna_bench_dev

# Install genome-kit first
conda install -c conda-forge genome_kit=7.1.0
conda install -c conda-forge gcc_linux-64 gxx_linux-64
# Note: You might need to add gcc compilers to LD_LIBRARY_PATH if you encounter linking issues

pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install "mrna-bench[base_models]"