This repository contains the code for mRNABench, which benchmarks the embedding quality of genomic foundation models on mRNA-specific tasks. mRNABench provides a catalogue of datasets and training-split logic that can be used to evaluate the embedding quality of several catalogued models.
- Paper: BioRxiv Link
- Notebook Example: Colab Notebook
- Dataset Repository: HuggingFace Collection
Several configurations of the mRNABench are available.
If you are interested in the benchmark datasets only, you can run:
```bash
pip install mrna-bench
```
> [!IMPORTANT]
> Requirements: PyTorch 2.2.2 and CUDA 12.1+ are required for the `base_models` installation.
The inference-capable version of mRNABench, which can generate embeddings using most models (except Evo2 and Helix mRNA), can be installed as shown below.
```bash
conda create --name mrna_bench python=3.10
conda activate mrna_bench

pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install mrna-bench[base_models]
```
Inference with other models will require the installation of the model's dependencies first, which are usually listed on the model's GitHub page (see below).
> [!IMPORTANT]
> After installation, run the following in Python to set where data associated with the benchmarks will be stored.
```python
import mrna_bench as mb

path_to_dir_to_store_data = "DESIRED_PATH"
mb.update_data_path(path_to_dir_to_store_data)

path_to_dir_to_store_weights = "DESIRED_PATH_FOR_MODEL_WEIGHTS"
mb.update_model_weights_path(path_to_dir_to_store_weights)
```
Evo2 requires more complicated installation instructions, and can only be run on H100s. See Evo2 Setup.
Dev mode allows generation of datasets from scratch and includes access to the RNA-Fazal localization dataset. See Dev Mode Setup.
Datasets can be retrieved using:
```python
import mrna_bench as mb

dataset = mb.load_dataset("go-mf")
data_df = dataset.data_df
```
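The returned `data_df` is a dataframe that stores one transcript per row, with the sequence string-encoded in the `sequence` column (see the dataset contribution notes below). A quick, illustrative way to inspect a dataset; column names other than `sequence` vary by dataset:

```python
# Inspect the loaded dataset; one transcript per row.
print(data_df.shape)
print(data_df.columns.tolist())
print(data_df["sequence"].str.len().describe())  # transcript length statistics
```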
The mRNABench can also be used to test out common genomic foundation models:
```python
import torch

import mrna_bench as mb
from mrna_bench.embedder import DatasetEmbedder
from mrna_bench.linear_probe import LinearProbeBuilder

device = torch.device("cuda")

dataset = mb.load_dataset("go-mf")
model = mb.load_model("Orthrus", "orthrus-large-6-track", device)

embedder = DatasetEmbedder(model, dataset)
embeddings = embedder.embed_dataset()
embeddings = embeddings.detach().cpu().numpy()

prober = (
    LinearProbeBuilder(dataset)
    .fetch_embedding_by_embedding_instance("orthrus-large-6", embeddings)
    .build_splitter("homology", species="human", eval_all_splits=False)
    .build_evaluator("multilabel")
    .set_target("target")
    .build()
)

metrics = prober.run_linear_probe(2541)
print(metrics)
```
Also see the `scripts/` folder for example scripts that use Slurm to embed dataset chunks in parallel for reduced runtime, as well as an example of multi-seed linear probing; a minimal sketch of the latter follows.
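This sketch makes two assumptions: that the positional argument of `run_linear_probe` is a random seed (as in the example above), and that the returned metrics are a mapping from metric name to value:

```python
import numpy as np

# Re-run the linear probe across several random seeds.
seeds = [0, 1, 2, 3, 4]
runs = [prober.run_linear_probe(seed) for seed in seeds]

# Aggregate each metric across seeds, e.g. to report mean and std.
summary = {
    name: (np.mean([run[name] for run in runs]),
           np.std([run[name] for run in runs]))
    for name in runs[0]
}
print(summary)
```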
The models supported by the `base_models` installation are catalogued below.
| Model Name | Model Versions | Description | Citation |
|---|---|---|---|
| Orthrus | `orthrus-large-6-track`<br>`orthrus-base-4-track` | Mamba-based RNA foundation model pre-trained using contrastive learning on 45M RNA transcripts to capture functional and evolutionary relationships. The 6-track version incorporates CDS and splice site information. | [Code] [Paper] |
| RNA-FM | `rna-fm`<br>`mrna-fm` | Transformer-based RNA foundation model pre-trained using MLM. RNA-FM is trained on 23M ncRNA sequences; mRNA-FM is trained on mRNA CDS regions using a codon tokenizer. | [Github] |
| SpliceBERT | `SpliceBERT.1024nt`<br>`SpliceBERT-human.510nt`<br>`SpliceBERT.510nt` | Transformer-based RNA foundation model trained on 2M vertebrate mRNA sequences using MLM. Specialized for splice site prediction, with human-only and context-length variants. | [Github] |
| RiNALMo | `rinalmo` | Transformer-based RNA foundation model trained on 36M ncRNA sequences using MLM with modern architectural improvements including RoPE, SwiGLU activations, and Flash Attention. | [Github] |
| UTR-LM | `utrlm-te_el`<br>`utrlm-mrl`<br>`utrlm-*-utronly` | Transformer-based RNA foundation model specialized for 5'UTR sequences. Pre-trained on random and endogenous UTR sequences from various species. UTR-only variants automatically extract 5'UTRs. | [Github] |
| 3UTRBERT | `utrbert-3mer`<br>`utrbert-4mer`<br>`utrbert-5mer`<br>`utrbert-6mer`<br>`utrbert-*-utronly` | Transformer-based RNA foundation model specialized for 3'UTR regions. Uses k-mer tokenization (3-6mers) and is trained on 100k 3'UTR sequences. UTR-only variants automatically extract 3'UTRs. | [Github] |
| RNA-MSM | `rnamsm` | Structure-aware RNA foundation model trained using multiple sequence alignments from custom structure-based homology mapping across ~4000 RNA families. | [Github] |
| RNAErnie | `rnaernie` | Transformer-based RNA foundation model trained using MLM with a motif-level masking strategy on 23M ncRNA sequences. Uses contiguous token masking to learn RNA motifs. | [Github] |
| ERNIE-RNA | `ernierna`<br>`ernierna-ss` | Transformer-based RNA foundation model with structural attention bias. Trained on 20M ncRNA sequences with custom attention incorporating RNA base-pairing rules. The SS version is fine-tuned on structural tasks. | [Github] |
| RNABERT | `rnabert` | Transformer-based RNA foundation model with dual training objectives combining MLM and structural alignment learning. Trained on 80k ncRNA sequences. | [Github] |
| CodonBERT | `codonbert` | Transformer-based RNA foundation model trained on 10M+ mRNA sequences from mammals, bacteria, and viruses. Specialized for coding regions and mRNA properties. | [Github] |
| Helix-mRNA | `helix-mrna` | Hybrid Mamba2/Transformer model trained on 26M diverse eukaryotic and viral mRNAs. Features CDS-aware tokenization with special tokens at codon boundaries. | [Github] |

| Model Name | Model Versions | Description | Citation |
|---|---|---|---|
| DNABERT2 | `dnabert2` | Modern Transformer-based DNA foundation model with BPE tokenization and rotary positional encoding. Pre-trained using MLM on multi-species genomic datasets. | [Github] |
| DNABERT-S | `dnabert-s` | Species-aware DNA foundation model trained with contrastive learning to encourage species grouping while discouraging cross-species associations. Covers microbial genomes including viruses, fungi, and bacteria. | [Github] |
| Nucleotide Transformer | `2.5b-multi-species`<br>`2.5b-1000g`<br>`500m-human-ref`<br>`500m-1000g`<br>`v2-50m-multi-species`<br>`v2-100m-multi-species`<br>`v2-250m-multi-species`<br>`v2-500m-multi-species` | Transformer-based DNA foundation model family with 6-mer tokenization. Available in multiple sizes (50M-2.5B parameters) trained on various genomic datasets, from the human reference genome to multi-species collections. | [Github] |
| HyenaDNA | `hyenadna-large-1m-seqlen-hf`<br>`hyenadna-medium-450k-seqlen-hf`<br>`hyenadna-medium-160k-seqlen-hf`<br>`hyenadna-small-32k-seqlen-hf`<br>`hyenadna-tiny-16k-seqlen-d128-hf` | Hyena-based DNA foundation model with near-linear scaling and ultra-long context capability. Pre-trained using next-token prediction on the human reference genome, with various model sizes and sequence lengths. | [Github] |
| Evo1 | `evo-1.5-8k-base`<br>`evo-1-8k-base`<br>`evo-1-131k-base` | StripedHyena-based DNA foundation model trained autoregressively on the OpenGenome dataset at single-nucleotide, byte-level resolution. Offers near-linear scaling with ultra-long context variants up to 131k nucleotides. | [Github] |
| Evo2 | `evo2_40b`<br>`evo2_40b_base`<br>`evo2_7b`<br>`evo2_7b_base`<br>`evo2_1b_base` | Next-generation StripedHyena2-based DNA foundation model trained on the OpenGenome2 dataset. Provides multi-layer embeddings with ultra-long context capability up to 1M nucleotides for large model variants. | [Github] |

| Model Name | Model Versions | Description | Citation |
|---|---|---|---|
| NaiveBaseline | `naive-4-track`<br>`naive-6-track` | Non-neural baseline using traditional sequence features including k-mer counts (3-7mers), GC content, and sequence statistics. The 6-track version adds CDS length and exon count from structural annotations. | N/A |
| NaiveMamba | `naive-mamba` | Randomly initialized Mamba model serving as an untrained baseline. Uses 6-track input (sequence + CDS + splice information) with a fixed random seed for reproducible comparisons. | N/A |

> [!NOTE]
> Many of the model wrappers (3UTRBERT, RiNALMo, UTR-LM, RNA-MSM, RNAErnie) use reimplementations from the `multimolecule` package. See their website for more details.
All models should inherit from the `EmbeddingModel` template. Each model file should lazily load its dependencies within its `__init__` method so that each model can be used without installing the dependencies of all other models. Models must implement `get_model_short_name(model_version)`, which fetches the internal name for the model; this name must be unique for every model version and must not contain underscores. Models should implement either `embed_sequence` or `embed_sequence_sixtrack` (see code for the method signatures). New models should be added to `MODEL_CATALOG`.
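As a rough illustration, a new model wrapper might look like the sketch below. The import path for `EmbeddingModel`, the constructor signature, and everything from `my_model_pkg` are assumptions made for illustration; only the documented method names come from the contract above.

```python
import torch

from mrna_bench.models import EmbeddingModel  # assumed import path


class MyRNAModel(EmbeddingModel):
    """Illustrative wrapper; names outside the documented API are hypothetical."""

    @staticmethod
    def get_model_short_name(model_version: str) -> str:
        # Internal name: must be unique per version and contain no underscores.
        return model_version.replace("_", "-")

    def __init__(self, model_version: str, device: torch.device):
        super().__init__(model_version, device)  # base signature assumed
        self.device = device
        # Lazily import dependencies so other models remain usable
        # without this model's packages installed.
        from my_model_pkg import MyBackbone, MyTokenizer  # hypothetical package
        self.tokenizer = MyTokenizer.from_pretrained(model_version)
        self.model = MyBackbone.from_pretrained(model_version).to(device)

    def embed_sequence(self, sequence: str) -> torch.Tensor:
        # Mean-pool the final hidden states into one vector per transcript.
        tokens = self.tokenizer(sequence, return_tensors="pt").to(self.device)
        with torch.no_grad():
            hidden = self.model(**tokens).last_hidden_state
        return hidden.mean(dim=1)
```

The new class would then be registered in `MODEL_CATALOG` so that `mb.load_model` can resolve it.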
The current datasets catalogued are:
| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
|---|---|---|---|---|
| GO Molecular Function | `go-mf` | Classification of the molecular function of a transcript's product as defined by the GO Resource. | `multilabel` | website |
| GO Biological Process | `go-bp` | Classification of the biological process a transcript's product participates in as defined by the GO Resource. | `multilabel` | website |
| GO Cellular Component | `go-cc` | Classification of the cellular component where a transcript's product is localized as defined by the GO Resource. | `multilabel` | website |

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
|---|---|---|---|---|
| Mean Ribosome Load (Sugimoto) | `mrl-sugimoto` | Mean ribosome load (MRL) per transcript isoform as measured in human cells using isoform-resolved ribosome profiling. | `regression` | paper |
| Mean Ribosome Load (Sample) | `mrl-sample-egfp`<br>`mrl-sample-mcherry`<br>`mrl-sample-designed`<br>`mrl-sample-varying` | Mean ribosome load (MRL) measured in an MPRA of randomized and designed 5'UTR regions attached to eGFP or mCherry reporters. Includes various RNA modifications and UTR lengths. | `regression` | paper |
| Mean Ribosome Load & Half-life | `mrl-hl-lbkwk` | Joint prediction of ribosome load and RNA half-life from synthetic mRNA sequences in the Leppek et al. dataset. | `regression` | paper |

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
|---|---|---|---|---|
| RNA Half-life (Human) | `rnahl-human` | RNA half-life of human transcripts measured using time-course RNA-seq after transcription inhibition. | `regression` | paper |
| RNA Half-life (Mouse) | `rnahl-mouse` | RNA half-life of mouse transcripts measured using time-course RNA-seq after transcription inhibition. | `regression` | paper |

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
|---|---|---|---|---|
| eCLIP RBP Binding (K562) | `eclip-binding-k562` | RNA-binding protein (RBP) binding sites on mRNA sequences identified using eCLIP-seq in K562 cells. Covers ~80 different RBPs. | `multilabel` | paper |
| eCLIP RBP Binding (HepG2) | `eclip-binding-hepg2` | RNA-binding protein (RBP) binding sites on mRNA sequences identified using eCLIP-seq in HepG2 cells. Covers ~70 different RBPs. | `multilabel` | paper |

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
|---|---|---|---|---|
| Protein Subcellular Localization | `prot-loc` | Subcellular localization of transcript protein products based on experimental evidence from the Human Protein Atlas. | `multilabel` | website |
| RNA Subcellular Localization (Fazal) | `rna-loc-fazal` | Subcellular localization of mRNA molecules measured using APEX-seq (proximity labeling + RNA-seq) in human cells. | `multilabel` | paper |
| RNA Subcellular Localization (Ietswaart) | `rna-loc-ietswaart` | Subcellular localization of mRNA molecules in human cells using compartment-specific RNA-seq approaches. | `multilabel` | paper |

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
|---|---|---|---|---|
| VEP TraitGym (Mendelian) | `vep-traitgym-mendelian` | Pathogenicity prediction for genetic variants in 3'UTR and 5'UTR regions associated with Mendelian diseases. | `classification` | paper |
| VEP TraitGym (Complex) | `vep-traitgym-complex` | Pathogenicity prediction for genetic variants in 3'UTR and 5'UTR regions associated with complex traits. | `classification` | paper |

New datasets should inherit from `BenchmarkDataset`. Dataset names cannot contain underscores. Each new dataset should download its raw data and process it into a dataframe by overriding `process_raw_data`. This dataframe should store transcripts as rows, using string encoding in the `sequence` column. If homology splitting is required, a `gene` column containing gene names is required. Six-track embedding also requires the `cds` and `splice` columns. The target column can have any name, as it is specified at probing time. New datasets should be added to `DATASET_CATALOG`.
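A new dataset might follow the sketch below; the import path, the raw-data URL, and the raw column names are placeholders, while the output column names and the `process_raw_data` override come from the contract above (whether the method returns the dataframe or assigns it internally is simplified here).

```python
import pandas as pd

from mrna_bench.datasets import BenchmarkDataset  # assumed import path


class MyDataset(BenchmarkDataset):
    """Illustrative dataset; names outside the documented API are hypothetical."""

    def process_raw_data(self) -> pd.DataFrame:
        # Download the raw data and build one row per transcript.
        raw = pd.read_csv("https://example.org/raw_measurements.csv")  # placeholder
        return pd.DataFrame({
            "sequence": raw["seq"],         # string-encoded transcript sequence
            "gene": raw["gene_name"],       # required only for homology splitting
            "cds": raw["cds_track"],        # required for six-track embedding
            "splice": raw["splice_track"],  # required for six-track embedding
            "target": raw["label"],         # target column name is arbitrary
        })
```

The dataset would then be added to `DATASET_CATALOG` under an identifier that contains no underscores.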
If you use mRNABench in your research, please cite:
```bibtex
@article{shi_dalal_fradkin_2025_mrnabench,
    author = {Shi, Ruian and Dalal, Taykhoom and Fradkin, Philip and Koyyalagunta, Divya and Chhabria, Simran and Jung, Andrew and Tam, Cyrus and Ceyhan, Defne and Lin, Jessica and Laverty, Kaitlin U. and Baali, Ilyes and Wang, Bo and Morris, Quaid},
    title = {mRNABench: A curated benchmark for mature mRNA property and function prediction},
    elocation-id = {2025.07.05.662870},
    year = {2025},
    doi = {10.1101/2025.07.05.662870},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/07/08/2025.07.05.662870},
    eprint = {https://www.biorxiv.org/content/early/2025/07/08/2025.07.05.662870.full.pdf},
    journal = {bioRxiv}
}
```
The original sources for each dataset and model should be cited if used; they can be found above. A substantial number of model implementations use the `multimolecule` package: https://huggingface.co/multimolecule; citation information can be found on their HuggingFace page.
Inference using Evo2 requires installing the following in its own environment.

Note: There may be an issue where the `evo2_40b` models, when downloaded, have their merged checkpoints stored one directory above the HuggingFace hub cache. You may need to manually move the checkpoint into its corresponding snapshot directory: `/hub/models--arcinstitute-evo2_40b*/snapshots/snapshot_name/`.
Hardware Requirements: Evo2 can only be run on H100 GPUs.
```bash
conda create --name evo_bench -c conda-forge python=3.11 gxx=12.2.0 -y
conda activate evo_bench

pip install torch==2.6.0+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install vtx==1.0.4
pip install evo2==0.2.0
pip install flash-attn==2.7.4.post1

cd path/to/mRNA/bench
pip install -e .
```
Dev mode requires additional dependencies for generating datasets from scratch and accessing the RNA-Fazal localization dataset.
```bash
conda create --name mrna_bench_dev python=3.10
conda activate mrna_bench_dev

# Install genome-kit first
conda install -c conda-forge genome_kit=7.1.0
conda install -c conda-forge gcc_linux-64 gxx_linux-64
# Note: you might need to add the gcc compilers to LD_LIBRARY_PATH if you
# encounter linking issues.

pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install mrna-bench[base_models]
```