This repository contains the code and pre-trained models for ERNIE-RNA, an RNA language model for feature extraction and secondary structure prediction. In our evaluation, ERNIE-RNA outperforms the other RNA feature extraction models we tested (including RNA-FM) on feature extraction tasks, and it predicts secondary structure more accurately than RNAfold, UNI-RNA, and other methods. For more details, see our paper, ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations.
First, download the repository and create the environment.
git clone https://github.com/Bruce-ywj/ERNIE-RNA.git
cd ./ERNIE-RNA
conda env create -f environment.yml
Then, activate the "ERNIE-RNA" environment.
conda activate ERNIE-RNA
On Windows (CPU environment), the fairseq library depends on C++ code, so you need to install Microsoft C++ Build Tools first:
https://visualstudio.microsoft.com/visual-cpp-build-tools/
Open the installer and check "Desktop development with C++" on the left panel.
Then run:
conda env create -f environment_CPU_windows.yml
On macOS (CPU environment), run:
conda env create -f environment_CPU_mac.yml
At this step, the torch version is not pinned, so a recent release may be installed.
In newer versions the default arguments of torch.load have changed (weights_only now defaults to True), which can break checkpoint loading. You can either downgrade torch to version 1.10.0 after the environment is installed, or add the following patch right after import torch:
import torch as _torch

__orig_load = _torch.load

def _patched_load(*args, **kwargs):
    # Restore the pre-2.6 default (weights_only=False) unless the caller sets it explicitly
    kwargs.setdefault("weights_only", False)
    return __orig_load(*args, **kwargs)

_torch.load = _patched_load
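If you would rather apply the patch only where it is needed, a version check can gate it. The following is a minimal sketch and not part of the repository: `needs_weights_only_patch` is a hypothetical helper name, and it assumes that the `weights_only` default switched to `True` in PyTorch 2.6.

```python
import re


def needs_weights_only_patch(torch_version: str) -> bool:
    """Return True if this torch version is assumed to default to weights_only=True.

    Assumption: the default of torch.load(weights_only=...) changed in PyTorch 2.6.
    """
    # Version strings may carry suffixes such as "2.6.0+cu124"; use major.minor only.
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"Unrecognised torch version: {torch_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (2, 6)


if __name__ == "__main__":
    print(needs_weights_only_patch("1.10.0"))       # False
    print(needs_weights_only_patch("2.6.0+cu124"))  # True
```

In practice you would pass `torch.__version__` to the helper and install the patch above only when it returns `True`.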
The model folder contains two subfolders, each with a download link; download each model from its link into the same subfolder. Alternatively, you can download both models from our drive.
python extract_embedding.py --seqs_path='./data/test_seqs.txt' --device='cuda:0'
The model path parameters are set by default and do not need to be changed.
The corresponding feature extraction code is in extract_embedding.py, and the sequences in the input file can be changed as needed.
The script uses ERNIE-RNA (twod_mlm) for feature extraction.
The extracted features include cls, tokens, and atten_map.
ERNIE-RNA provides RNA secondary structure prediction with model parameters fine-tuned on several training datasets. Each run outputs both the fine-tuned model's prediction and a zero-shot prediction.
python predict_ss_rna.py --dataset_name bpRNA-1m --seqs_path={fasta_dir} --save_path={output_dir} --device=0
- --seqs_path: Path to the FASTA file containing RNA sequences
- --save_path: Directory path for the output CT files
- --dataset_name: Name of the RNA structure fine-tuning dataset, used to automatically select the corresponding model parameter file
- --ss_rna_checkpoint: Path to the fine-tuned model parameter file (required when not using --dataset_name)
- --device: GPU device ID (0, 1, 2, ...) or -1 for CPU
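When predictions are launched from another Python script, it can help to assemble the command from the documented flags rather than by string concatenation. The sketch below is illustrative only: `build_ss_command` is a hypothetical helper, and it assumes the command is run from the repository root, where predict_ss_rna.py lives.

```python
import shlex
import sys


def build_ss_command(dataset_name, seqs_path, save_path, device=0):
    """Assemble the argument list for predict_ss_rna.py from the documented flags."""
    return [
        sys.executable, "predict_ss_rna.py",
        "--dataset_name", dataset_name,
        "--seqs_path", seqs_path,
        "--save_path", save_path,
        "--device", str(device),  # GPU ID, or -1 for CPU
    ]


if __name__ == "__main__":
    cmd = build_ss_command(
        "bpRNA-1m",
        "./data/ss_prediction/bpRNA-1m_testseqs.fasta",
        "./results/ernie_rna_ss_prediction/bpRNA-1m_test_results/",
    )
    # Print the shell-quoted command; to execute it, use subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids quoting problems when paths contain spaces.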
Training Dataset | Model Parameter File | Application Scenario |
---|---|---|
bpRNA-1m | ERNIE-RNA_attn-map_ss_prediction_bpRNA-1m_checkpoint.pt | General RNA structure prediction (bpRNA-1m here refers to bpRNA-1m(80)) |
RNAStralign | ERNIE-RNA_attn-map_ss_prediction_RNAStralign_checkpoint.pt | General RNA structure prediction |
RIVAS | ERNIE-RNA_attn-map_ss_prediction_RIVAS_checkpoint.pt | Reproduction of RIVAS results |
RNA3DB | ERNIE-RNA_attn-map_ss_prediction_RNA3DB_checkpoint.pt | Reproduction of RNA3DB-2D results |
bpRNA-new | ERNIE-RNA_attn-map_frozen_ss_prediction_bpRNA-1m_checkpoint.pt | Novel RNA structure prediction (Note: this is the ERNIE-RNA attn-map frozen model trained on the bpRNA-1m dataset; bpRNA-new is not used as a training set) |
bpRNA-1m_RNAstralign | ERNIE-RNA_attn-map_ss_prediction_bpRNA-1m-all_and_RNAStralign_checkpoint.pt | General RNA structure prediction (Note: trained on all bpRNA-1m sequences and the RNAStralign training-set sequences, excluding the validation/test sequences of other datasets, e.g. RIVAS and RNA3DB) |
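The script selects the parameter file from --dataset_name automatically, so the lookup below is not required for normal use. It is a small sketch that mirrors the table as a Python mapping, which may be handy for checking a dataset name in your own tooling; `CHECKPOINTS` and `checkpoint_for` are hypothetical names, not part of the repository.

```python
# Mapping copied from the table above: dataset_name -> model parameter file.
CHECKPOINTS = {
    "bpRNA-1m": "ERNIE-RNA_attn-map_ss_prediction_bpRNA-1m_checkpoint.pt",
    "RNAStralign": "ERNIE-RNA_attn-map_ss_prediction_RNAStralign_checkpoint.pt",
    "RIVAS": "ERNIE-RNA_attn-map_ss_prediction_RIVAS_checkpoint.pt",
    "RNA3DB": "ERNIE-RNA_attn-map_ss_prediction_RNA3DB_checkpoint.pt",
    "bpRNA-new": "ERNIE-RNA_attn-map_frozen_ss_prediction_bpRNA-1m_checkpoint.pt",
    "bpRNA-1m_RNAstralign": (
        "ERNIE-RNA_attn-map_ss_prediction_bpRNA-1m-all_and_RNAStralign_checkpoint.pt"
    ),
}


def checkpoint_for(dataset_name):
    """Return the parameter file name listed for a dataset_name, or raise ValueError."""
    try:
        return CHECKPOINTS[dataset_name]
    except KeyError:
        valid = ", ".join(sorted(CHECKPOINTS))
        raise ValueError(f"Unknown dataset_name {dataset_name!r}; expected one of: {valid}") from None


if __name__ == "__main__":
    print(checkpoint_for("bpRNA-new"))
```

Note that bpRNA-new maps to a checkpoint trained on bpRNA-1m, as the table explains.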
For each input sequence, ERNIE-RNA generates two structure files in CT format:
- {sequence_name}_finetune_prediction.ct: Prediction results from the model fine-tuned on the specified training dataset
- {sequence_name}_zeroshot_prediction.ct: Zero-shot prediction results using the pre-trained model (without fine-tuning)
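To inspect the CT output programmatically, a small parser is usually enough. The sketch below assumes the common CT column layout (index, base, previous index, next index, pairing partner with 0 for unpaired, original index) and a single header line; the files written by predict_ss_rna.py may differ in header details, so treat this as a starting point. `parse_ct` and `to_dot_bracket` are hypothetical helper names.

```python
def parse_ct(text):
    """Parse CT-format text into (sequence, sorted list of 1-based base pairs)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    bases, pairs = [], set()
    for ln in lines[1:]:  # skip the header line (length and title)
        cols = ln.split()
        index, base, partner = int(cols[0]), cols[1], int(cols[4])
        bases.append(base)
        if partner > index:  # record each pair once, as (i, j) with i < j
            pairs.add((index, partner))
    return "".join(bases), sorted(pairs)


def to_dot_bracket(length, pairs):
    """Render nested pairs as dot-bracket; crossing pairs (pseudoknots) are not distinguished."""
    out = ["."] * length
    for i, j in pairs:
        out[i - 1], out[j - 1] = "(", ")"
    return "".join(out)


EXAMPLE_CT = """9 example
1 G 0 2 9 1
2 G 1 3 8 2
3 G 2 4 7 3
4 A 3 5 0 4
5 A 4 6 0 5
6 A 5 7 0 6
7 C 6 8 3 7
8 C 7 9 2 8
9 C 8 0 1 9
"""

if __name__ == "__main__":
    seq, pairs = parse_ct(EXAMPLE_CT)
    print(seq)                             # GGGAAACCC
    print(to_dot_bracket(len(seq), pairs)) # (((...)))
```

For a real file, read its contents with `open(path).read()` and pass the text to `parse_ct`.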
Note: Regardless of which dataset_name is selected, the script also outputs the zero-shot prediction. The zero-shot model has not been fine-tuned on any RNA structure training set, so its output should be the same whichever dataset_name you choose.
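Because every run produces both a fine-tuned and a zero-shot structure, one quick sanity check is to measure how much the two agree. The sketch below computes the Jaccard overlap of two base-pair sets; it is an illustration rather than a metric used by the repository, and `pair_agreement` is a hypothetical helper name. Pairs are assumed to be (i, j) tuples with i < j.

```python
def pair_agreement(pairs_a, pairs_b):
    """Jaccard overlap between two sets of base pairs (1.0 means identical)."""
    a, b = set(pairs_a), set(pairs_b)
    if not a and not b:
        return 1.0  # two fully unpaired structures agree trivially
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    finetune = [(1, 9), (2, 8), (3, 7)]
    zeroshot = [(1, 9), (2, 8)]
    print(round(pair_agreement(finetune, zeroshot), 3))  # 0.667
```

A low value suggests the fine-tuned and zero-shot models disagree on the sequence, which may be worth a closer look for sequences from unfamiliar RNA families.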
- For sequences from common Rfam families or RNA families included in the bpRNA-1m and RNAStralign training sets, we recommend using the bpRNA-1m_RNAstralign, bpRNA-1m, or RNAStralign parameters.
- For sequences that may belong to unknown RNA families, we recommend trying the bpRNA-new or RNA3DB parameters, or referring to the zero-shot prediction results output alongside each fine-tuned model's predictions.
- Prediction using bpRNA-1m training set parameters:
python predict_ss_rna.py --dataset_name bpRNA-1m --device 0 --seqs_path ./data/ss_prediction/bpRNA-1m_testseqs.fasta --save_path ./results/ernie_rna_ss_prediction/bpRNA-1m_test_results/
- Prediction using RNA3DB training set parameters:
python predict_ss_rna.py --dataset_name RNA3DB --device 0 --seqs_path ./data/ss_prediction/rna3db_testseqs.fasta --save_path ./results/ernie_rna_ss_prediction/rna3db_test_results/
- Prediction using the parameters trained on bpRNA-1m that performed best on the bpRNA-new test set:
python predict_ss_rna.py --dataset_name bpRNA-new --device 0 --seqs_path ./data/ss_prediction/bpRNA-new_testseqs.fasta --save_path ./results/ernie_rna_ss_prediction/bpRNA-new_test_results/
This section describes how to use ERNIE-RNA to predict RNA 3D closeness maps. This functionality relies on the pre-trained ERNIE-RNA model as a feature extractor and a downstream model head specifically fine-tuned for the 3D closeness task. The recommended downstream model architecture is based on ERNIE-RNA's attention maps.
Usage Example:
To predict 3D closeness for RNA sequences in a FASTA file and visualize the results:
python predict_3d_clossness.py \
--input_rna_file ./results/ernie_rna_3d_clossness/example.fasta \
--device cuda:0 \
--visualize
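Before running a prediction it can be useful to confirm that the FASTA input parses cleanly. The sketch below is a generic FASTA reader with a simple alphabet check; it is not part of the repository, `read_fasta` and `unexpected_characters` are hypothetical names, and the default alphabet ACGU is an assumption, since the characters accepted by the prediction scripts are not stated here.

```python
def read_fasta(text):
    """Parse FASTA text into {record_name: sequence}, joining wrapped lines."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header as the name; fall back to a counter.
            name = header[0] if header else f"seq{len(records) + 1}"
            records[name] = ""
        elif name is None:
            raise ValueError("Sequence data found before the first '>' header")
        else:
            records[name] += line.upper()
    return records


def unexpected_characters(sequence, alphabet="ACGU"):
    """Return the characters in sequence that are outside the assumed alphabet."""
    return sorted(set(sequence) - set(alphabet))


if __name__ == "__main__":
    records = read_fasta(">r1 example\nGGGA\naacc\n>r2\nACGT\n")
    for name, seq in records.items():
        print(name, seq, unexpected_characters(seq))
```

Here the second record would be flagged for containing T, which tells you to check whether your input uses DNA-style letters.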
This section describes how to use ERNIE-RNA to predict mean ribosome loading (MRL) for 5' UTR RNA sequences, a key measure of translation efficiency.
python predict_MRL.py \
--data_roots ./data/MRL_data/seqs.fasta \
--device 0
- --data_roots: Path to the input FASTA file containing 5' UTR sequences (default: ./data/MRL_data/seqs.fasta)
- --bert_path: Path to the ERNIE-RNA pre-trained model checkpoint (default: ./checkpoint/ERNIE-RNA_checkpoint/ERNIE-RNA_pretrain.pt)
- --model_root: Path to the fine-tuned MRL prediction model weights (default: ./checkpoint/ERNIE-RNA_UTR_MRL_checkpoint/ERNIE-RNA-UTR_ML_CNN_checkpoint.pt)
- --scaler_root: Path to the scaler file used for normalization (default: ./checkpoint/ERNIE-RNA_UTR_MRL_checkpoint/scaler.save)
- --output_dir: Directory in which to save prediction results (default: ./results/ernie_rna_utr_mrl)
- --device: GPU device ID to use (default: 0; use -1 for CPU)
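Since the MRL script relies on several default input files, a quick pre-flight check can tell you which ones are still missing before you start a run. The sketch below is not part of the repository: `MRL_DEFAULT_INPUTS` simply copies the documented default paths, and `missing_inputs` is a hypothetical helper that assumes the paths are relative to the repository root.

```python
from pathlib import Path

# Default input paths as documented above (relative to the repository root).
MRL_DEFAULT_INPUTS = {
    "--data_roots": "./data/MRL_data/seqs.fasta",
    "--bert_path": "./checkpoint/ERNIE-RNA_checkpoint/ERNIE-RNA_pretrain.pt",
    "--model_root": "./checkpoint/ERNIE-RNA_UTR_MRL_checkpoint/ERNIE-RNA-UTR_ML_CNN_checkpoint.pt",
    "--scaler_root": "./checkpoint/ERNIE-RNA_UTR_MRL_checkpoint/scaler.save",
}


def missing_inputs(inputs=MRL_DEFAULT_INPUTS, root="."):
    """Return the flags whose files do not exist under root."""
    return [flag for flag, rel in inputs.items() if not (Path(root) / rel).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing files for:", ", ".join(missing))
    else:
        print("All default MRL inputs are present.")
```

If a flag is reported as missing, either download the corresponding file or pass the flag explicitly with the correct path.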
If you find the models useful in your research, please cite our work:
ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations
Yin W, Zhang Z, He L, et al. ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations[J]. bioRxiv, 2024: 2024.03.17.585376.
We used the fairseq sequence modeling framework to train our RNA language model. We greatly appreciate this excellent work!
This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.