# RiNALMo
| Property | Value |
| --- | --- |
| Parameter Count | 651M |
| Model Type | BERT-style MLM |
| License | AGPL-3.0 |
| Paper | arXiv:2403.00043 |
| Architecture | 33 layers, 1280 hidden size, 20 heads |
## What is RiNALMo?
RiNALMo is a pre-trained language model for non-coding RNA (ncRNA) sequence analysis. It uses a BERT-style encoder trained with masked language modeling on 36 million unique ncRNA sequences, learning representations of RNA sequence patterns that can be reused for downstream prediction tasks.
## Implementation Details
The model uses a deep encoder with 33 layers, a hidden size of 1280, and 20 attention heads. It was pre-trained on 7 NVIDIA A100 GPUs using a curated dataset combining sequences from the RNAcentral, Rfam, Ensembl Genome Browser, and Nucleotide databases.
- Pre-training uses 15% token masking with specialized replacement strategies
- Implements sequence clustering for diverse batch sampling
- Supports maximum sequence length of 1022 tokens
- Includes specialized preprocessing for RNA sequences (U/T conversion); a minimal preprocessing sketch follows this list
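The exact preprocessing pipeline lives in the original codebase; the following is only a minimal sketch of the steps listed above (uppercasing, T/U conversion, and truncation to the 1022-token limit). The function name and the direction of the T-to-U mapping are assumptions for illustration.

```python
# Minimal preprocessing sketch (assumption: DNA-style T bases are mapped to U,
# and sequences are truncated to the model's 1022-token limit before special
# tokens are added; check the official code for the exact behavior).
MAX_TOKENS = 1022

def preprocess_rna(seq: str, max_tokens: int = MAX_TOKENS) -> str:
    """Normalize a raw RNA/DNA sequence before tokenization."""
    seq = seq.strip().upper()          # canonical uppercase nucleotides
    seq = seq.replace("T", "U")        # the U/T conversion mentioned above
    return seq[:max_tokens]            # enforce the maximum sequence length

print(preprocess_rna("acguTTGCa"))     # -> "ACGUUUGCA"
```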
## Core Capabilities
- Masked language modeling for RNA sequences
- Feature extraction for downstream tasks (see the hedged usage sketch after this list)
- Sequence-level classification and regression
- Nucleotide-level prediction
- Contact prediction for RNA structure analysis
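As a concrete illustration of masked-token prediction and embedding extraction, here is a hedged sketch using a Hugging Face-style interface such as the one provided by the multimolecule package. The checkpoint ID `multimolecule/rinalmo` and the class names `RnaTokenizer` / `RiNALMoForMaskedLM` are assumptions based on that package's conventions, not something stated in this card.

```python
# Hedged sketch: assumes RiNALMo is exposed through a Hugging Face-style API
# (e.g. the multimolecule package); class and checkpoint names are assumptions.
import torch
from multimolecule import RnaTokenizer, RiNALMoForMaskedLM

tokenizer = RnaTokenizer.from_pretrained("multimolecule/rinalmo")
model = RiNALMoForMaskedLM.from_pretrained("multimolecule/rinalmo")
model.eval()

seq = "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"   # one masked nucleotide
inputs = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Nucleotide-level prediction at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = outputs.logits[0, mask_index].argmax(-1)
print("Predicted token:", tokenizer.decode(predicted_id))

# Token-level embeddings for downstream feature extraction (last hidden layer)
embeddings = outputs.hidden_states[-1]       # (batch, length, 1280)
print("Embedding shape:", tuple(embeddings.shape))
```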
## Frequently Asked Questions
**Q: What makes this model unique?**
RiNALMo combines a specialized focus on RNA sequences with pre-training across several large, diverse RNA databases, which makes it particularly effective for structure-related prediction tasks. Its architecture and training setup produce high-quality, general-purpose representations of RNA sequences.
**Q: What are the recommended use cases?**
The model is well suited to RNA sequence analysis tasks, including structure prediction, sequence classification, and feature extraction. It can also be fine-tuned for specific downstream tasks in RNA research and analysis; a hedged fine-tuning sketch follows below.
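To make the fine-tuning suggestion concrete, below is a minimal sketch that freezes the encoder and trains a small classification head on mean-pooled embeddings. The `RiNALMoModel` / `RnaTokenizer` classes and checkpoint ID are the same assumptions as in the earlier sketch, and the binary labels are purely illustrative.

```python
# Hedged fine-tuning sketch: frozen encoder + small classification head.
# Class names and checkpoint ID are assumptions; the labels are illustrative.
import torch
import torch.nn as nn
from multimolecule import RnaTokenizer, RiNALMoModel

tokenizer = RnaTokenizer.from_pretrained("multimolecule/rinalmo")
encoder = RiNALMoModel.from_pretrained("multimolecule/rinalmo")
for p in encoder.parameters():        # freeze the pre-trained encoder
    p.requires_grad = False

head = nn.Linear(1280, 2)             # hidden size 1280 -> 2 classes
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

sequences = ["GGUCUCUCUGGUUAGACCAG", "ACGUACGUACGUACGU"]
labels = torch.tensor([1, 0])         # illustrative sequence-level labels

batch = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state      # (batch, length, 1280)

# Mean-pool over non-padding positions, then train the head for one step.
mask = batch["attention_mask"].unsqueeze(-1)
pooled = (hidden * mask).sum(1) / mask.sum(1)
loss = loss_fn(head(pooled), labels)
loss.backward()
optimizer.step()
print("loss:", loss.item())
```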