# RiNALMo
| Property | Value |
| --- | --- |
| Parameter Count | 651M |
| Model Type | BERT-style MLM |
| License | AGPL-3.0 |
| Paper | arXiv:2403.00043 |
| Architecture | 33 layers, 1280 hidden size, 20 heads |
## What is RiNALMo?
RiNALMo is a pre-trained language model for non-coding RNA (ncRNA) sequence analysis. It uses a BERT-style encoder trained with masked language modeling on 36 million unique ncRNA sequences, learning representations of RNA sequence patterns that can be reused for downstream prediction tasks.
## Implementation Details
The model uses a deep encoder with 33 layers, a hidden size of 1280, and 20 attention heads. It was pre-trained on 7 NVIDIA A100 GPUs using a curated dataset combining sequences from the RNAcentral, Rfam, Ensembl Genome Browser, and Nucleotide databases.
- Pre-training uses 15% token masking with specialized replacement strategies
- Implements sequence clustering for diverse batch sampling
- Supports maximum sequence length of 1022 tokens
- Includes specialized preprocessing for RNA sequences (U/T conversion); a minimal preprocessing sketch follows this list
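The exact preprocessing pipeline lives in the original codebase; the following is only a minimal sketch of the steps listed above (uppercasing, T/U conversion, and truncation to the 1022-token limit). The function name and the direction of the T-to-U mapping are assumptions for illustration.

```python
# Minimal preprocessing sketch (assumption: DNA-style T bases are mapped to U,
# and sequences are truncated to the model's 1022-token limit before special
# tokens are added; check the official code for the exact behavior).
MAX_TOKENS = 1022

def preprocess_rna(seq: str, max_tokens: int = MAX_TOKENS) -> str:
    """Normalize a raw RNA/DNA sequence before tokenization."""
    seq = seq.strip().upper()          # canonical uppercase nucleotides
    seq = seq.replace("T", "U")        # the U/T conversion mentioned above
    return seq[:max_tokens]            # enforce the maximum sequence length

print(preprocess_rna("acguTTGCa"))     # -> "ACGUUUGCA"
```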
## Core Capabilities
- Masked language modeling for RNA sequences
- Feature extraction for downstream tasks (see the hedged usage sketch after this list)
- Sequence-level classification and regression
- Nucleotide-level prediction
- Contact prediction for RNA structure analysis
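As a concrete illustration of masked-token prediction and embedding extraction, here is a hedged sketch using a Hugging Face-style interface such as the one provided by the multimolecule package. The checkpoint ID `multimolecule/rinalmo` and the class names `RnaTokenizer` / `RiNALMoForMaskedLM` are assumptions based on that package's conventions, not something stated in this card.

```python
# Hedged sketch: assumes RiNALMo is exposed through a Hugging Face-style API
# (e.g. the multimolecule package); class and checkpoint names are assumptions.
import torch
from multimolecule import RnaTokenizer, RiNALMoForMaskedLM

tokenizer = RnaTokenizer.from_pretrained("multimolecule/rinalmo")
model = RiNALMoForMaskedLM.from_pretrained("multimolecule/rinalmo")
model.eval()

seq = "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"   # one masked nucleotide
inputs = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Nucleotide-level prediction at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = outputs.logits[0, mask_index].argmax(-1)
print("Predicted token:", tokenizer.decode(predicted_id))

# Token-level embeddings for downstream feature extraction (last hidden layer)
embeddings = outputs.hidden_states[-1]       # (batch, length, 1280)
print("Embedding shape:", tuple(embeddings.shape))
```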
## Frequently Asked Questions
**Q: What makes this model unique?**
RiNALMo combines a specialized focus on RNA sequences with pre-training across several large, diverse RNA databases, which makes it particularly effective for structure-related prediction tasks. Its architecture and training setup produce high-quality, general-purpose representations of RNA sequences.
**Q: What are the recommended use cases?**
The model is well suited to RNA sequence analysis tasks, including structure prediction, sequence classification, and feature extraction. It can also be fine-tuned for specific downstream tasks in RNA research and analysis; a hedged fine-tuning sketch follows below.
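To make the fine-tuning suggestion concrete, below is a minimal sketch that freezes the encoder and trains a small classification head on mean-pooled embeddings. The `RiNALMoModel` / `RnaTokenizer` classes and checkpoint ID are the same assumptions as in the earlier sketch, and the binary labels are purely illustrative.

```python
# Hedged fine-tuning sketch: frozen encoder + small classification head.
# Class names and checkpoint ID are assumptions; the labels are illustrative.
import torch
import torch.nn as nn
from multimolecule import RnaTokenizer, RiNALMoModel

tokenizer = RnaTokenizer.from_pretrained("multimolecule/rinalmo")
encoder = RiNALMoModel.from_pretrained("multimolecule/rinalmo")
for p in encoder.parameters():        # freeze the pre-trained encoder
    p.requires_grad = False

head = nn.Linear(1280, 2)             # hidden size 1280 -> 2 classes
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

sequences = ["GGUCUCUCUGGUUAGACCAG", "ACGUACGUACGUACGU"]
labels = torch.tensor([1, 0])         # illustrative sequence-level labels

batch = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state      # (batch, length, 1280)

# Mean-pool over non-padding positions, then train the head for one step.
mask = batch["attention_mask"].unsqueeze(-1)
pooled = (hidden * mask).sum(1) / mask.sum(1)
loss = loss_fn(head(pooled), labels)
loss.backward()
optimizer.step()
print("loss:", loss.item())
```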