# Caduceus DNA Sequence Model
| Property | Value |
|---|---|
| Parameter Count | 7.73M |
| License | Apache-2.0 |
| Paper | arXiv:2403.03234 |
| Sequence Length | 131,072 |
| Architecture | 16 MambaDNA layers, 256 hidden dimension |
## What is caduceus-ps_seqlen-131k_d_model-256_n_layer-16?
This is a specialized DNA sequence model developed by the Kuleshov Group, built on the MambaDNA state-space architecture (not a transformer) and designed for long-range DNA sequence analysis. The model is reverse complement (RC) equivariant by construction via parameter sharing (the "PS" in its name), eliminating the need for RC data augmentation during both pre-training and fine-tuning.
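RC equivariance can be illustrated with a toy per-position scoring function. The sketch below is illustrative only, not the Caduceus API: it shows the weight-sharing trick of averaging a function's output on the forward strand with its RC-flipped output on the reverse strand, which guarantees equivariance by construction.

```python
# Toy sketch of reverse-complement (RC) equivariance; function names
# are illustrative assumptions, not the actual Caduceus interface.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def base_scores(seq: str) -> list[int]:
    """Arbitrary position-wise function standing in for a model layer
    (here: 1 for purines A/G, 0 for pyrimidines C/T)."""
    return [1 if b in "AG" else 0 for b in seq]

def rc_equivariant_scores(seq: str) -> list[float]:
    """Average the forward-strand output with the RC-flipped output on
    the reverse strand, the parameter-sharing construction that makes
    the combined function RC-equivariant."""
    fwd = base_scores(seq)
    rev = base_scores(reverse_complement(seq))[::-1]
    return [(a + b) / 2 for a, b in zip(fwd, rev)]
```

With this construction, scoring the reverse complement of a sequence yields exactly the reversed scores of the original sequence, which is the equivariance property the real model enforces at the architecture level.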
## Implementation Details
The model was pre-trained on the human reference genome using sequences of length 131,072 for 50,000 steps, with each step processing approximately 1 million base pairs. Its architecture consists of 16 MambaDNA layers with a hidden dimension of 256.
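A quick back-of-the-envelope check of what these pre-training figures imply; note that the human genome size used below (~3.1 Gbp) is an outside assumption, not a number from this model card.

```python
# Rough pre-training scale implied by the figures above.
steps = 50_000
bp_per_step = 1_000_000              # ~1 million base pairs per step
total_bp = steps * bp_per_step       # total base pairs seen: 5e10

seq_len = 131_072
batch_size = bp_per_step / seq_len   # implied batch of ~8 sequences per step

human_genome_bp = 3.1e9              # assumption: ~3.1 Gbp reference genome
coverage = total_bp / human_genome_bp  # roughly 16 passes over the genome
```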
- Reverse complement equivariant architecture
- Double-sized hidden state compared to non-RC models
- Supports masked language modeling
- Flexible downstream task adaptation
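As a sketch of how masked language modeling inputs for such a model are typically constructed: the BERT-style 15% masking rate and the mask token below are generic conventions assumed for illustration, not necessarily the exact Caduceus recipe.

```python
import random

def mask_sequence(seq: str, mask_token: str = "?", rate: float = 0.15,
                  seed: int = 0) -> tuple[str, dict[int, str]]:
    """Replace a random fraction of bases with a mask token; return the
    masked sequence and the positions/original bases to be recovered.
    Rate and mask token are illustrative assumptions."""
    rng = random.Random(seed)
    chars = list(seq)
    targets: dict[int, str] = {}
    for i, base in enumerate(chars):
        if rng.random() < rate:
            targets[i] = base   # remember the true base at this position
            chars[i] = mask_token
    return "".join(chars), targets
```

During masked pre-training, the loss is computed only on the masked positions, so the model learns to reconstruct bases from their long-range bidirectional context.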
## Core Capabilities
- Long-range DNA sequence modeling up to 131k base pairs
- Bi-directional sequence processing
- Efficient masked language modeling
- Support for both pre-trained usage and custom training
## Frequently Asked Questions
**Q: What makes this model unique?**
Its built-in reverse complement (RC) equivariance sets it apart: the model processes DNA sequences without requiring explicit RC data augmentation, which makes it particularly efficient for genomic analysis tasks.
**Q: What are the recommended use cases?**
The model is well suited to DNA sequence analysis tasks that require long-range understanding of genomic sequences. It is especially useful for masked language modeling in genomics and can be fine-tuned for specific downstream applications in computational biology.