# Caduceus DNA Sequence Model
| Property | Value |
|---|---|
| Parameter Count | 7.73M |
| License | Apache-2.0 |
| Paper | arXiv:2403.03234 |
| Sequence Length | 131,072 |
| Architecture | 16 MambaDNA layers, 256 hidden dimension |
## What is caduceus-ps_seqlen-131k_d_model-256_n_layer-16?
This is a specialized DNA sequence model developed by the Kuleshov Group, built on the MambaDNA state-space architecture (not a transformer) and designed for long-range DNA sequence analysis. The model is reverse complement (RC) equivariant by construction via parameter sharing (the "PS" in its name), eliminating the need for RC data augmentation during both pre-training and fine-tuning.
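RC equivariance can be illustrated with a toy per-position scoring function. The sketch below is illustrative only, not the Caduceus API: it shows the weight-sharing trick of averaging a function's output on the forward strand with its RC-flipped output on the reverse strand, which guarantees equivariance by construction.

```python
# Toy sketch of reverse-complement (RC) equivariance; function names
# are illustrative assumptions, not the actual Caduceus interface.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def base_scores(seq: str) -> list[int]:
    """Arbitrary position-wise function standing in for a model layer
    (here: 1 for purines A/G, 0 for pyrimidines C/T)."""
    return [1 if b in "AG" else 0 for b in seq]

def rc_equivariant_scores(seq: str) -> list[float]:
    """Average the forward-strand output with the RC-flipped output on
    the reverse strand, the parameter-sharing construction that makes
    the combined function RC-equivariant."""
    fwd = base_scores(seq)
    rev = base_scores(reverse_complement(seq))[::-1]
    return [(a + b) / 2 for a, b in zip(fwd, rev)]
```

With this construction, scoring the reverse complement of a sequence yields exactly the reversed scores of the original sequence, which is the equivariance property the real model enforces at the architecture level.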
## Implementation Details
The model was pre-trained on the human reference genome using sequences of length 131,072 for 50,000 steps, with each step processing approximately 1 million base pairs. Its architecture consists of 16 MambaDNA layers with a hidden dimension of 256.
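A quick back-of-the-envelope check of what these pre-training figures imply; note that the human genome size used below (~3.1 Gbp) is an outside assumption, not a number from this model card.

```python
# Rough pre-training scale implied by the figures above.
steps = 50_000
bp_per_step = 1_000_000              # ~1 million base pairs per step
total_bp = steps * bp_per_step       # total base pairs seen: 5e10

seq_len = 131_072
batch_size = bp_per_step / seq_len   # implied batch of ~8 sequences per step

human_genome_bp = 3.1e9              # assumption: ~3.1 Gbp reference genome
coverage = total_bp / human_genome_bp  # roughly 16 passes over the genome
```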
- Reverse complement equivariant architecture
- Double-sized hidden state compared to non-RC models
- Supports masked language modeling
- Flexible downstream task adaptation
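As a sketch of how masked language modeling inputs for such a model are typically constructed: the BERT-style 15% masking rate and the mask token below are generic conventions assumed for illustration, not necessarily the exact Caduceus recipe.

```python
import random

def mask_sequence(seq: str, mask_token: str = "?", rate: float = 0.15,
                  seed: int = 0) -> tuple[str, dict[int, str]]:
    """Replace a random fraction of bases with a mask token; return the
    masked sequence and the positions/original bases to be recovered.
    Rate and mask token are illustrative assumptions."""
    rng = random.Random(seed)
    chars = list(seq)
    targets: dict[int, str] = {}
    for i, base in enumerate(chars):
        if rng.random() < rate:
            targets[i] = base   # remember the true base at this position
            chars[i] = mask_token
    return "".join(chars), targets
```

During masked pre-training, the loss is computed only on the masked positions, so the model learns to reconstruct bases from their long-range bidirectional context.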
## Core Capabilities
- Long-range DNA sequence modeling up to 131k base pairs
- Bi-directional sequence processing
- Efficient masked language modeling
- Support for both pre-trained usage and custom training
## Frequently Asked Questions
**Q: What makes this model unique?**
Its built-in reverse complement (RC) equivariance sets it apart: the model processes DNA sequences without requiring explicit RC data augmentation, which makes it particularly efficient for genomic analysis tasks.
**Q: What are the recommended use cases?**
The model is well suited to DNA sequence analysis tasks that require long-range understanding of genomic sequences. It is especially useful for masked language modeling in genomics and can be fine-tuned for specific downstream applications in computational biology.