Nucleotide Transformer v2 500M Multi-Species
Property | Value |
---|---|
Parameter Count | 498M parameters |
Model Type | Masked Language Model |
License | CC-BY-NC-SA-4.0 |
Training Data | 850 diverse species genomes |
What is nucleotide-transformer-v2-500m-multi-species?
nucleotide-transformer-v2-500m-multi-species is a foundation model for DNA sequence analysis developed by InstaDeep, NVIDIA, and TUM. It was pretrained on genomes from 850 species, totaling 174B nucleotides. Rather than relying on a single reference genome, as many earlier approaches do, it draws on genetic information from a wide range of organisms.
Implementation Details
The model uses a transformer architecture with rotary positional embeddings and Gated Linear Units. It was trained on 8 A100 80GB GPUs for 900B tokens, with an effective batch size of 1M tokens and a sequence length of 1,000 tokens. Training used the Adam optimizer with β1 = 0.9, β2 = 0.999, and ε = 1e-8, which are the optimizer's standard default values.
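As a rough sanity check, the training figures above can be turned into a few derived quantities. This is a back-of-the-envelope sketch, not reported data: it assumes "1M tokens" means exactly 10^6, that every token is a full 6-mer (the tokenization listed in this section), and that the 174B-nucleotide corpus figure from the overview applies. The variable names are illustrative.

```python
# Figures quoted in the text (treated as exact for this estimate).
total_tokens = 900 * 10**9        # tokens seen during training
batch_tokens = 10**6              # effective batch size, in tokens
seq_len_tokens = 1_000            # tokens per training sequence
corpus_nucleotides = 174 * 10**9  # size of the multi-species corpus
kmer = 6                          # nucleotides per token (assumes full 6-mers)

optimizer_steps = total_tokens // batch_tokens        # parameter updates
sequences_per_batch = batch_tokens // seq_len_tokens  # sequences per update
max_nt_per_sequence = seq_len_tokens * kmer           # upper bound on nucleotides per sequence
corpus_tokens = corpus_nucleotides // kmer            # approximate corpus size in tokens
passes_over_corpus = total_tokens / corpus_tokens     # approximate number of epochs

print(optimizer_steps)               # 900000
print(sequences_per_batch)           # 1000
print(max_nt_per_sequence)           # 6000
print(corpus_tokens)                 # 29000000000
print(round(passes_over_corpus, 1))  # 31.0
```

Under these assumptions, training amounts to about 900,000 updates of 1,000 sequences each, covering the corpus roughly 31 times.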
- Uses a 6-mer tokenization strategy with a vocabulary size of 4,105
- Implements BERT-style masking with a 15% token masking rate
- Encodes token positions with rotary positional embeddings
- Supports a maximum sequence length of 1,000 tokens
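The tokenization and masking steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the released tokenizer or training code: the single-nucleotide fallback for stretches that do not form a clean 6-mer, the `<mask>` token string, and the 80/10/10 corruption split (the usual BERT recipe) are assumptions not stated in this section, and `tokenize_kmer` and `bert_mask` are hypothetical helper names.

```python
import random

NUCLEOTIDES = set("ACGT")
K = 6
MASK_TOKEN = "<mask>"  # assumed placeholder; the real tokenizer's mask string may differ


def tokenize_kmer(seq, k=K):
    """Greedy left-to-right k-mer tokenization.

    Assumption: when the next k characters are not all A/C/G/T (or fewer
    than k remain), a single character is emitted as its own token.
    """
    seq = seq.upper()
    tokens, i = [], 0
    while i < len(seq):
        chunk = seq[i:i + k]
        if len(chunk) == k and set(chunk) <= NUCLEOTIDES:
            tokens.append(chunk)
            i += k
        else:
            tokens.append(seq[i])
            i += 1
    return tokens


def bert_mask(tokens, vocab, rng, mask_rate=0.15):
    """Return (corrupted, labels); labels are None where no loss is computed.

    Selects mask_rate of the positions. Assumed BERT-style split for the
    selected positions: 80% become MASK_TOKEN, 10% become a random
    vocabulary token, 10% are left unchanged.
    """
    n_select = round(len(tokens) * mask_rate)
    selected = rng.sample(range(len(tokens)), n_select)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for pos in selected:
        labels[pos] = tokens[pos]  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = MASK_TOKEN
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)
        # otherwise the token is kept as-is but still predicted
    return corrupted, labels


tokens = tokenize_kmer("ATCCCGGNNTCGACACN")
print(tokens)  # ['ATCCCG', 'G', 'N', 'N', 'TCGACA', 'C', 'N']

# 4**6 possible 6-mers, leaving 9 of the 4,105 vocabulary entries for other tokens.
print(4 ** K, 4105 - 4 ** K)  # 4096 9
```

The 9 remaining vocabulary entries are presumably single-nucleotide and special tokens, though the text does not spell this out.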
Core Capabilities
- DNA sequence analysis and prediction
- Molecular phenotype prediction
- Multi-species genome analysis
- Masked sequence completion
- Generation of sequence embeddings
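One common way to obtain a single sequence embedding from a masked language model is to average its per-token hidden states while ignoring padding. The sketch below illustrates that pooling step only. It assumes the hidden states and attention mask have already been produced by the model, and it uses toy numbers and plain lists instead of a tensor library; `mean_pool` is a hypothetical helper, not part of any released API.

```python
def mean_pool(hidden_states, attention_mask):
    """Average per-token vectors over positions where attention_mask is 1."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:  # skip padding positions
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Toy example: three token vectors of width 2, the last one being padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
embedding = mean_pool(hidden, mask)
print(embedding)  # [2.0, 3.0]
```

Excluding padded positions matters when sequences in a batch have different lengths, since otherwise the padding vectors would skew the average.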
Frequently Asked Questions
Q: What makes this model unique?
The model's distinguishing feature is its multi-species training set: it incorporates genetic information from 850 species rather than a single reference genome. This broader training basis is intended to make molecular phenotype predictions more robust and accurate than approaches built on one reference genome.
Q: What are the recommended use cases?
The model is suited to genomic research, DNA sequence analysis, and molecular phenotype prediction. Typical tasks include masked sequence completion, pattern recognition in DNA sequences, and generating embeddings for downstream genomic tasks.