Nucleotide Transformer v2 500M Multi-Species

Property	Value
Parameter Count	498M parameters
Model Type	Masked Language Model
License	CC-BY-NC-SA-4.0
Paper	View Paper
Training Data	850 diverse species genomes

What is nucleotide-transformer-v2-500m-multi-species?

This is a foundation model for DNA sequence analysis developed by InstaDeep, NVIDIA, and TUM. It was trained on the genomes of 850 diverse species, totaling 174B nucleotides. Rather than relying on a single reference genome, it draws on genetic information from a wide range of organisms.

Implementation Details

The model is a transformer that uses rotary positional embeddings and Gated Linear Units. It was trained on 8 A100 80GB GPUs for 900B tokens, with an effective batch size of 1M tokens and a sequence length of 1,000 tokens. Training used the Adam optimizer with β1 = 0.9, β2 = 0.999, and ε = 1e-8.
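The training figures above imply a few derived quantities, worked through in the sketch below. It assumes that "900B" and "1M" are exact powers of ten and that tokens are mostly non-overlapping 6-mers, so one token covers about six nucleotides. The constant and function names are illustrative only.

```python
# Figures stated in the text.
TOTAL_TOKENS = 900 * 10**9      # tokens seen during training
BATCH_TOKENS = 10**6            # effective batch size, in tokens
SEQ_LEN = 1000                  # tokens per training sequence
DATASET_NT = 174 * 10**9        # nucleotides in the training set

# Assumption: each token is a non-overlapping 6-mer.
KMER = 6


def training_budget():
    """Derive rough training quantities from the stated figures."""
    return {
        # 900e9 / 1e6 = 900,000 optimizer updates
        "optimizer_steps": TOTAL_TOKENS // BATCH_TOKENS,
        # 1e6 / 1000 = 1,000 sequences per effective batch
        "sequences_per_batch": BATCH_TOKENS // SEQ_LEN,
        # 1000 tokens * 6 nt = 6,000 nucleotides per sequence (approximate)
        "nucleotides_per_sequence": SEQ_LEN * KMER,
        # 174e9 nt / 6 is about 29e9 tokens, so roughly 31 passes over the data
        "approx_passes_over_data": TOTAL_TOKENS / (DATASET_NT / KMER),
    }


if __name__ == "__main__":
    for name, value in training_budget().items():
        print(f"{name}: {value}")
```

The number of passes is only an estimate, since the exact token count of the dataset depends on how often the tokenizer falls back to shorter tokens.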

Uses a 6-mer tokenization strategy with a vocabulary size of 4,105
Applies BERT-style masking to 15% of tokens
Uses rotary positional embeddings to encode token positions
Supports a maximum sequence length of 1,000 tokens
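The 6-mer tokenization listed above can be illustrated with a minimal sketch. It assumes non-overlapping 6-mers read left to right, with a fallback to single-nucleotide tokens wherever a full 6-mer of A, C, G, T is not available (for example near an N or at the end of the sequence). The function names and the fallback rule are assumptions for illustration, not the model's actual tokenizer, and no token ids are assigned.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
K = 6


def kmer_vocab(k=K):
    """All k-mers over A, C, G, T: 4**k entries."""
    return ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]


def tokenize(seq, k=K):
    """Greedy left-to-right split into k-mers, with single-letter fallback."""
    seq = seq.upper()
    tokens = []
    i = 0
    while i < len(seq):
        chunk = seq[i:i + k]
        if len(chunk) == k and set(chunk) <= set(NUCLEOTIDES):
            tokens.append(chunk)    # a full, unambiguous k-mer
            i += k
        else:
            tokens.append(seq[i])   # fallback: emit one nucleotide
            i += 1
    return tokens


if __name__ == "__main__":
    print(tokenize("ATTCCGATTCCGAT"))
    print(len(kmer_vocab()))
```

There are 4^6 = 4,096 possible 6-mers, so the remaining 9 entries of the 4,105-token vocabulary presumably cover single-nucleotide and special tokens.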

Core Capabilities

DNA sequence analysis and prediction
Molecular phenotype prediction
Multi-species genome analysis
Masked sequence completion
Generation of sequence embeddings
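The masking objective behind masked sequence completion can be sketched as follows. The 15% selection rate comes from the description above; the 80/10/10 split between mask token, random token, and unchanged token is the standard BERT recipe and is assumed here rather than confirmed for this model. The `<mask>` string and the function name are hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask token string


def bert_mask(tokens, vocab, rate=0.15, rng=None):
    """Return (corrupted_tokens, labels) using BERT-style masking.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would only be computed where labels[i] is set.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= rate:
            continue                            # position not selected
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK                 # 80%: replace with mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)    # 10%: replace with random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


if __name__ == "__main__":
    vocab = ["ATTCCG", "GGGAAA", "CCCTTT", "ACGTAC"]
    rng = random.Random(0)
    tokens = [rng.choice(vocab) for _ in range(20)]
    corrupted, labels = bert_mask(tokens, vocab, rng=rng)
    print(corrupted)
    print(labels)
```

At inference time, sequence completion amounts to placing the mask token at the positions of interest and reading off the model's predicted tokens there.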

Frequently Asked Questions

Q: What makes this model unique?

The model is trained on genetic information from 850 species rather than on a single reference genome. This broader training basis is intended to yield more robust and accurate molecular phenotype predictions than approaches built on a single reference genome.

Q: What are the recommended use cases?

The model is suited to genomic research, DNA sequence analysis, and molecular phenotype prediction. Typical tasks include masked sequence completion, pattern recognition in DNA sequences, and generating embeddings for downstream genomic tasks.
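For downstream tasks, per-token embeddings are often reduced to one vector per sequence. The sketch below shows one common choice, mean pooling over non-padding positions. The text does not say which pooling the authors use, so this is an assumption, and the plain-list inputs stand in for whatever tensors the model actually returns.

```python
def mean_pool(hidden_states, attention_mask):
    """Average per-token vectors, ignoring padded positions.

    hidden_states: list of equal-length vectors, one per token.
    attention_mask: list of 0/1 flags, 1 for real tokens, 0 for padding.
    """
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


if __name__ == "__main__":
    # Two real tokens followed by one padding token (toy 2-dimensional vectors).
    states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    mask = [1, 1, 0]
    print(mean_pool(states, mask))
```

The resulting fixed-size vector can then be fed to a small classifier or regressor for a task such as molecular phenotype prediction.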