nucleotide-transformer-v2-500m-multi-species

Maintained By
InstaDeepAI

Nucleotide Transformer v2 500M Multi-Species

Property         Value
Parameter Count  498M parameters
Model Type       Masked Language Model
License          CC-BY-NC-SA-4.0
Paper            View Paper
Training Data    850 diverse species genomes

What is nucleotide-transformer-v2-500m-multi-species?

This is a foundation model for DNA sequence analysis, developed by InstaDeep, NVIDIA, and TUM. It was pre-trained on 850 genomes from a wide range of species, totaling 174B nucleotides. Unlike approaches that rely on a single reference genome, it draws on sequence information from many different organisms.

Implementation Details

The model is a transformer that uses rotary positional embeddings and Gated Linear Units. It was trained on 8 A100 80GB GPUs for 900B tokens, with an effective batch size of 1M tokens and a sequence length of 1,000 tokens. Training used the Adam optimizer (β1 = 0.9, β2 = 0.999, ε = 1e-8).
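To put the training figures in perspective, the following back-of-the-envelope sketch derives a few quantities from them. It assumes that "1M" means exactly 10^6 tokens and that every token is a full 6-mer (the tokenization listed in this section), so the base-pair and dataset-token figures are upper and lower bounds respectively rather than reported values.

```python
# Figures quoted in this model card.
TOTAL_TRAINING_TOKENS = 900 * 10**9   # 900B tokens seen during training
BATCH_TOKENS = 10**6                  # effective batch size (assumed to be exactly 10^6)
SEQ_LEN_TOKENS = 1000                 # tokens per training sequence
DATASET_NUCLEOTIDES = 174 * 10**9     # 174B nucleotides across 850 genomes
K = 6                                 # 6-mer tokenization

# Number of optimizer steps needed to consume all training tokens.
optimizer_steps = TOTAL_TRAINING_TOKENS // BATCH_TOKENS

# Number of sequences packed into one effective batch.
sequences_per_batch = BATCH_TOKENS // SEQ_LEN_TOKENS

# Upper bound on base pairs covered by one sequence (all tokens are 6-mers).
max_bp_per_sequence = SEQ_LEN_TOKENS * K

# Lower bound on dataset size in tokens (all tokens are 6-mers).
dataset_tokens = DATASET_NUCLEOTIDES // K

# Approximate number of passes over the dataset under these assumptions.
approx_passes = TOTAL_TRAINING_TOKENS / dataset_tokens

print(optimizer_steps)          # 900000
print(sequences_per_batch)      # 1000
print(max_bp_per_sequence)      # 6000
print(dataset_tokens)           # 29000000000
print(round(approx_passes, 1))  # 31.0
```

Under these assumptions the model sees the training corpus roughly 31 times, and each training sequence spans at most about 6 kb of DNA.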

  • Uses a 6-mer tokenization strategy with a vocabulary size of 4,105
  • Implements BERT-style masking with 15% token masking rate
  • Encodes token positions with rotary positional embeddings
  • Supports maximum sequence length of 1,000 tokens
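The tokenization and masking steps listed above can be illustrated with a minimal pure-Python sketch. It is not the model's actual tokenizer. The function names, the fallback to single-nucleotide tokens when a full 6-mer cannot be formed, and the `<mask>` token string are assumptions made for illustration. The masking is also simplified: every selected position is replaced by the mask token, whereas the original BERT recipe also leaves some selected tokens unchanged or swaps in random ones.

```python
import random
from itertools import product

NUCLEOTIDES = "ACGT"
K = 6
MASK_TOKEN = "<mask>"  # hypothetical mask token string


def build_kmer_vocab(k=K):
    """Map every k-mer over A/C/G/T to an integer id (4**k entries)."""
    return {"".join(p): i for i, p in enumerate(product(NUCLEOTIDES, repeat=k))}


def tokenize(seq, k=K):
    """Split seq into non-overlapping k-mers, left to right.

    Assumption: when the next k characters are not a clean A/C/G/T k-mer
    (for example near the end of the sequence or around an 'N'), a single
    character is emitted as its own token instead.
    """
    seq = seq.upper()
    tokens = []
    i = 0
    while i < len(seq):
        chunk = seq[i:i + k]
        if len(chunk) == k and set(chunk) <= set(NUCLEOTIDES):
            tokens.append(chunk)
            i += k
        else:
            tokens.append(seq[i])
            i += 1
    return tokens


def mask_tokens(tokens, rate=0.15, rng=None):
    """Replace about `rate` of the tokens with MASK_TOKEN.

    Returns the masked token list and the sorted masked positions.
    """
    rng = rng or random.Random()
    n_mask = min(len(tokens), max(1, round(rate * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK_TOKEN
    return masked, positions


vocab = build_kmer_vocab()
print(len(vocab))                # 4096
print(tokenize("ATTCCGATTCCG"))  # ['ATTCCG', 'ATTCCG']
print(tokenize("ATTCCGATTCNG"))  # ['ATTCCG', 'A', 'T', 'T', 'C', 'N', 'G']

masked, positions = mask_tokens(tokenize("ACGTAC" * 100), rng=random.Random(0))
print(len(positions))            # 15
```

The 4,096 possible 6-mers account for all but 9 entries of the 4,105-token vocabulary. The remaining entries are not itemized here; they presumably cover single-nucleotide and special tokens.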

Core Capabilities

  • DNA sequence analysis and prediction
  • Molecular phenotype prediction
  • Multi-species genome analysis
  • Masked sequence completion
  • Generation of sequence embeddings
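For the embedding capability, one common way to turn per-token outputs into a single sequence embedding is to average the hidden vectors over non-padding positions. The sketch below shows that pooling step in pure Python, assuming the model yields one hidden vector per token along with a 0/1 attention mask. The toy vectors stand in for real model outputs, and the helper names are hypothetical.

```python
import math


def mean_pool(hidden_states, attention_mask):
    """Average per-token vectors, ignoring positions where the mask is 0.

    hidden_states: list of equal-length lists of floats, one per token.
    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask differ in length")
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy example: three token positions, the last one is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

embedding = mean_pool(hidden, mask)
print(embedding)  # [2.0, 3.0]
```

The pooled vectors can then be compared with `cosine_similarity` or passed to a downstream classifier or regressor.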

Frequently Asked Questions

Q: What makes this model unique?

The model is trained on genomes from 850 species rather than on a single reference genome. This broader training basis is intended to make its molecular phenotype predictions more robust and accurate than those of single-genome approaches.

Q: What are the recommended use cases?

The model is suited to genomic research, DNA sequence analysis, and molecular phenotype prediction. Typical tasks include completing masked sequences, recognizing patterns in DNA sequences, and generating embeddings for downstream genomic tasks.
