GENA-LM BERT Base
| Property | Value |
|---|---|
| Architecture | BERT-based Transformer |
| Maximum Sequence Length | 512 tokens (≈4,500 nucleotides) |
| Hidden Size | 768 |
| Layers | 12 |
| Attention Heads | 12 |
| Vocabulary Size | 32,000 |
| Paper | bioRxiv |
What is gena-lm-bert-base?
GENA-LM is a foundational model designed for processing long DNA sequences. This BERT-based model improves on previous DNA language models chiefly in input length: efficient BPE tokenization packs roughly 4,500 nucleotides into a 512-token window, i.e. about 9 bp per token on average.
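The checkpoint can be loaded through the standard `transformers` auto classes. Below is a minimal sketch, assuming the Hugging Face Hub repository id `AIRI-Institute/gena-lm-bert-base` published by the GENA-LM authors; `trust_remote_code=True` is used because the repo ships a custom Pre-LN BERT implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

repo = "AIRI-Institute/gena-lm-bert-base"  # authors' Hub repo id (assumed)
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model.eval()

dna = "ATGGCGTACGTAGCTAGGCTAGCTAGCATCGATCGATCGTAGCTAGCT"
inputs = tokenizer(dna, return_tensors="pt")
with torch.no_grad():
    # output_hidden_states works whether the auto class resolves to the
    # bare encoder or to a headed variant of the custom BERT.
    outputs = model(**inputs, output_hidden_states=True)
embeddings = outputs.hidden_states[-1]  # (1, num_tokens, 768)
print(embeddings.shape)
```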
Implementation Details
The model uses a modified Transformer encoder with Pre-Layer normalization and was trained on the T2T (telomere-to-telomere) human genome assembly. Pre-training ran for 500,000 iterations with a masked language modeling objective, masking 15% of tokens following the BigBird methodology; a hedged prediction sketch appears after the list below.
- BPE tokenization instead of traditional k-mers
- Extended sequence length capability (up to ≈4,500 nucleotides)
- Pre-trained on T2T human genome assembly
- Modified Transformer with Pre-Layer normalization
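To make the pre-training objective concrete, here is a hedged masked-prediction sketch. It assumes the checkpoint exposes its MLM head through `AutoModelForMaskedLM`; check the model card for the exact class mapping if this fails.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo = "AIRI-Institute/gena-lm-bert-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)
model.eval()

dna = "ATGGCGTACGTAGCTAGGCTAGCTAGCATCGATCGATCGTAGCTAGCT"
inputs = tokenizer(dna, return_tensors="pt")

# Mask the middle BPE token and ask the model to reconstruct it,
# mirroring the 15% random masking used during pre-training.
pos = inputs["input_ids"].shape[1] // 2
original = inputs["input_ids"][0, pos].item()
inputs["input_ids"][0, pos] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits
predicted = logits[0, pos].argmax().item()
print(tokenizer.decode([original]), "->", tokenizer.decode([predicted]))
```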
Core Capabilities
- Long DNA sequence processing
- Masked language modeling for DNA
- Sequence classification
- Token classification (sketched after this list)
- Question answering
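As an illustration of per-token annotation, the sketch below attaches a token-classification head via the standard auto class. The two-label setup (e.g. regulatory vs. background) is a placeholder of mine, and the head is randomly initialized, so its outputs are meaningless until fine-tuned.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo = "AIRI-Institute/gena-lm-bert-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
# num_labels=2 is illustrative; the classification head is new and untrained.
model = AutoModelForTokenClassification.from_pretrained(
    repo, num_labels=2, trust_remote_code=True
)
model.eval()

inputs = tokenizer("ATGGCGTACGTAGCTAGGCTAGCTAGCATCGA", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, num_tokens, num_labels)
print(logits.argmax(-1))  # per-token label ids (random until fine-tuned)
```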
Frequently Asked Questions
Q: What makes this model unique?
GENA-LM stands out for processing much longer DNA sequences than earlier models such as DNABERT, for using BPE tokenization instead of overlapping k-mers (the two are contrasted in the sketch below), and for being trained on the latest T2T human genome assembly.
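The difference is easy to see by tokenizing the same sequence both ways. The 6-mer windowing below mimics DNABERT-style tokenization for contrast; the exact BPE tokens printed depend on the trained vocabulary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base")
dna = "ATGGCGTACGTAGCTAGGCTAGCTAGCATCGA"

# BPE: variable-length subsequences learned from the genome
# (roughly 9 bp per token on average, hence ~4,500 bp in 512 tokens).
print(tokenizer.tokenize(dna))

# DNABERT-style overlapping 6-mers, for contrast: one token per position.
print([dna[i:i + 6] for i in range(len(dna) - 5)][:5])
```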
Q: What are the recommended use cases?
The model is particularly suited to DNA sequence analysis, including sequence classification, token classification, and embedding extraction for downstream applications in genomics research. It can be fine-tuned for specific genomic analysis tasks; a minimal fine-tuning sketch follows.
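The sketch below uses the standard `transformers` `Trainer`. The toy two-example dataset, the binary labels (e.g. promoter vs. not), and the hyperparameters are all placeholders of mine standing in for a real labeled benchmark.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

repo = "AIRI-Institute/gena-lm-bert-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(
    repo, num_labels=2, trust_remote_code=True  # new head, randomly initialized
)

# Toy data standing in for a real labeled genomic dataset.
seqs = ["ATGGCGTACGTAGCTAGGCTAGCTAGC", "CCGGTTAACCGGTTAACCGGTTAACCG"]
labels = [1, 0]
enc = tokenizer(seqs, padding=True, truncation=True, max_length=512)

class DnaDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

args = TrainingArguments(output_dir="gena-ft", num_train_epochs=1,
                         per_device_train_batch_size=2, report_to="none")
Trainer(model=model, args=args, train_dataset=DnaDataset()).train()
```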