gena-lm-bert-base

AIRI-Institute

GENA-LM: Pre-trained transformer model for long DNA sequences (up to 4500 nucleotides), utilizing BPE tokenization and trained on T2T human genome assembly

Property | Value
Architecture | BERT-based Transformer
Maximum Sequence Length | 512 tokens (≈4500 nucleotides)
Hidden Size | 768
Layers | 12
Attention Heads | 12
Vocabulary Size | 32,000
Paper | bioRxiv

What is gena-lm-bert-base?

GENA-LM is a foundational language model designed for processing long DNA sequences. This BERT-based model uses BPE tokenization, which lets a 512-token input cover sequences of up to about 4500 nucleotides, a longer context than previous DNA language models could handle.

Implementation Details

The model employs a modified Transformer architecture with Pre-Layer normalization and was trained on the latest T2T human genome assembly. Pre-training ran for 500,000 iterations with a masked language modeling (MLM) objective in which 15% of tokens are masked, following the BigBird methodology.
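The masking step can be illustrated with a minimal sketch. It is not the authors' training code: the `[MASK]` symbol, the example tokens, and the `mask_tokens` helper are assumptions made for illustration, and every selected token is simply replaced by the mask symbol (the replacement variants used in some BERT-style recipes are left out).

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol, following the usual BERT convention
IGNORE = None          # label for positions that do not contribute to the loss


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels) with about mask_rate of positions masked."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    # Masked positions show the mask symbol; the label keeps the original token.
    masked = [MASK_TOKEN if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else IGNORE for i, t in enumerate(tokens)]
    return masked, labels


# Hypothetical BPE-like DNA tokens of varying length (not real vocabulary entries).
tokens = ["ATTG", "CCA", "GGTTAC", "A", "TTTACG", "GC", "CAGT", "TAAC",
          "GGA", "CTTG", "AAGC", "T", "GGCATT", "AC", "TGCA", "CCGT",
          "AATT", "GAC", "TCG", "GATTACA"]
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

During pre-training the model sees `masked` as input and is scored only on the positions where `labels` holds a token.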

  • BPE tokenization instead of traditional k-mers
  • Extended sequence length capability (4500 nucleotides)
  • Pre-trained on T2T human genome assembly
  • Modified Transformer with Pre-Layer normalization
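To make the Pre-Layer normalization point concrete, here is a small sketch contrasting the two orderings of layer normalization around a residual connection. It uses plain Python lists and a hypothetical `toy_sublayer` in place of attention or feed-forward layers, and the layer norm has no learned scale or shift, so it illustrates the ordering only and is not the model's implementation.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Post-LN ordering: normalize after the residual addition."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Pre-LN ordering: normalize the sublayer input, keep the residual path as-is."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or feed-forward: halves each value."""
    return [0.5 * v for v in x]


x = [1.0, 2.0, 4.0, 9.0]
pre = pre_ln_block(x, toy_sublayer)
post = post_ln_block(x, toy_sublayer)
print(pre)
print(post)
```

In the Pre-LN variant the input `x` passes through the residual path unchanged, which is the property usually cited as making deep Transformers easier to train.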

Core Capabilities

  • Long DNA sequence processing
  • Masked Language Modeling for DNA
  • Sequence classification tasks
  • Token classification capabilities
  • Question-answering functionalities

Frequently Asked Questions

Q: What makes this model unique?

GENA-LM differs from previous models such as DNABERT in three ways: it processes much longer DNA sequences (about 4500 nucleotides per input), it uses BPE tokenization instead of k-mers, and it was trained on the latest T2T human genome assembly.
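The effect of the tokenization choice on input length can be sketched with some simple arithmetic. The example assumes overlapping 6-mers taken one nucleotide apart for the k-mer case; the `kmer_tokenize` and `kmer_span` helpers are illustrative and not part of any library. Special tokens that also occupy positions in the 512-token window are ignored.

```python
def kmer_tokenize(seq, k=6, stride=1):
    """Split a DNA string into k-mers taken every `stride` nucleotides."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_span(n_tokens, k=6, stride=1):
    """Number of nucleotides covered by n_tokens consecutive k-mers."""
    return (n_tokens - 1) * stride + k


MAX_TOKENS = 512
BPE_NUCLEOTIDES = 4500  # approximate figure quoted for this model

kmers = kmer_tokenize("ATGCGTACGTTAGC", k=6)
print(kmers[:3])                     # ['ATGCGT', 'TGCGTA', 'GCGTAC']
print(kmer_span(MAX_TOKENS))         # 517
print(BPE_NUCLEOTIDES / MAX_TOKENS)  # 8.7890625 nucleotides per BPE token on average
```

Under these assumptions a 512-token window of overlapping 6-mers spans roughly 500 nucleotides, while the same window of BPE tokens spans about 4500.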

Q: What are the recommended use cases?

The model is particularly suited for DNA sequence analysis tasks, including sequence classification, token classification, and general DNA sequence understanding. It can be fine-tuned for specific genomic analysis tasks and supports various downstream applications in genomics research.
