ProstT5
| Property | Value |
|---|---|
| License | MIT |
| Model Type | Encoder-decoder (T5) |
| Base Model | ProtT5-XL-U50 |
| Author | Rostlab |
What is ProstT5?
ProstT5 is a protein language model that bridges the gap between protein sequences and structures. Built upon the ProtT5-XL-U50 architecture, it was fine-tuned on 17M high-quality protein structures from AlphaFoldDB. Its distinguishing capability is translation between protein sequences (amino acids) and structural representations (3Di tokens).
Implementation Details
The model employs a two-phase training approach: first it learns to represent 3Di tokens through span denoising, then it is trained for bidirectional translation between sequences and structures. The special prefix tokens "<AA2fold>" and "<fold2AA>" indicate the direction of translation.
- Supports both feature extraction and sequence-structure translation
- Utilizes DeepSpeed stage-2 with gradient accumulation
- Implements mixed precision (bf16) and PyTorch 2.0's TorchInductor compiler
- Processing speed: ~0.1s per protein for embeddings, 0.6-2.5s for translation
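The directional prefix tokens and casing conventions above can be sketched as a small input-formatting helper. The prefix strings and the uppercase-AA / lowercase-3Di convention follow the model's design; the helper name, the `direction` keys, and the U/Z/O/B-to-X mapping of rare amino acids are illustrative assumptions, not the official preprocessing code.

```python
import re

# Direction prefixes ProstT5 uses to select the translation target
# (amino acids -> 3Di structure tokens, or the reverse).
PREFIX = {"aa2fold": "<AA2fold>", "fold2aa": "<fold2AA>"}

def preprocess(sequence: str, direction: str) -> str:
    """Format one sequence for ProstT5-style input.

    - amino-acid input ("aa2fold"): uppercase, rare residues U/Z/O/B -> X
    - 3Di input ("fold2aa"): lowercase structure tokens
    - residues are space-separated and a direction prefix is prepended
    """
    if direction == "aa2fold":
        seq = re.sub(r"[UZOB]", "X", sequence.upper())
    else:
        seq = sequence.lower()
    return PREFIX[direction] + " " + " ".join(seq)

print(preprocess("PRTEINO", "aa2fold"))  # -> "<AA2fold> P R T E I N X"
print(preprocess("dvqlv", "fold2aa"))    # -> "<fold2AA> d v q l v"
```

The formatted strings would then be passed to the tokenizer as usual; the prefix tells the shared encoder-decoder which modality to produce.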
Core Capabilities
- Protein sequence to structure translation ("folding")
- Structure to sequence translation ("inverse folding")
- Feature extraction for both amino acid and 3Di sequences
- Remote homology detection through Foldseek integration
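Feature extraction produces one embedding per residue; downstream tasks such as remote homology detection or function prediction typically mean-pool these into a fixed-size per-protein vector. A minimal sketch of that pooling step, assuming a padded `(max_len, dim)` array of per-residue embeddings (the shapes and helper name are assumptions for illustration):

```python
import numpy as np

def mean_pool(per_residue: np.ndarray, length: int) -> np.ndarray:
    """Collapse a (max_len, dim) per-residue embedding matrix for one
    protein into a (dim,) per-protein vector, ignoring padded positions
    beyond the sequence's true length."""
    return per_residue[:length].mean(axis=0)

# Toy example: 5 padded positions, true length 3, embedding dim 4.
emb = np.arange(20, dtype=float).reshape(5, 4)
print(mean_pool(emb, 3))  # mean of rows 0..2 -> [4. 5. 6. 7.]
```

Whether mean-pooling, max-pooling, or a learned head works best depends on the downstream task; mean-pooling is a common default for similarity search.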
Frequently Asked Questions
Q: What makes this model unique?
ProstT5's ability to perform bidirectional translation between protein sequences and structures, while maintaining the capability to generate meaningful embeddings for both modalities, sets it apart from traditional protein language models.
Q: What are the recommended use cases?
The model excels in protein structure prediction, sequence design based on structural constraints, and generating protein embeddings for downstream tasks like remote homology detection and protein function prediction.