ProtT5-XL-UniRef50
| Property | Value |
|---|---|
| Architecture | T5-based (modified T5-3B) |
| Parameters | 3 billion |
| Training Data | UniRef50 (45M protein sequences) |
| Paper | ProtTrans (Elnaggar et al., 2021) |
What is prot_t5_xl_uniref50?
ProtT5-XL-UniRef50 is a state-of-the-art protein language model based on the T5 architecture, specifically designed for understanding and analyzing protein sequences. Trained on 45 million protein sequences from UniRef50, this model employs a modified masked language modeling approach to capture the intricate patterns and properties of protein sequences.
Implementation Details
The model uses a BART-like MLM denoising objective, masking 15% of amino acids during training. It was trained on a TPU Pod V2-256 for 991.5k steps, starting from the ProtT5-XL-BFD model as an initial checkpoint. The model processes uppercase amino acids only and maps rare amino acids (U, Z, O, B) to 'X' (a short preprocessing sketch follows the list below).
- Sequence length: 512 tokens
- Batch size: 2,000
- Optimizer: AdaFactor with inverse square root learning rate
- Training infrastructure: TPU Pod V2-256 (the wider ProtTrans project also used 936 nodes with 5,616 GPUs)
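The input preprocessing described above can be expressed in a few lines. The sketch below follows the whitespace-separated residue convention used by the ProtT5 family of tokenizers; the example sequence itself is made up.

```python
import re

# Hypothetical example sequence; replace with a real protein.
sequence = "mslEQkkGADIISKILqiqnsigktuspstlktklseisrkeqenariqskl"

# ProtT5 works on uppercase amino acids only, maps the rare residues
# U, Z, O, B to the unknown residue 'X', and expects residues to be
# whitespace-separated so that each one becomes a single token.
sequence = sequence.upper()
sequence = re.sub(r"[UZOB]", "X", sequence)
spaced = " ".join(sequence)

print(spaced)  # "M S L E Q K K G A D ..."
```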
Core Capabilities
- Secondary structure prediction (3-state: 81-87% accuracy; 8-state: 70-77% accuracy)
- Subcellular localization prediction (81% accuracy)
- Membrane protein prediction (91% accuracy)
- Feature extraction for downstream tasks (see the embedding-extraction sketch after this list)
- Fine-tuning capabilities for specific applications
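For feature extraction, the encoder half of the model is typically loaded on its own. The following is a minimal sketch using the Hugging Face `transformers` library; the checkpoint name `Rostlab/prot_t5_xl_uniref50` and the example sequence are assumptions, and the per-residue embeddings have a hidden size of 1024.

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed Hugging Face Hub checkpoint name for this model.
model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name).to(device).eval()

# Hypothetical input sequences, preprocessed as described above.
sequences = ["MSLEQKKGADIISKILQIQNSIGKTTSPSTLKTKLSEISRKEQENARIQSKL"]
sequences = [" ".join(re.sub(r"[UZOB]", "X", s.upper())) for s in sequences]

# Tokenize with padding; the tokenizer appends an end-of-sequence token.
batch = tokenizer(sequences, padding=True, return_tensors="pt").to(device)

with torch.no_grad():
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"])

# Per-residue embeddings, shape (batch, sequence_length, 1024).
per_residue = out.last_hidden_state
print(per_residue.shape)
```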
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its scale (3B parameters) and its ability to capture biophysical properties of proteins through self-supervised learning, effectively learning the "grammar" of protein sequences without human labeling.
Q: What are the recommended use cases?
The model excels at protein feature extraction and secondary structure prediction, and can be fine-tuned for specific downstream tasks in protein analysis. It is particularly effective when features are extracted from its encoder rather than its decoder; a per-protein pooling sketch follows below.
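Building on the extraction sketch above, a fixed-length per-protein embedding for downstream tasks (e.g. localization or membrane-protein prediction) is commonly obtained by mean-pooling the encoder states over the real residues. This is one simple pooling choice, not a prescribed recipe.

```python
# Mean-pool per-residue encoder states into one 1024-dimensional vector per protein.
# The attention mask marks real tokens (the trailing end-of-sequence token is kept
# here for brevity); padding positions are excluded from the average.
mask = batch["attention_mask"].unsqueeze(-1).float()             # (batch, seq_len, 1)
per_protein = (per_residue * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 1024)

# per_protein can now be fed to any lightweight downstream classifier or regressor.
print(per_protein.shape)
```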