ProtT5-XL-UniRef50
| Property | Value |
|---|---|
| Architecture | T5-based (modified T5-3B) |
| Parameters | 3 billion |
| Training Data | UniRef50 (45M protein sequences) |
| Paper | ProtTrans (Elnaggar et al., 2021) |
What is prot_t5_xl_uniref50?
ProtT5-XL-UniRef50 is a state-of-the-art protein language model based on the T5 architecture, specifically designed for understanding and analyzing protein sequences. Trained on 45 million protein sequences from UniRef50, this model employs a modified masked language modeling approach to capture the intricate patterns and properties of protein sequences.
Implementation Details
The model uses a BART-like MLM denoising objective, masking 15% of amino acids during training. It was trained on a TPU Pod V2-256 for 991.5k steps, starting from the ProtT5-XL-BFD model as an initial checkpoint. The model processes uppercase amino acids only and maps rare amino acids (U, Z, O, B) to 'X' (a short preprocessing sketch follows the list below).
- Sequence length: 512 tokens
- Batch size: 2,000
- Optimizer: AdaFactor with inverse square root learning rate
- Training infrastructure: TPU Pod V2-256 (the wider ProtTrans project also used 936 nodes with 5,616 GPUs)
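The input preprocessing described above can be expressed in a few lines. The sketch below follows the whitespace-separated residue convention used by the ProtT5 family of tokenizers; the example sequence itself is made up.

```python
import re

# Hypothetical example sequence; replace with a real protein.
sequence = "mslEQkkGADIISKILqiqnsigktuspstlktklseisrkeqenariqskl"

# ProtT5 works on uppercase amino acids only, maps the rare residues
# U, Z, O, B to the unknown residue 'X', and expects residues to be
# whitespace-separated so that each one becomes a single token.
sequence = sequence.upper()
sequence = re.sub(r"[UZOB]", "X", sequence)
spaced = " ".join(sequence)

print(spaced)  # "M S L E Q K K G A D ..."
```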
Core Capabilities
- Secondary structure prediction (3-state: 81-87% accuracy; 8-state: 70-77% accuracy)
- Subcellular localization prediction (81% accuracy)
- Membrane protein prediction (91% accuracy)
- Feature extraction for downstream tasks (see the embedding-extraction sketch after this list)
- Fine-tuning capabilities for specific applications
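For feature extraction, the encoder half of the model is typically loaded on its own. The following is a minimal sketch using the Hugging Face `transformers` library; the checkpoint name `Rostlab/prot_t5_xl_uniref50` and the example sequence are assumptions, and the per-residue embeddings have a hidden size of 1024.

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed Hugging Face Hub checkpoint name for this model.
model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name).to(device).eval()

# Hypothetical input sequences, preprocessed as described above.
sequences = ["MSLEQKKGADIISKILQIQNSIGKTTSPSTLKTKLSEISRKEQENARIQSKL"]
sequences = [" ".join(re.sub(r"[UZOB]", "X", s.upper())) for s in sequences]

# Tokenize with padding; the tokenizer appends an end-of-sequence token.
batch = tokenizer(sequences, padding=True, return_tensors="pt").to(device)

with torch.no_grad():
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"])

# Per-residue embeddings, shape (batch, sequence_length, 1024).
per_residue = out.last_hidden_state
print(per_residue.shape)
```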
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its scale (3B parameters) and its ability to capture biophysical properties of proteins through self-supervised learning, effectively learning the "grammar" of protein sequences without human labeling.
Q: What are the recommended use cases?
The model excels at protein feature extraction and secondary structure prediction, and can be fine-tuned for specific downstream tasks in protein analysis. It is particularly effective when features are extracted from its encoder rather than its decoder; a per-protein pooling sketch follows below.
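Building on the extraction sketch above, a fixed-length per-protein embedding for downstream tasks (e.g. localization or membrane-protein prediction) is commonly obtained by mean-pooling the encoder states over the real residues. This is one simple pooling choice, not a prescribed recipe.

```python
# Mean-pool per-residue encoder states into one 1024-dimensional vector per protein.
# The attention mask marks real tokens (the trailing end-of-sequence token is kept
# here for brevity); padding positions are excluded from the average.
mask = batch["attention_mask"].unsqueeze(-1).float()             # (batch, seq_len, 1)
per_protein = (per_residue * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 1024)

# per_protein can now be fed to any lightweight downstream classifier or regressor.
print(per_protein.shape)
```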