prot_t5_xl_uniref50

Maintained By
Rostlab

ProtT5-XL-UniRef50

Property        Value
Architecture    T5-based (Modified T5-3B)
Parameters      3 billion
Training Data   UniRef50 (45M protein sequences)
Paper           Research Paper

What is prot_t5_xl_uniref50?

ProtT5-XL-UniRef50 is a protein language model based on the T5 architecture and designed for analyzing protein sequences. It was trained on 45 million protein sequences from UniRef50 with a modified masked language modeling objective, which lets it capture patterns and properties of protein sequences without labeled data.

Implementation Details

The model uses a BART-like MLM denoising objective that masks 15% of the amino acids during training. It was trained on a TPU Pod V2-256 for 991.5k steps, starting from the ProtT5-XL-BFD model as the initial checkpoint. The model accepts uppercase amino acids only, and the rare amino acids (U, Z, O, B) are mapped to 'X'.
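The input conventions and the 15% masking rate described above can be sketched in a few lines of plain Python. This is an illustration rather than the authors' training code: the function names, the `<mask>` placeholder, uppercasing the input on the caller's behalf, and separating residues with spaces (a common convention for per-residue tokenizers) are all assumptions made for the example.

```python
import random
import re

RARE_RESIDUES = re.compile(r"[UZOB]")


def preprocess(sequence: str) -> str:
    """Uppercase the sequence, map U/Z/O/B to X, and space-separate residues."""
    cleaned = RARE_RESIDUES.sub("X", sequence.upper())
    # Space separation is an assumption about the tokenizer input format.
    return " ".join(cleaned)


def mask_residues(tokens, mask_rate=0.15, seed=0, mask_token="<mask>"):
    """Replace about `mask_rate` of the tokens with a placeholder.

    `mask_token` is a hypothetical placeholder; a real T5 setup would use
    its own sentinel tokens instead.
    """
    if not tokens:
        return []
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in positions else tok for i, tok in enumerate(tokens)]


prepared = preprocess("mkvuzlaob")
print(prepared)  # M K V X X L A X X
print(" ".join(mask_residues(prepared.split())))
```

Fixing the seed keeps the masked positions reproducible, which is convenient when comparing runs; a training pipeline would normally draw fresh positions for every batch.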

  • Sequence length: 512 tokens
  • Batch size: 2,000
  • Optimizer: AdaFactor with an inverse square root learning rate schedule
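  • Training infrastructure: the TPU Pod V2-256 noted above; the figure of 936 nodes (5616 GPUs) appears to describe the GPU cluster used for other models in the same project rather than this model's training run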
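For readers unfamiliar with the inverse square root learning rate named in the optimizer settings, here is a minimal sketch in the form popularized by the original T5 recipe: the rate stays constant during warm-up and then decays as one over the square root of the step. The 10,000-step warm-up is an assumption borrowed from T5, not a value stated for this model.

```python
import math


def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Learning rate 1 / sqrt(max(step, warmup_steps)).

    `warmup_steps` is an assumed value; the source does not state it.
    """
    return 1.0 / math.sqrt(max(step, warmup_steps))


for step in (1, 10_000, 40_000, 991_500):
    print(step, inverse_sqrt_lr(step))
```

With these assumed settings the rate is flat at 0.01 for the first 10,000 steps and has halved by step 40,000.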

Core Capabilities

  • Secondary structure prediction (3-states: 81-87% accuracy, 8-states: 70-77% accuracy)
  • Subcellular localization prediction (81% accuracy)
  • Membrane protein prediction (91% accuracy)
  • Feature extraction for downstream tasks
  • Fine-tuning capabilities for specific applications
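The 3-state and 8-state figures above are per-residue accuracies, often called Q3 and Q8. The sketch below shows how such a score can be computed and why the 3-state number is never lower than the 8-state one for the same prediction. The 8-to-3 reduction used here (H, G, I to helix; E, B to strand; everything else to coil) is one common convention and an assumption, as are the function names and the toy labels; it is not necessarily the exact protocol behind the reported numbers.

```python
# One common reduction of DSSP 8-state labels to 3 states (an assumption).
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # strands
    "T": "C", "S": "C", "C": "C",  # turns, bends, coil
}


def to_three_state(labels: str) -> str:
    """Map 8-state labels to 3 states; unknown symbols such as '-' become coil."""
    return "".join(DSSP8_TO_3.get(label, "C") for label in labels)


def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Fraction of positions where the predicted label equals the observed one."""
    if not observed or len(predicted) != len(observed):
        raise ValueError("label strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


observed8 = "HHGGEEBTS-"   # toy labels, not real data
predicted8 = "HHHGEEETSC"

q8 = per_residue_accuracy(predicted8, observed8)
q3 = per_residue_accuracy(to_three_state(predicted8), to_three_state(observed8))
print(f"Q8 = {q8:.2f}, Q3 = {q3:.2f}")  # Q8 = 0.70, Q3 = 1.00
```

In this toy case every 8-state mistake falls within the same 3-state class, so the errors disappear after the reduction.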

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its scale (3B parameters) and its ability to capture biophysical properties of proteins through self-supervised learning, effectively learning the "grammar" of protein sequences without human labeling.

Q: What are the recommended use cases?

The model is best suited to protein feature extraction and secondary structure prediction, and it can be fine-tuned for specific downstream tasks in protein analysis. For feature extraction, embeddings taken from the encoder are recommended over decoder features.
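As an illustration of encoder-only feature extraction, the sketch below loads just the encoder half of the model with the Hugging Face transformers library and mean-pools the per-residue embeddings into one vector per protein. Several things here are assumptions: the Hub identifier `Rostlab/prot_t5_xl_uniref50`, the space-separated input format, the helper names, and mean pooling as the pooling strategy. Running it requires `torch`, `transformers`, and `sentencepiece`, and downloads several gigabytes of weights.

```python
import re

MODEL_ID = "Rostlab/prot_t5_xl_uniref50"  # assumed Hugging Face Hub identifier


def to_model_input(sequence: str) -> str:
    """Uppercase, map U/Z/O/B to X, and space-separate residues (assumed format)."""
    return " ".join(re.sub(r"[UZOB]", "X", sequence.upper()))


def embed(sequences, model_id: str = MODEL_ID):
    """Return one mean-pooled encoder embedding per sequence."""
    # Heavy dependencies are imported lazily so the helper above works without them.
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_id, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(model_id).eval()

    batch = tokenizer(
        [to_model_input(s) for s in sequences],
        padding="longest",
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        ).last_hidden_state  # shape: (batch, tokens, hidden_size)

    # Average over real tokens only; padding is excluded via the attention mask.
    # Note that the end-of-sequence token is included in this average.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


# Usage (downloads the model on first call):
# vectors = embed(["MKTAYIAKQR", "PRTEINO"])
# print(vectors.shape)
```

Loading only the encoder avoids holding the decoder weights in memory, which matters for a model of this size; the pooled vectors can then feed a small classifier for tasks such as localization prediction.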
