prot_t5_xl_uniref50

Rostlab

A protein language model trained on 45 million sequences from UniRef50. Based on the T5-3B architecture (3 billion parameters), it is intended for protein feature extraction and fine-tuning.

Property        Value
Architecture    T5-based (modified T5-3B)
Parameters      3 billion
Training Data   UniRef50 (45M protein sequences)
Paper           Research Paper

What is prot_t5_xl_uniref50?

ProtT5-XL-UniRef50 is a state-of-the-art protein language model based on the T5 architecture, specifically designed for understanding and analyzing protein sequences. Trained on 45 million protein sequences from UniRef50, this model employs a modified masked language modeling approach to capture the intricate patterns and properties of protein sequences.

Implementation Details

The model uses a BART-like MLM denoising objective, masking 15% of amino acids during training. It was trained for 991.5k steps on a TPU Pod V2-256, starting from the ProtT5-XL-BFD model as an initial checkpoint. The model expects uppercase amino acids only and maps the rare amino acids U, Z, O, and B to 'X'.
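The input conventions above (uppercase residues, rare amino acids mapped to 'X', one token per residue) can be sketched as a small preprocessing helper; the function name is illustrative, not part of the model's API:

```python
import re

def preprocess_sequence(seq: str) -> str:
    """Prepare a raw protein sequence for ProtT5-style tokenization.

    Mirrors the preprocessing described above: uppercase the sequence,
    map the rare amino acids U, Z, O, B to 'X', and insert spaces so
    each residue becomes its own token.
    """
    seq = seq.upper()
    seq = re.sub(r"[UZOB]", "X", seq)  # rare residues -> X
    return " ".join(seq)               # space-separate residues

print(preprocess_sequence("MktUvz"))  # -> "M K T X V X"
```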

  • Sequence length: 512 tokens
  • Batch size: 2,000
  • Optimizer: AdaFactor with inverse square root learning rate
  • Training infrastructure: 936 nodes (5616 GPUs)

Core Capabilities

  • Secondary structure prediction (3-state: 81-87% accuracy; 8-state: 70-77% accuracy)
  • Subcellular localization prediction (81% accuracy)
  • Membrane protein prediction (91% accuracy)
  • Feature extraction for downstream tasks
  • Fine-tuning capabilities for specific applications

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its scale (3B parameters) and its ability to capture biophysical properties of proteins through self-supervised learning, effectively learning the "grammar" of protein sequences without human labeling.

Q: What are the recommended use cases?

The model excels at protein feature extraction and secondary structure prediction, and can be fine-tuned for specific downstream tasks in protein analysis. For feature extraction, the encoder's embeddings are recommended over the decoder's.
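When the encoder is used for feature extraction, it produces one embedding vector per residue; a common next step is to mean-pool these into a single fixed-length protein vector for downstream classifiers. A minimal NumPy sketch, with the encoder output mocked as random vectors (ProtT5-XL's actual embedding dimension is 1024):

```python
import numpy as np

# Mock per-residue embeddings; the real encoder yields one
# 1024-dim vector per residue of the input protein.
rng = np.random.default_rng(0)
seq_len, dim = 7, 1024                    # hypothetical 7-residue protein
residue_emb = rng.normal(size=(seq_len, dim))

# Mean-pool over the residue axis to obtain one fixed-length
# protein embedding, independent of sequence length.
protein_emb = residue_emb.mean(axis=0)
print(protein_emb.shape)  # (1024,)
```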
