materials.smi-ted

Maintained By
ibm

SMI-TED: Chemical Language Foundation Model

Property         Value
License          Apache 2.0
Paper            arXiv:2407.20267
Training Data    91M SMILES molecules from PubChem
Model Variants   289M and 8x289M parameters

What is materials.smi-ted?

SMI-TED (SMILES-based Transformer Encoder-Decoder) is IBM's foundation model for chemical language processing. It learns molecular representations from 91 million SMILES samples from PubChem, equivalent to 4 billion molecular tokens (about 44 tokens per molecule on average).

Implementation Details

The model uses a transformer architecture with both encoder and decoder components. It is implemented in PyTorch and supports several weight formats, including safetensors. Training combines two strategies: masked language modeling for the encoder, and an encoder-decoder objective for SMILES reconstruction.
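To make the masked language modeling objective concrete, here is a minimal sketch of the masking step on a tokenized SMILES string. The mask symbol, the 15% masking rate, and the function name `mask_tokens` are illustrative assumptions, not details taken from the SMI-TED training code.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical mask symbol; the real vocabulary may differ


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for one masked-language-modeling example.

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a loss would only be computed on the masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Tokens of aspirin, CC(=O)Oc1ccccc1C(=O)O, split by hand for illustration.
tokens = ["C", "C", "(", "=", "O", ")", "O", "c", "1", "c", "c",
          "c", "c", "c", "1", "C", "(", "=", "O", ")", "O"]
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

During encoder training, a model would be asked to predict each entry of `labels` that is not `None` from the surrounding unmasked tokens.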

  • Pre-trained on canonicalized SMILES with a maximum length of 202 tokens
  • Uses both 95/5/0 and 100/0/0 training splits
  • Supports multiple molecular representations including SMILES, SELFIES, and 3D structures
  • Available in two variants: 289M and 8x289M parameters
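The 202-token limit above can be illustrated with a short sketch that splits a SMILES string into tokens and checks its length. The regular expression is a pattern widely used for SMILES tokenization; whether SMI-TED's tokenizer matches it exactly, and whether the limit counts special tokens, are assumptions here. Canonicalization itself needs a cheminformatics toolkit and is not shown.

```python
import re

# Widely used pattern for splitting SMILES into chemically meaningful tokens
# (bracket atoms, two-letter halogens, ring closures, bonds, ...).
# Assumption: the model's actual tokenizer may differ in detail.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

MAX_TOKENS = 202  # maximum length stated for the pre-training data


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)


def within_length_limit(smiles, max_tokens=MAX_TOKENS):
    """Return True if the tokenized SMILES fits within the length limit."""
    return len(tokenize(smiles)) <= max_tokens


print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin
print(tokenize("[Na+].[Cl-]"))             # bracket atoms stay whole
print(within_length_limit("C" * 300))      # a 300-carbon chain is too long
```

Counting tokens rather than characters matters because bracket atoms such as `[Na+]` and two-letter elements such as `Cl` each count as one token.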

Core Capabilities

  • Quantum property prediction
  • Molecular feature extraction
  • SMILES reconstruction and generation
  • Classification and regression tasks
  • State-of-the-art performance on MoleculeNet benchmarks
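Feature extraction followed by a regression task usually means computing a fixed embedding per molecule and fitting a small downstream model on top. The sketch below shows that workflow in pure Python. The `toy_embed` function is a hypothetical stand-in (character bigram counts) for the learned SMI-TED embeddings, and the target values are a toy quantity computed from the string itself (number of oxygen atoms), not experimental data.

```python
import math
from collections import Counter


def toy_embed(smiles, n=2):
    """Hypothetical stand-in for a learned embedding: character n-gram counts."""
    return Counter(smiles[i:i + n] for i in range(len(smiles) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_predict(query, train, k=2):
    """Predict a property as the similarity-weighted mean of the k nearest
    training molecules in embedding space."""
    q = toy_embed(query)
    scored = sorted(((cosine(q, toy_embed(s)), y) for s, y in train),
                    reverse=True)[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        return sum(y for _, y in scored) / len(scored)
    return sum(sim * y for sim, y in scored) / total


# Toy training set: the label is the oxygen count of each string.
train = [("CCO", 1), ("OCCO", 2), ("CCC", 0), ("OCC(O)CO", 3)]
print(knn_predict("CCCO", train, k=2))
```

With real embeddings from the model, the same pattern applies: only `toy_embed` would be replaced, while the downstream predictor stays small and cheap to train.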

Frequently Asked Questions

Q: What makes this model unique?

SMI-TED stands out for its pre-training on a curated dataset of 91M molecules and for its versatility across molecular representation tasks. Its two variants (289M and 8x289M parameters) let users choose a model size that fits their computational budget.
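As a rough guide to what the two variants mean for computational requirements, the sketch below estimates the memory needed just to hold the weights. It assumes the 8x289M variant stores eight full sets of 289M parameters and ignores activations, optimizer state, and any shared layers, so the numbers are ballpark figures only.

```python
# Bytes used per parameter for two common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gb(n_params, dtype="fp32"):
    """Approximate memory (in GB, 1 GB = 1e9 bytes) to store the weights."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


BASE_PARAMS = 289_000_000
# Assumption: the larger variant holds eight full copies of the base weights.
variants = {"289M": BASE_PARAMS, "8x289M": 8 * BASE_PARAMS}

for name, n_params in variants.items():
    fp32 = weight_memory_gb(n_params, "fp32")
    fp16 = weight_memory_gb(n_params, "fp16")
    print(f"{name}: ~{fp32:.2f} GB in fp32, ~{fp16:.2f} GB in fp16")
```

Under these assumptions the smaller variant needs a little over 1 GB for fp32 weights, while the larger one needs roughly eight times as much.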

Q: What are the recommended use cases?

The model is best suited to chemical property prediction, molecular feature extraction, and SMILES reconstruction. It is particularly useful for research in materials science, drug discovery, and chemical engineering, where accurate molecular representations and property predictions are crucial.
