SMI-TED: Chemical Language Foundation Model
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Paper | arXiv:2407.20267 |
| Training Data | 91M SMILES molecules from PubChem |
| Model Variants | 289M and 8x289M parameters |
What is materials.smi-ted?
SMI-TED (SMILES-based Transformer Encoder-Decoder) is IBM's foundation model for chemical language processing. It learns molecular representations from 91 million SMILES samples drawn from PubChem, equivalent to 4 billion molecular tokens.
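As a quick sanity check on the scale quoted above, the two figures imply an average sequence length per molecule. Both inputs are the rounded numbers from the text, so the result is only an estimate.

```python
# Rounded figures quoted in the text above.
num_molecules = 91_000_000        # SMILES samples from PubChem
num_tokens = 4_000_000_000        # molecular tokens in the corpus

# Average tokens per SMILES string implied by those two figures.
avg_tokens_per_molecule = num_tokens / num_molecules
print(f"about {avg_tokens_per_molecule:.0f} tokens per molecule")
```

The estimate comes out at roughly 44 tokens per molecule.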
Implementation Details
The model uses a transformer architecture with both encoder and decoder components. It is implemented in PyTorch and supports several weight formats, including safetensors. Pre-training combines two strategies: masked language modeling to train the encoder, and an encoder-decoder objective for SMILES reconstruction.
- Pre-trained on canonicalized SMILES with a maximum length of 202 tokens
- Uses both 95/5/0 and 100/0/0 data splits during pre-training
- Supports multiple molecular representations including SMILES, SELFIES, and 3D structures
- Available in two variants: 289M and 8x289M parameters
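The length constraint and the masked language modeling step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the model's actual preprocessing: the regular expression is a widely used atom-level SMILES pattern and may differ from SMI-TED's own tokenizer, and the `<mask>` token and the 15% masking rate are hypothetical placeholders borrowed from common practice.

```python
import random
import re
from typing import List, Optional, Tuple

# Atom-level SMILES pattern (ASSUMPTION: the real tokenizer may split differently).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

MAX_TOKENS = 202  # maximum sequence length stated in the text


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens; reject strings with unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize {smiles!r}")
    return tokens


def within_length(smiles: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the molecule fits the pre-training length limit."""
    return len(tokenize(smiles)) <= max_tokens


def mask_tokens(
    tokens: List[str],
    mask_prob: float = 0.15,       # ASSUMPTION: a conventional rate, not from the source
    mask_token: str = "<mask>",    # hypothetical placeholder token
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Replace random tokens with the mask token.

    Returns the masked sequence and, per position, the original token the
    encoder should recover (None where nothing was masked).
    """
    if rng is None:
        rng = random.Random()
    masked: List[str] = []
    targets: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
if within_length(aspirin):
    masked, targets = mask_tokens(tokenize(aspirin), rng=random.Random(0))
    print(masked)
```

Canonicalization is left out because it needs a cheminformatics toolkit; the sketch assumes its inputs are already canonical SMILES.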
Core Capabilities
- Quantum property prediction
- Molecular feature extraction
- SMILES reconstruction and generation
- Classification and regression tasks
- State-of-the-art performance reported on MoleculeNet benchmarks
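Feature extraction and classification typically fit together as follows: a frozen encoder turns each SMILES string into a vector, and a lightweight model is trained on those vectors. The sketch below illustrates that pattern with a nearest-neighbour classifier. The `toy_embed` function is a hypothetical stand-in that only counts characters; in practice it would be replaced by embeddings from the pre-trained encoder, whose loading API is not shown here.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def toy_embed(smiles: str) -> List[float]:
    """STAND-IN for a learned encoder: counts of 'C', 'N' and 'O' characters.

    Purely illustrative (it would miscount 'Cl', for example).
    """
    return [float(smiles.count(ch)) for ch in "CNO"]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_classify(
    embed: Callable[[str], Vector],
    train: Sequence[Tuple[str, str]],
    query: str,
    k: int = 3,
) -> str:
    """Label a molecule by majority vote among its k most similar training molecules."""
    q = embed(query)
    ranked = sorted(train, key=lambda item: cosine(embed(item[0]), q), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Tiny labelled set: ethanol, 1-propanol, ethylamine, propylamine.
train = [("CCO", "alcohol"), ("CCCO", "alcohol"), ("CCN", "amine"), ("CCCN", "amine")]
print(knn_classify(toy_embed, train, "CCCCO"))  # query: 1-butanol
```

Because the embedding function is passed in as an argument, swapping the stand-in for real encoder outputs leaves the downstream classifier unchanged.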
Frequently Asked Questions
Q: What makes this model unique?
SMI-TED is distinguished by pre-training on a curated dataset of 91M molecules and by handling a range of molecular representation tasks with a single model. It is available in two variants (289M and 8x289M parameters), so the choice can be matched to available compute.
Q: What are the recommended use cases?
The model is suited to chemical property prediction, molecular feature extraction, and SMILES reconstruction. It is particularly relevant to research in materials science, drug discovery, and chemical engineering, where accurate molecular representations and property predictions are crucial.