SMI-TED: Chemical Language Foundation Model

Property	Value
License	Apache 2.0
Paper	arXiv:2407.20267
Training Data	91M SMILES molecules from PubChem
Model Variants	289M and 8x289M parameters

What is materials.smi-ted?

SMI-TED (SMILES-based Transformer Encoder-Decoder) is IBM's foundation model for chemical language processing. It learns molecular representations from 91 million SMILES samples drawn from PubChem, equivalent to 4 billion molecular tokens.
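As a quick sanity check on the dataset figures above, the two numbers imply an average sequence length per molecule. Both inputs are rounded in the source, so the result is only approximate; the variable names are illustrative.

```python
# Rough average tokens per molecule implied by the stated dataset size.
# Both inputs are the rounded figures quoted in the text.
NUM_MOLECULES = 91_000_000      # 91M SMILES samples
NUM_TOKENS = 4_000_000_000      # 4B molecular tokens

avg_tokens_per_molecule = NUM_TOKENS / NUM_MOLECULES

print(f"about {avg_tokens_per_molecule:.0f} tokens per molecule")
```

An average in the mid-forties sits well below the 202-token maximum used in pre-training, so most molecules need no truncation.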

Implementation Details

The model uses a transformer architecture with both encoder and decoder components. It is implemented in PyTorch and supports several weight formats, including safetensors. Training combines two objectives: masked language modeling for the encoder, and an encoder-decoder stage that reconstructs SMILES strings.

Pre-trained on canonicalized SMILES with a maximum length of 202 tokens
Uses both 95/5/0 and 100/0/0 training splits
Supports multiple molecular representations including SMILES, SELFIES, and 3D structures
Available in two variants: 289M and 8x289M parameters
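
The masked language modeling objective and the 202-token limit described above can be illustrated with a short sketch. Everything here is an assumption made for illustration: the regex is a pattern widely used for SMILES tokenization and is not necessarily the tokenizer SMI-TED ships with, the 15% masking rate and the `<mask>` string are conventional BERT-style choices, and the function names are hypothetical. Whether the 202-token limit counts special tokens is not stated in the source, so the length check below counts only the SMILES tokens themselves.

```python
import random
import re

# Assumed tokenizer: a regex commonly used for SMILES (bracket atoms,
# two-letter halogens, ring-closure digits, bonds, branches).
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

MAX_LEN = 202      # maximum sequence length quoted in the text
MASK = "<mask>"    # illustrative mask symbol


def tokenize(smiles):
    """Split a SMILES string into tokens; fail loudly on unknown characters."""
    tokens = SMILES_TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize {smiles!r} completely")
    return tokens


def fits_max_len(tokens, max_len=MAX_LEN):
    """True if the token sequence is within the pre-training length limit."""
    return len(tokens) <= max_len


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace each token with MASK with probability mask_prob.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere, so a loss would only be computed
    where the model has to recover a hidden token.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    aspirin = tokenize("CC(=O)Oc1ccccc1C(=O)O")
    masked, labels = mask_tokens(aspirin, rng=random.Random(0))
    print(fits_max_len(aspirin), masked)
```

In training, the encoder would see the masked sequence and be scored on predicting the tokens stored in `labels`; the reconstruction stage would instead ask the decoder to reproduce the full, unmasked SMILES string.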

Core Capabilities

Quantum property prediction
Molecular feature extraction
SMILES reconstruction and generation
Classification and regression tasks
State-of-the-art performance on MoleculeNet benchmarks
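
Feature extraction followed by classification or regression usually follows one pattern: embed each molecule once with the frozen model, then fit a lightweight predictor on the embeddings. The sketch below shows only that pattern. The `embed` function is a hypothetical stand-in that counts characters; it is not SMI-TED and does not use its API. In practice it would be replaced by the model's encoder output. The nearest-neighbour classifier and all names are likewise illustrative choices.

```python
import math
from collections import Counter

# Hypothetical stand-in for a learned encoder: counts a few SMILES characters.
# Replace with real model embeddings in practice.
VOCAB = "CNOcno1()="


def embed(smiles):
    """Map a SMILES string to a fixed-length feature vector (toy version)."""
    counts = Counter(smiles)
    return [float(counts[ch]) for ch in VOCAB]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(query, train):
    """1-nearest-neighbour classification over embeddings."""
    q = embed(query)
    _, label = max(train, key=lambda item: cosine(q, embed(item[0])))
    return label


# Small labelled set; in SMILES, aromatic atoms are written in lowercase.
TRAIN = [
    ("c1ccccc1", "aromatic"),    # benzene
    ("c1ccncc1", "aromatic"),    # pyridine
    ("CCCCCC", "aliphatic"),     # hexane
    ("CCO", "aliphatic"),        # ethanol
]

if __name__ == "__main__":
    print(predict("Cc1ccccc1", TRAIN))   # toluene
    print(predict("CCCO", TRAIN))        # 1-propanol
```

Keeping the encoder frozen and training only a small head is the cheapest way to reuse a foundation model; fine-tuning the full model is the heavier alternative when enough labelled data is available.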

Frequently Asked Questions

Q: What makes this model unique?

SMI-TED combines pre-training on a curated dataset of 91M molecules with support for a range of molecular representation tasks, from feature extraction to SMILES reconstruction. It is available in two sizes (289M and 8x289M parameters), so the variant can be chosen to match the available compute.

Q: What are the recommended use cases?

The model is suited to chemical property prediction, molecular feature extraction, and SMILES reconstruction. It is a good fit for research in materials science, drug discovery, and chemical engineering, where accurate molecular representations and property predictions are essential.
