# SqueezeBERT-Uncased
| Property | Value |
|---|---|
| License | BSD |
| Paper | SqueezeBERT Paper |
| Training Data | BookCorpus, Wikipedia |
| Architecture | BERT-based with grouped convolutions |
## What is squeezebert-uncased?
SqueezeBERT-uncased is a transformer model that adapts BERT's architecture for efficient inference on mobile devices. It keeps BERT's overall structure but replaces the position-wise fully-connected layers with grouped convolutions, making inference significantly faster: 4.3x faster than bert-base-uncased on a Google Pixel 3 smartphone.
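As a quick orientation, here is a minimal usage sketch with the Hugging Face transformers library. It assumes the checkpoint is published under the `squeezebert/squeezebert-uncased` hub ID and uses the fill-mask pipeline, which the MLM pretraining objective supports out of the box.

```python
# Minimal sketch: fill-mask inference, assuming the
# "squeezebert/squeezebert-uncased" Hugging Face hub ID.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="squeezebert/squeezebert-uncased")
# The MLM pretraining objective lets the model score candidates for [MASK] directly.
print(unmasker("The capital of France is [MASK]."))
```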
## Implementation Details
The model is pretrained with the LAMB optimizer, using a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Training runs for 56k steps at a maximum sequence length of 128, followed by 6k steps at a sequence length of 512. Pretraining uses the Masked Language Modeling (MLM) and Sentence Order Prediction (SOP) objectives.
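For reference, the two-phase schedule above can be summarized as follows. This is an illustrative dict, not the authors' training configuration; the key names are hypothetical.

```python
# Illustrative summary of the pretraining recipe described above
# (hypothetical structure, not taken from the authors' training code).
pretraining_config = {
    "optimizer": "LAMB",
    "global_batch_size": 8192,
    "learning_rate": 2.5e-3,
    "warmup_proportion": 0.28,
    "phases": [
        {"steps": 56_000, "max_seq_length": 128},  # phase 1: short sequences
        {"steps": 6_000, "max_seq_length": 512},   # phase 2: long sequences
    ],
    "objectives": ["MLM", "SOP"],
}
```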
- Case-insensitive tokenization (see the check after this list)
- Efficient grouped convolution architecture
- Optimized for mobile deployment
- No distillation used in pretraining
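A quick way to confirm the case-insensitive behavior, assuming the standard `AutoTokenizer` and the same hub ID as above:

```python
# Case-insensitivity check: an uncased tokenizer lowercases input, so
# differently-cased strings produce identical token sequences.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-uncased")
print(tokenizer.tokenize("Hello World") == tokenizer.tokenize("hello world"))  # True
```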
## Core Capabilities
- Fast inference on mobile devices
- Text classification tasks (see the fine-tuning sketch after this list)
- Masked language modeling
- Sentence order prediction
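Below is a hedged sketch of setting the model up for text classification. `num_labels=2` and the example sentence are placeholders for an assumed downstream task, and the freshly initialized classification head must be fine-tuned before its outputs are meaningful.

```python
# Sketch: attaching a classification head to SqueezeBERT for fine-tuning.
# num_labels=2 is a placeholder for an assumed binary task.
from transformers import AutoTokenizer, SqueezeBertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-uncased")
model = SqueezeBertForSequenceClassification.from_pretrained(
    "squeezebert/squeezebert-uncased", num_labels=2
)
inputs = tokenizer("This phone is blazingly fast.", return_tensors="pt")
logits = model(**inputs).logits  # head is untrained: fine-tune before relying on this
```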
## Frequently Asked Questions
Q: What makes this model unique?
SqueezeBERT's main innovation is its use of grouped convolutions instead of fully-connected layers, making it significantly more efficient for mobile deployment while maintaining BERT-like performance.
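To make the efficiency argument concrete, the toy comparison below (not the authors' code) contrasts the parameter count of a position-wise fully-connected layer with a grouped 1D convolution of the same width. The hidden size of 768 matches BERT-base; `groups=4` is chosen for illustration.

```python
# Toy comparison: a position-wise fully-connected layer vs. a grouped
# 1D convolution of the same width (illustrative, not SqueezeBERT's code).
import torch
import torch.nn as nn

hidden, groups, seq_len = 768, 4, 128
fc = nn.Linear(hidden, hidden)                                   # BERT-style FC layer
gconv = nn.Conv1d(hidden, hidden, kernel_size=1, groups=groups)  # grouped convolution

x = torch.randn(1, seq_len, hidden)
y_fc = fc(x)                                         # shape (1, seq_len, hidden)
y_gc = gconv(x.transpose(1, 2)).transpose(1, 2)      # same shape, fewer parameters

print(sum(p.numel() for p in fc.parameters()))     # 590,592
print(sum(p.numel() for p in gconv.parameters()))  # 148,224
```

Because each group mixes only a fraction of the channels, both the parameter count and the arithmetic per token shrink roughly by the group count, which is where the mobile speedup comes from.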
Q: What are the recommended use cases?
The model is particularly well-suited for mobile applications requiring BERT-like capabilities. For text classification tasks, it's recommended to use the squeezebert-mnli-headless variant as a starting point.
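Following that recommendation, here is a sketch of loading the headless MNLI variant as a starting point for a new classification task. The `squeezebert/squeezebert-mnli-headless` hub ID and `num_labels=3` are assumptions for illustration; since the variant ships without a task head, `from_pretrained` initializes a fresh classification head that needs fine-tuning.

```python
# Sketch: starting a new classification task from the MNLI-pretrained,
# headless checkpoint (num_labels=3 is a placeholder for your label count).
from transformers import AutoTokenizer, SqueezeBertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-mnli-headless")
model = SqueezeBertForSequenceClassification.from_pretrained(
    "squeezebert/squeezebert-mnli-headless", num_labels=3
)
# Fine-tune `model` on your task data before use.
```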