# SqueezeBERT-Uncased
| Property | Value |
|---|---|
| License | BSD |
| Paper | SqueezeBERT Paper |
| Training Data | BookCorpus, Wikipedia |
| Architecture | BERT-based with grouped convolutions |
## What is squeezebert-uncased?
SqueezeBERT-uncased is a transformer model that adapts BERT's architecture for efficient inference on mobile devices. It keeps BERT's overall structure but replaces the position-wise fully-connected layers with grouped convolutions, making inference significantly faster: 4.3x faster than bert-base-uncased on a Google Pixel 3 smartphone.
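As a quick orientation, here is a minimal usage sketch with the Hugging Face transformers library. It assumes the checkpoint is published under the `squeezebert/squeezebert-uncased` hub ID and uses the fill-mask pipeline, which the MLM pretraining objective supports out of the box.

```python
# Minimal sketch: fill-mask inference, assuming the
# "squeezebert/squeezebert-uncased" Hugging Face hub ID.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="squeezebert/squeezebert-uncased")
# The MLM pretraining objective lets the model score candidates for [MASK] directly.
print(unmasker("The capital of France is [MASK]."))
```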
## Implementation Details
The model is pretrained with the LAMB optimizer, using a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Training runs for 56k steps at a maximum sequence length of 128, followed by 6k steps at a sequence length of 512. Pretraining uses the Masked Language Modeling (MLM) and Sentence Order Prediction (SOP) objectives.
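For reference, the two-phase schedule above can be summarized as follows. This is an illustrative dict, not the authors' training configuration; the key names are hypothetical.

```python
# Illustrative summary of the pretraining recipe described above
# (hypothetical structure, not taken from the authors' training code).
pretraining_config = {
    "optimizer": "LAMB",
    "global_batch_size": 8192,
    "learning_rate": 2.5e-3,
    "warmup_proportion": 0.28,
    "phases": [
        {"steps": 56_000, "max_seq_length": 128},  # phase 1: short sequences
        {"steps": 6_000, "max_seq_length": 512},   # phase 2: long sequences
    ],
    "objectives": ["MLM", "SOP"],
}
```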
- Case-insensitive tokenization (see the check after this list)
- Efficient grouped convolution architecture
- Optimized for mobile deployment
- No distillation used in pretraining
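A quick way to confirm the case-insensitive behavior, assuming the standard `AutoTokenizer` and the same hub ID as above:

```python
# Case-insensitivity check: an uncased tokenizer lowercases input, so
# differently-cased strings produce identical token sequences.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-uncased")
print(tokenizer.tokenize("Hello World") == tokenizer.tokenize("hello world"))  # True
```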
## Core Capabilities
- Fast inference on mobile devices
- Text classification tasks (see the fine-tuning sketch after this list)
- Masked language modeling
- Sentence order prediction
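Below is a hedged sketch of setting the model up for text classification. `num_labels=2` and the example sentence are placeholders for an assumed downstream task, and the freshly initialized classification head must be fine-tuned before its outputs are meaningful.

```python
# Sketch: attaching a classification head to SqueezeBERT for fine-tuning.
# num_labels=2 is a placeholder for an assumed binary task.
from transformers import AutoTokenizer, SqueezeBertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-uncased")
model = SqueezeBertForSequenceClassification.from_pretrained(
    "squeezebert/squeezebert-uncased", num_labels=2
)
inputs = tokenizer("This phone is blazingly fast.", return_tensors="pt")
logits = model(**inputs).logits  # head is untrained: fine-tune before relying on this
```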
## Frequently Asked Questions
Q: What makes this model unique?
SqueezeBERT's main innovation is its use of grouped convolutions instead of fully-connected layers, making it significantly more efficient for mobile deployment while maintaining BERT-like performance.
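To make the efficiency argument concrete, the toy comparison below (not the authors' code) contrasts the parameter count of a position-wise fully-connected layer with a grouped 1D convolution of the same width. The hidden size of 768 matches BERT-base; `groups=4` is chosen for illustration.

```python
# Toy comparison: a position-wise fully-connected layer vs. a grouped
# 1D convolution of the same width (illustrative, not SqueezeBERT's code).
import torch
import torch.nn as nn

hidden, groups, seq_len = 768, 4, 128
fc = nn.Linear(hidden, hidden)                                   # BERT-style FC layer
gconv = nn.Conv1d(hidden, hidden, kernel_size=1, groups=groups)  # grouped convolution

x = torch.randn(1, seq_len, hidden)
y_fc = fc(x)                                         # shape (1, seq_len, hidden)
y_gc = gconv(x.transpose(1, 2)).transpose(1, 2)      # same shape, fewer parameters

print(sum(p.numel() for p in fc.parameters()))     # 590,592
print(sum(p.numel() for p in gconv.parameters()))  # 148,224
```

Because each group mixes only a fraction of the channels, both the parameter count and the arithmetic per token shrink roughly by the group count, which is where the mobile speedup comes from.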
Q: What are the recommended use cases?
The model is particularly well-suited for mobile applications requiring BERT-like capabilities. For text classification tasks, it's recommended to use the squeezebert-mnli-headless variant as a starting point.
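Following that recommendation, here is a sketch of loading the headless MNLI variant as a starting point for a new classification task. The `squeezebert/squeezebert-mnli-headless` hub ID and `num_labels=3` are assumptions for illustration; since the variant ships without a task head, `from_pretrained` initializes a fresh classification head that needs fine-tuning.

```python
# Sketch: starting a new classification task from the MNLI-pretrained,
# headless checkpoint (num_labels=3 is a placeholder for your label count).
from transformers import AutoTokenizer, SqueezeBertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-mnli-headless")
model = SqueezeBertForSequenceClassification.from_pretrained(
    "squeezebert/squeezebert-mnli-headless", num_labels=3
)
# Fine-tune `model` on your task data before use.
```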