# XtremeDistil-l6-h384-uncased
| Property | Value |
|---|---|
| Parameter Count | 22 million |
| Speedup vs BERT-base | 5.3x |
| Architecture | 6 layers, 384 hidden size, 12 attention heads |
| Paper | arXiv:2106.04563 |
## What is xtremedistil-l6-h384-uncased?
XtremeDistil-l6-h384-uncased is a compact transformer model from Microsoft that combines task transfer with multi-stage distillation to produce a task-agnostic student model. It retains most of BERT-base's accuracy at roughly a fifth of the parameter count and performs strongly across a range of NLP tasks.
## Implementation Details
The model uses a compressed architecture with 6 transformer layers, a 384-dimensional hidden size, and 12 attention heads. It draws on distillation techniques from both XtremeDistil and MiniLM, and scores 86.6% on average across the GLUE benchmark tasks and SQuAD-2. Key figures (a minimal loading sketch follows the list below):
- 22M parameters (vs 109M in BERT-base)
- 5.3x inference speedup
- Tested compatibility with TensorFlow 2.3.1, Transformers 4.1.1, PyTorch 1.6.0
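For context, here is a minimal loading sketch using the Hugging Face Transformers library. The checkpoint identifier `microsoft/xtremedistil-l6-h384-uncased` is the model's Hub name; the input sentence is purely illustrative.

```python
# Minimal sketch: load the checkpoint and run one forward pass.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "microsoft/xtremedistil-l6-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("XtremeDistil is a compact transformer.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states are (batch, sequence_length, 384), matching the
# 384-dimensional hidden size of the 6-layer encoder.
print(outputs.last_hidden_state.shape)
```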
## Core Capabilities
- Strong performance on MNLI (85.4%), QNLI (90.3%), and QQP (91.0%)
- Exceptional results on RTE (80.9%) and MRPC (90.0%)
- Robust SQuAD-2 performance (76.6%)
- Task-agnostic design suitable for various NLP applications (a fine-tuning sketch follows this list)
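As an illustration of adapting the model to one of these tasks, the sketch below attaches a sequence-classification head and runs a single training step. The example texts, labels, label count, and learning rate are invented for demonstration and are not from the model card.

```python
# Hypothetical fine-tuning sketch: binary sentiment classification,
# one optimizer step on a toy batch of two sentences.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/xtremedistil-l6-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["the movie was great", "the movie was terrible"]  # invented examples
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
optimizer.zero_grad()
outputs = model(**batch, labels=labels)  # loss is computed internally
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```

In practice you would train over a full dataset (for example with the `Trainer` API), but the forward/backward pattern is the same.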
## Frequently Asked Questions
Q: What makes this model unique?
A: Its task-agnostic distillation approach maintains high accuracy while achieving a roughly 5x size reduction versus BERT-base, striking a strong balance between efficiency and accuracy.
Q: What are the recommended use cases?
A: The model is well suited to production environments where computational efficiency matters but accuracy must stay high. It handles a range of NLP tasks, including natural language inference, question answering, and text classification.
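When evaluating it for such a deployment, a rough latency check like the sketch below can help. The batch size, repeat count, and any timings it prints are machine-dependent assumptions, not figures from the model card (the 5.3x speedup was measured under Microsoft's own benchmark setup).

```python
# Rough latency sketch: time repeated forward passes on a fixed batch.
import time
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/xtremedistil-l6-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

batch = tokenizer(["a short benchmark sentence"] * 32, padding=True, return_tensors="pt")
with torch.no_grad():
    model(**batch)  # warm-up pass
    start = time.perf_counter()
    for _ in range(10):
        model(**batch)
print(f"avg latency: {(time.perf_counter() - start) / 10:.3f}s per batch of 32")
```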