# XtremeDistil-l6-h384-uncased
| Property | Value |
|---|---|
| Parameter Count | 22 million |
| Speedup vs BERT-base | 5.3x |
| Architecture | 6 layers, 384 hidden size, 12 attention heads |
| Paper | arXiv:2106.04563 |
## What is xtremedistil-l6-h384-uncased?
XtremeDistil-l6-h384-uncased is a compact transformer model from Microsoft that combines task transfer with multi-stage distillation to produce a task-agnostic student model. It retains most of BERT-base's accuracy at roughly a fifth of the parameter count and performs strongly across a range of NLP tasks.
## Implementation Details
The model uses a compressed architecture with 6 transformer layers, a 384-dimensional hidden size, and 12 attention heads. It draws on distillation techniques from both XtremeDistil and MiniLM, and scores 86.6% on average across the GLUE benchmark tasks and SQuAD-2. Key figures (a minimal loading sketch follows the list below):
- 22M parameters (vs 109M in BERT-base)
- 5.3x inference speedup
- Tested compatibility with TensorFlow 2.3.1, Transformers 4.1.1, PyTorch 1.6.0
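For context, here is a minimal loading sketch using the Hugging Face Transformers library. The checkpoint identifier `microsoft/xtremedistil-l6-h384-uncased` is the model's Hub name; the input sentence is purely illustrative.

```python
# Minimal sketch: load the checkpoint and run one forward pass.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "microsoft/xtremedistil-l6-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("XtremeDistil is a compact transformer.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states are (batch, sequence_length, 384), matching the
# 384-dimensional hidden size of the 6-layer encoder.
print(outputs.last_hidden_state.shape)
```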
## Core Capabilities
- Strong performance on MNLI (85.4%), QNLI (90.3%), and QQP (91.0%)
- Exceptional results on RTE (80.9%) and MRPC (90.0%)
- Robust SQuAD-2 performance (76.6%)
- Task-agnostic design suitable for various NLP applications (a fine-tuning sketch follows this list)
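As an illustration of adapting the model to one of these tasks, the sketch below attaches a sequence-classification head and runs a single training step. The example texts, labels, label count, and learning rate are invented for demonstration and are not from the model card.

```python
# Hypothetical fine-tuning sketch: binary sentiment classification,
# one optimizer step on a toy batch of two sentences.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/xtremedistil-l6-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["the movie was great", "the movie was terrible"]  # invented examples
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
optimizer.zero_grad()
outputs = model(**batch, labels=labels)  # loss is computed internally
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```

In practice you would train over a full dataset (for example with the `Trainer` API), but the forward/backward pattern is the same.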
## Frequently Asked Questions
Q: What makes this model unique?
A: Its task-agnostic distillation approach maintains high accuracy while achieving a roughly 5x size reduction versus BERT-base, striking a strong balance between efficiency and accuracy.
Q: What are the recommended use cases?
A: The model is well suited to production environments where computational efficiency matters but accuracy must stay high. It handles a range of NLP tasks, including natural language inference, question answering, and text classification.
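When evaluating it for such a deployment, a rough latency check like the sketch below can help. The batch size, repeat count, and any timings it prints are machine-dependent assumptions, not figures from the model card (the 5.3x speedup was measured under Microsoft's own benchmark setup).

```python
# Rough latency sketch: time repeated forward passes on a fixed batch.
import time
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/xtremedistil-l6-h384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

batch = tokenizer(["a short benchmark sentence"] * 32, padding=True, return_tensors="pt")
with torch.no_grad():
    model(**batch)  # warm-up pass
    start = time.perf_counter()
    for _ in range(10):
        model(**batch)
print(f"avg latency: {(time.perf_counter() - start) / 10:.3f}s per batch of 32")
```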