DeepSeek-R1-Distill-Llama-3B
| Property | Value |
|---|---|
| Base Model | Llama-3.2-3B |
| Model Type | AutoModelForCausalLM |
| Context Length | 2048 tokens |
| Hugging Face | Link |
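Based on the spec table above, loading the model follows the standard `AutoModelForCausalLM` pattern. This is a minimal sketch: the Hugging Face link in the card is elided, so `REPO_ID` is a placeholder, and the 4-bit path assumes bitsandbytes is installed.

```python
# Sketch of loading the model with Hugging Face Transformers.
# REPO_ID is a placeholder -- the card's Hugging Face link is elided,
# so the exact repository name is an assumption.
REPO_ID = "DeepSeek-R1-Distill-Llama-3B"  # placeholder, not verified
MAX_CONTEXT = 2048  # context length from the spec table

def load_distilled_model(repo_id=REPO_ID, four_bit=False):
    """Return (tokenizer, model). Imports are kept inside the function
    so the sketch can be read without transformers installed."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    kwargs = {"device_map": "auto"}
    if four_bit:
        # 4-bit weights via bitsandbytes, matching the quantization
        # support listed under Implementation Details
        kwargs["load_in_4bit"] = True
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, **kwargs)
    return tokenizer, model
```

Prompts longer than `MAX_CONTEXT` tokens must be truncated before generation.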
What is DeepSeek-R1-Distill-Llama-3B?
DeepSeek-R1-Distill-Llama-3B is a distilled version of the DeepSeek-R1 model, built on the Llama-3.2-3B architecture. It is designed to retain the reasoning capabilities of the much larger DeepSeek-R1 while cutting computational requirements through distillation.
Implementation Details
The model was trained with several technical optimizations: LoRA fine-tuning (r=16, alpha=32), flash attention, and gradient checkpointing for memory-efficient training. It uses a chat template compatible with Llama 3 formatting and custom tokenization with dedicated special tokens.
- Optimized using paged_adamw_8bit optimizer
- Supports both 8-bit and 4-bit quantization
- Features custom system prompts with thinking tags
- Implements cosine learning rate scheduling
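Two of the training details above are easy to make concrete in pure Python: the cosine learning rate schedule, and the LoRA scaling factor implied by r=16, alpha=32. The peak learning rate below is an illustrative value, not one stated in the card.

```python
import math

def cosine_lr(step, total_steps, max_lr=2e-4, min_lr=0.0):
    """Cosine decay from max_lr at step 0 to min_lr at total_steps.
    max_lr=2e-4 is an illustrative assumption, not from the card."""
    progress = step / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# LoRA applies its low-rank update scaled by alpha / r,
# so r=16, alpha=32 gives a scaling factor of 2.0.
LORA_R, LORA_ALPHA = 16, 32
lora_scale = LORA_ALPHA / LORA_R
```

The schedule starts at `max_lr`, decays smoothly, and reaches `min_lr` exactly at the final step, which is why cosine scheduling pairs well with a fixed training budget.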
Core Capabilities
- Strong performance on IFEval with 70.93% accuracy
- Balanced performance across multiple benchmarks (23.27 average score)
- Specialized reasoning capabilities with structured output format
- Efficient inference with support for various precision levels
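The precision levels listed above translate directly into weight-memory requirements. A rough back-of-the-envelope sketch (weights only, ignoring activations and KV cache; the 3.2B parameter count is approximate):

```python
PARAMS = 3.2e9  # approximate parameter count of Llama-3.2-3B

def weight_footprint_gb(bits):
    """Approximate memory for model weights alone at the given
    precision; excludes activations and KV cache."""
    return PARAMS * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_footprint_gb(bits):.1f} GB")
```

So 8-bit roughly halves and 4-bit roughly quarters the fp16 footprint, which is what makes the quantized variants practical on consumer GPUs.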
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its structured reasoning format: explicit think tags delimit its thought process in the output. Combined with distillation from the larger DeepSeek-R1, this lets it preserve strong reasoning performance at a much smaller size.
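Downstream code typically wants to separate that delimited reasoning from the final answer. A minimal sketch, assuming the reasoning is wrapped in literal `<think>...</think>` tags (the exact tag names are an assumption based on the card's description):

```python
import re

# Assumes <think>...</think> delimiters; adjust if the model's
# actual special tokens differ.
THINK_RE = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)

def split_reasoning(text):
    """Split a completion into (reasoning, answer).
    Returns (None, text) when no think tags are present."""
    match = THINK_RE.search(text)
    if not match:
        return None, text.strip()
    return match.group(1).strip(), match.group(2).strip()
```

For example, `split_reasoning("<think>2+2=4</think>The answer is 4.")` yields `("2+2=4", "The answer is 4.")`.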
Q: What are the recommended use cases?
This model is particularly well-suited for applications requiring structured reasoning, mathematical comparisons, and general language understanding tasks where computational efficiency is important.