Llama-3.2-3B-Instruct-ONNX
Property | Value |
---|---|
License | MIT + Llama 3.2 Community License |
Supported Languages | 8 (en, de, fr, it, pt, hi, es, th) |
Framework | ONNX Runtime |
Developed By | Microsoft (ONNX Runtime team) |
What is Llama-3.2-3B-Instruct-ONNX?
Llama-3.2-3B-Instruct-ONNX is an optimized version of Meta's Llama-3.2-3B-Instruct model, converted to ONNX format for faster inference. It delivers significant speedups, reportedly up to 39x over PyTorch compile on A100 GPUs and 1.25x over llama.cpp on standard CPU configurations.
Implementation Details
The model has been optimized for both CPU and GPU deployment, using int4 quantization via RTN (round-to-nearest). It supports a range of hardware configurations, including A100 GPUs, standard CPU environments, and AMD processors.
- Supports both CPU and GPU execution with int4 quantization (see the provider-selection sketch after this list)
- Optimized for DirectX 12-capable GPUs with 4GB+ RAM
- CUDA support for NVIDIA GPUs with Compute Capability >= 7.0
- Comprehensive multilingual support across 8 languages
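To make the CPU/GPU point concrete, here is a minimal provider-selection sketch using the onnxruntime Python package; the model path is a placeholder, and for end-to-end text generation the Generate() API covered below is the more convenient route.

```python
# Sketch: choose an execution provider for CPU vs. GPU inference.
# NOTE: "model.onnx" is a placeholder path, not a confirmed filename.
import onnxruntime as ort

available = ort.get_available_providers()
print("Available providers:", available)

# Prefer CUDA (NVIDIA GPUs, compute capability >= 7.0) and fall back
# to CPU when the CUDA provider is not installed on this machine.
providers = [
    p for p in ("CUDAExecutionProvider", "CPUExecutionProvider")
    if p in available
]

session = ort.InferenceSession("model.onnx", providers=providers)
print("Active providers:", session.get_providers())
```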
Core Capabilities
- High-performance text generation and instruction following
- Efficient inference across various hardware platforms
- Integrated with the ONNX Runtime Generate() API for easy deployment (see the sketch after this list)
- Supports both mobile and server-side deployment
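For context on the Generate() API mentioned above, the following is a minimal streaming-generation sketch using the onnxruntime-genai Python package. The model folder path is a placeholder, and the method names follow recent onnxruntime-genai releases; they may differ slightly in older versions.

```python
# Sketch: streaming text generation with the ONNX Runtime Generate() API.
# NOTE: "path/to/model-dir" is a placeholder for a downloaded ONNX model
# folder (model weights plus genai_config.json), not a confirmed path.
import onnxruntime_genai as og

model = og.Model("path/to/model-dir")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# The Llama 3.2 chat template is omitted here for brevity; apply it to
# the prompt in real use for best instruction-following quality.
prompt = "Explain ONNX Runtime in one sentence."

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

# Generate and decode tokens one at a time as they are produced.
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```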
Frequently Asked Questions
Q: What makes this model unique?
A: This model stands out for its optimized performance through ONNX Runtime integration, offering significant speed improvements over standard implementations while maintaining model quality through careful quantization.
Q: What are the recommended use cases?
A: The model is ideal for applications requiring efficient text generation and instruction following, particularly in production environments where performance optimization is crucial. It's especially suitable for deployment across various hardware configurations, from mobile devices to high-performance servers.