# Meta-Llama-3.2-3B-Instruct-ONNX-INT4
| Property | Value |
|---|---|
| Base Model | Meta-Llama/Llama-3.2-3B-Instruct |
| License | NVIDIA Open Model License |
| Supported Languages | English, German, French, Italian, Portuguese, Hindi, Spanish, Thai |
| Hardware Requirements | NVIDIA Ampere and newer GPUs (6GB+ VRAM) |
## What is Meta-Llama-3.2-3B-Instruct-ONNX-INT4?
This is a quantized version of Meta's Llama-3.2-3B-Instruct model, optimized with NVIDIA's TensorRT Model Optimizer. The weights have been converted to INT4 precision to reduce the memory footprint while preserving most of the base model's accuracy, making it well suited to efficient deployment on NVIDIA GPUs.
## Implementation Details
The model uses an optimized transformer architecture and was converted in stages: from PyTorch bfloat16 to ONNX FP16, then to ONNX INT4 using AWQ quantization. It is designed specifically for Windows environments and uses the Onnxruntime-GenAI-DirectML backend for inference.
- Architecture: Auto-regressive transformer-based language model
- Quantization: INT4 precision using AWQ
- MMLU Accuracy: 57.71% (5-shot)
- Inference Backend: Onnxruntime-GenAI-DirectML
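The core idea behind INT4 quantization can be sketched in a few lines. The example below is a simplified, illustrative stand-in: it shows symmetric group-wise 4-bit quantization, whereas real AWQ additionally rescales weight channels based on activation statistics before quantizing. The group size of 128 and the helper names are assumptions, not details from the model's actual conversion pipeline.

```python
# Simplified sketch of group-wise INT4 weight quantization.
# (Real AWQ also applies activation-aware per-channel scaling
# before this step; that part is omitted here.)

def quantize_int4(weights, group_size=128):
    """Quantize a flat list of floats to signed INT4 values per group.

    Returns (qvals, scales): qvals in [-8, 7], one float scale per group.
    """
    qvals, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid div-by-zero
        scales.append(scale)
        qvals.extend(max(-8, min(7, round(w / scale))) for w in group)
    return qvals, scales

def dequantize_int4(qvals, scales, group_size=128):
    """Recover approximate float weights from INT4 values and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qvals)]

# Round-trip a toy weight vector and check the reconstruction error.
weights = [0.03 * ((-1) ** i) * (i % 11) for i in range(256)]
q, s = quantize_int4(weights)
recovered = dequantize_int4(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
print(f"max abs error: {max_err:.4f}")  # bounded by half a quantization step
```

Each group stores only 4 bits per weight plus one shared scale, which is where the memory savings over FP16 come from; the reconstruction error stays within half a quantization step per weight.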
## Core Capabilities
- Multi-language support across 8 languages
- Optimized for commercial and research applications
- Efficient inference on NVIDIA Ampere and newer GPUs
- Reduced memory footprint through INT4 quantization
- Compatible with Windows operating systems
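A back-of-envelope calculation shows why INT4 quantization brings the model within reach of the 6GB VRAM requirement. The parameter count and group size below are approximations used for illustration, and the estimate covers weights only; the KV cache and activations need additional memory at inference time.

```python
# Rough VRAM estimate for the weights alone (illustrative numbers).

PARAMS = 3.2e9       # ~3.2B parameters (approximate)
GROUP_SIZE = 128     # assumed quantization group size

fp16_gb = PARAMS * 2 / 1e9                    # 2 bytes per weight -> 6.4 GB
int4_gb = PARAMS * 0.5 / 1e9                  # 4 bits per weight
scales_gb = (PARAMS / GROUP_SIZE) * 2 / 1e9   # one FP16 scale per group

print(f"FP16 weights: {fp16_gb:.1f} GB")
print(f"INT4 weights + scales: {int4_gb + scales_gb:.1f} GB")
```

Even with the per-group scale overhead, the quantized weights take roughly a quarter of the FP16 footprint, which is what makes 6GB-class GPUs viable.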
## Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its INT4 quantization, which significantly reduces memory requirements while retaining good accuracy. It is optimized specifically for NVIDIA GPUs and offers a practical way to deploy a capable language model in resource-constrained environments.
Q: What are the recommended use cases?
The model is suitable for both commercial and research applications requiring efficient natural language processing. Its multi-language support and optimized performance make it ideal for applications where memory efficiency and processing speed are crucial, particularly on Windows systems with NVIDIA GPUs.